├── Article 14 One.JPG
├── Article 14 Two.JPG
├── README.md
└── Untitled14.ipynb

/Article 14 One.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LSEG-API-Samples/Article.EikonAPI.Python.NewsSentimentAnalysis/48c65e15292c3ff1cf2610a61587565fc985fd3b/Article 14 One.JPG

--------------------------------------------------------------------------------
/Article 14 Two.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LSEG-API-Samples/Article.EikonAPI.Python.NewsSentimentAnalysis/48c65e15292c3ff1cf2610a61587565fc985fd3b/Article 14 Two.JPG

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Introduction to News Sentiment Analysis with Eikon Data APIs - a Python example

This article will demonstrate how we can conduct a simple sentiment analysis of news delivered via our new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis). Natural Language Processing (NLP) is a big area of interest for those looking to gain insight and new sources of value from the vast quantities of unstructured data out there. The field is complex, and there are many online resources, packages and approaches that can help you familiarise yourself with it. Whilst these are beyond the scope of this article, I will walk through a simple implementation that gives you a swift introduction and a practical codebase for further exploration and learning.

**Pre-requisites:**

**Thomson Reuters Eikon** with access to the new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis)

**Python 2.x/3.x**

**Required Python Packages:** eikon, pandas, numpy, beautifulsoup4, textblob, datetime

**Required corpora download:** `python -m textblob.download_corpora` (this is required by the sentiment engine to generate sentiment)

### Introduction

NLP is a field which enables computers to understand human language (voice or text). It is a large area of research, and a little enquiry on your part will quickly reveal the complexities of the problem set. Here we will be focussing on one application of it, *Sentiment Analysis*. In our case we will be taking news articles (unstructured text) for a particular company, **IBM**, and we will attempt to grade each article as positive, negative or neutral. We will then try to see whether this news has had an impact on the share price of **IBM**.

To do this really well is a non-trivial task, and most universities and financial companies have departments and teams looking at it. We ourselves provide machine-readable news products with News Analytics (such as sentiment) over our **Elektron** platform in real time at very low latency - these products are consumed mainly by *algorithmic applications* rather than *humans*.

We will try to do a similar thing as simply as possible to illustrate the key elements - our task is significantly eased by not having to work in a low-latency environment. We will delegate most of the complexities of actually analysing the text to various packages.
You can then easily replace modules such as the sentiment engine to improve your results as your understanding increases.

So let's get started. First let's load the packages that we will need and set our app key.


```python
import eikon as ek
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from textblob import TextBlob
import datetime
from datetime import time
import warnings
warnings.filterwarnings("ignore")
ek.set_app_key('YOUR APP KEY HERE')
```

There are two API calls for news:

**get_news_headlines** : returns a list of news headlines satisfying a query

**get_news_story** : returns an HTML representation of the full news article

We will need to use both - thankfully they are really straightforward to use. We start with the **get_news_headlines** API call to request a list of headlines. The first parameter for this call is a query. You don't really need to know this query language, as you can generate it using the **News Monitor App** (type **NEWS** into the Eikon search bar) in **Eikon**.

You can see here I have just typed in 2 search terms: **IBM**, for the company, and **English**, for the language I am interested in (in our example we will only be able to analyse English-language text, though there are corpora, packages and methods you can employ to target other languages; these are beyond the scope of this article). You can of course use any search terms you wish.

![News App 1](Article%2014%20One.JPG)

After you have typed in what you want to search for, simply click in the search box and this will generate the query text, which we can then copy and paste into the API call below. It's easy for us to change logical operations such as **AND** to **OR**, or **NOT**, to suit our query.

![News App 2](Article%2014%20Two.JPG)

So the line of code below gets us 100 news headlines for **IBM** in English prior to 4th Dec 2017, and stores them in a dataframe, df, for us.


```python
df = ek.get_news_headlines('R:IBM.N AND Language:LEN', date_to="2017-12-04", count=100)
df.head()
```

| | versionCreated | text | storyId | sourceCode |
|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU |
| 2017-12-01 18:12:41.143 | 2017-12-01 18:12:41.143 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 | NS:EDG |
| 2017-12-01 18:12:41.019 | 2017-12-01 18:12:41.019 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 | NS:EDG |
| 2017-12-01 18:06:03.633 | 2017-12-01 18:06:03.633 | Moody's Affirms Seven Classes of GSMS 2016-GS4 | urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 | NS:RTRS |
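A single **get_news_headlines** request gives us at most the 100 headlines we asked for here. If you later want a bigger sample (see the Observations section below), one simple approach is to page backwards through time, reusing the oldest timestamp of each batch as the next `date_to`. This is only a sketch, under the assumption that `date_to` also accepts full ISO-style timestamps rather than just dates; `get_more_headlines` and `batches` are illustrative names:

```python
import pandas as pd

def get_more_headlines(query, batches=5, date_to='2017-12-04'):
    # page backwards through the archive, up to 100 headlines per call
    frames = []
    for _ in range(batches):
        batch = ek.get_news_headlines(query, date_to=date_to, count=100)
        if batch.empty:
            break
        frames.append(batch)
        # next request ends at the oldest headline we have seen so far
        date_to = batch['versionCreated'].min().strftime('%Y-%m-%dT%H:%M:%S')
    return pd.concat(frames).drop_duplicates(subset='storyId')

# e.g. bigger_df = get_more_headlines('R:IBM.N AND Language:LEN')
```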
I will just add 3 new columns, which we will use later to store some variables.


```python
df['Polarity'] = np.nan
df['Subjectivity'] = np.nan
df['Score'] = np.nan
```

So we have our frame with the most recent 100 news headline items. The headline is stored in the **text** column, and the storyId, which we will now use to pull down the actual articles themselves, is stored in the **storyId** column.

We will now iterate through the headline dataframe and pull down the news articles using the second of our news API calls, get_news_story. We simply pass the **storyId** to this API call and are returned an HTML representation of the article, which allows you to render it nicely. For our purposes, however, we want to strip out the HTML tags and be left with just the plain text, as we don't want to analyse HTML tags for sentiment. We will do this using the excellent **BeautifulSoup** package.

Once we have the text of these articles we can pass them to our sentiment engine, which will give us a sentiment score for each article. So what is our sentiment engine? We will be using the simple **TextBlob** package to demo a rudimentary process and show you how things work. **TextBlob** is a higher-level abstraction package that sits on top of **NLTK** (Natural Language Toolkit), a widely used package for this type of task.

**NLTK** is quite a complex package which gives you a lot of control over the whole analytical process - but the cost of that is complexity and the knowledge required of the steps involved. **TextBlob** shields us from this complexity, but we should at some stage understand what is going on under the hood. Thankfully there is plenty of information to guide us in this. We will be implementing the default **PatternAnalyzer**, which is based on the popular **Pattern** library, though there is also a **NaiveBayesAnalyzer**, an **NLTK** classifier trained on a movie-review corpus.

All of this can be achieved in just a few lines of code. This is quite a dense codeblock, so I have commented the key steps.


```python
for idx, storyId in enumerate(df['storyId'].values):    # for each row in our df dataframe
    newsText = ek.get_news_story(storyId)               # get the news story
    if newsText:
        soup = BeautifulSoup(newsText, "lxml")          # create a BeautifulSoup object from our HTML news article
        sentA = TextBlob(soup.get_text())               # pass the text-only article to TextBlob to analyse
        df['Polarity'].iloc[idx] = sentA.sentiment.polarity            # write sentiment polarity back to df
        df['Subjectivity'].iloc[idx] = sentA.sentiment.subjectivity    # write sentiment subjectivity back to df
        if sentA.sentiment.polarity >= 0.05:            # attribute a bucket to the sentiment polarity
            score = 'positive'
        elif -0.05 < sentA.sentiment.polarity < 0.05:
            score = 'neutral'
        else:
            score = 'negative'
        df['Score'].iloc[idx] = score                   # write score back to df
df.head()
```

| | versionCreated | text | storyId | sourceCode | Polarity | Subjectivity | Score |
|---|---|---|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC | 0.066667 | 0.566667 | positive |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU | 0.055260 | 0.320844 | positive |
| 2017-12-01 18:12:41.143 | 2017-12-01 18:12:41.143 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 | NS:EDG | 0.000000 | 0.000000 | neutral |
| 2017-12-01 18:12:41.019 | 2017-12-01 18:12:41.019 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 | NS:EDG | 0.000000 | 0.000000 | neutral |
| 2017-12-01 18:06:03.633 | 2017-12-01 18:06:03.633 | Moody's Affirms Seven Classes of GSMS 2016-GS4 | urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 | NS:RTRS | 0.175000 | 0.325000 | positive |
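As mentioned above, the sentiment engine is easy to swap out. For instance, **TextBlob** also ships the **NaiveBayesAnalyzer** (trained on a movie-review corpus). A minimal sketch of the swap - note that its `.sentiment` returns a classification with `p_pos`/`p_neg` probabilities rather than a polarity score, so the bucketing logic above would need adapting; the sample sentence is purely illustrative:

```python
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

nb = NaiveBayesAnalyzer()   # build once and reuse; it trains on first use
blob = TextBlob("IBM shares rallied after strong results.", analyzer=nb)
print(blob.sentiment)       # Sentiment(classification='pos'|'neg', p_pos=..., p_neg=...)
```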
Looking at our dataframe we can now see 3 new columns on the right: *Polarity*, *Subjectivity* and *Score*. As we have seen, *Polarity* is the actual sentiment polarity returned from **TextBlob** (ranging from -1 (negative) to +1 (positive)), *Subjectivity* is a measure of how subjective the text is (ranging from 0, very objective, to 1, very subjective), and *Score* is simply a positive, negative or neutral rating based on the strength of the polarity.

We would now like to see what, if any, impact this news has had on the share price of **IBM**. There are many ways of doing this - but to keep things simple, I would like to see what the average return is at various points in time **AFTER** the news has broken. I want to check whether there are *aggregate differences* in the *average returns* of the Positive, Neutral and Negative buckets we created earlier.


```python
start = df['versionCreated'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
end = df['versionCreated'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
Minute = ek.get_timeseries(["IBM.N"], start_date=start, interval="minute")
Minute.tail()
```

| Date (IBM.N) | HIGH | LOW | OPEN | CLOSE | COUNT | VOLUME |
|---|---|---|---|---|---|---|
| 2018-01-05 15:21:00 | 162.32 | 162.18 | 162.22 | 162.31 | 23.0 | 3073.0 |
| 2018-01-05 15:22:00 | 162.42 | 162.29 | 162.31 | 162.42 | 23.0 | 2442.0 |
| 2018-01-05 15:23:00 | 162.46 | 162.43 | 162.45 | 162.46 | 11.0 | 960.0 |
| 2018-01-05 15:24:00 | 162.46 | 162.40 | 162.46 | 162.40 | 5.0 | 505.0 |
| 2018-01-05 15:25:00 | 162.39 | 162.31 | 162.36 | 162.33 | 12.0 | 1060.0 |
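The return measure we are about to compute is just the simple percentage change from the OPEN of the minute bar in which the news arrived to the CLOSE of the bar *n* minutes later. As a worked example using the bars above, taking 15:21 as the "news" minute and a 2-minute horizon:

```python
t0 = 162.22                    # OPEN of the 2018-01-05 15:21:00 bar
t2 = 162.46                    # CLOSE of the 15:23:00 bar, two minutes later
print((t2 / t0 - 1) * 100)     # ~0.148, i.e. roughly +0.15% over two minutes
```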
We will need to create some new columns for the next part of this analysis.


```python
df['twoM'] = np.nan
df['fiveM'] = np.nan
df['tenM'] = np.nan
df['thirtyM'] = np.nan
df.head(2)
```

| | versionCreated | text | storyId | sourceCode | Polarity | Subjectivity | Score | twoM | fiveM | tenM | thirtyM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC | 0.066667 | 0.566667 | positive | NaN | NaN | NaN | NaN |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU | 0.055260 | 0.320844 | positive | NaN | NaN | NaN | NaN |
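A quick illustration of the timestamp handling described next: we truncate each news time to the minute, and that truncated stamp either exists in the minute-bar index (the market was open) or it does not (out-of-hours news, which ends up being skipped). A small sketch using the CNBC headline time above, whose returns are NaN for exactly this reason:

```python
sample = pd.Timestamp('2017-12-01 23:11:47.374')        # the CNBC headline above
minute_stamp = sample.replace(second=0, microsecond=0)  # -> 2017-12-01 23:11:00
print(minute_stamp in Minute.index)                     # False - no bar outside market hours
```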
OK, so now I just need to get the timestamp of each news item, truncate it to minute granularity (i.e. remove the second and microsecond components), get the base share price of **IBM** at that time and at several intervals after that time - in our case *t+2 mins, t+5 mins, t+10 mins, t+30 mins* - and calculate the % change for each interval.

An important point to bear in mind here is that news can be generated at any time, 24 hours a day, including outside normal market hours. So for news generated outside normal market hours for **IBM**, we would have to wait until the next market opening to conduct our calculations. Of course there are a number of issues here concerning our ability to attribute price movement to our news item in isolation (basically, we cannot). That said, there might be other ways of approaching this - for example looking at **GDRs/ADRs** or surrogates - but these are beyond the scope of this introductory article. In our example, these news items are simply discarded.

We will now loop through each news item in the dataframe, calculate (where possible) the derived performance numbers, and store them in the columns we created earlier: twoM...thirtyM.


```python
for idx, newsDate in enumerate(df['versionCreated'].values):
    sTime = df['versionCreated'][idx]
    sTime = sTime.replace(second=0, microsecond=0)
    try:
        t0 = Minute.iloc[Minute.index.get_loc(sTime), 2]    # OPEN of the news minute (column 2)
        df['twoM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=2)), 3] / t0 - 1) * 100)
        df['fiveM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=5)), 3] / t0 - 1) * 100)
        df['tenM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=10)), 3] / t0 - 1) * 100)
        df['thirtyM'][idx] = ((Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=30)), 3] / t0 - 1) * 100)
    except KeyError:    # minute bar not found, e.g. news outside market hours - discard
        pass
df.head()
```

| | versionCreated | text | storyId | sourceCode | Polarity | Subjectivity | Score | twoM | fiveM | tenM | thirtyM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-12-01 23:11:47.374 | 2017-12-01 23:11:47.374 | Reuters Insider - FM Final Trade: HAL, TWTR & ... | urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 | NS:CNBC | 0.066667 | 0.566667 | positive | NaN | NaN | NaN | NaN |
| 2017-12-01 19:19:20.279 | 2017-12-01 19:19:20.279 | IBM ST: the upside prevails as long as 150.2 i... | urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 | NS:GURU | 0.055260 | 0.320844 | positive | 0.071119 | 0.084050 | 0.000000 | -0.109911 |
| 2017-12-01 18:12:41.143 | 2017-12-01 18:12:41.143 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 | NS:EDG | 0.000000 | 0.000000 | neutral | 0.012944 | -0.090609 | -0.032360 | 0.148858 |
| 2017-12-01 18:12:41.019 | 2017-12-01 18:12:41.019 | INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... | urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 | NS:EDG | 0.000000 | 0.000000 | neutral | 0.012944 | -0.090609 | -0.032360 | 0.148858 |
| 2017-12-01 18:06:03.633 | 2017-12-01 18:06:03.633 | Moody's Affirms Seven Classes of GSMS 2016-GS4 | urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 | NS:RTRS | 0.175000 | 0.325000 | positive | 0.097238 | 0.155581 | 0.097238 | 0.246337 |
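One caveat on the codeblock above: writes of the form `df['twoM'][idx] = ...` are chained assignment, which pandas warns about (one reason warnings were silenced at the top) and which newer versions of pandas may not write back at all. A more robust equivalent, sketched here for the two-minute column using the same loop variables:

```python
twoM_pos = df.columns.get_loc('twoM')    # positional index of the target column
# inside the loop, instead of df['twoM'][idx] = ...:
df.iat[idx, twoM_pos] = (Minute.iloc[Minute.index.get_loc(sTime + datetime.timedelta(minutes=2)), 3] / t0 - 1) * 100
```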
Fantastic - we have now completed the analytical part of our study. Finally, we just need to aggregate our results by *Score* bucket in order to draw some conclusions.


```python
grouped = df.groupby(['Score']).mean()
grouped
```

| Score | Polarity | Subjectivity | twoM | fiveM | tenM | thirtyM |
|---|---|---|---|---|---|---|
| negative | -0.146508 | 0.316746 | NaN | NaN | NaN | NaN |
| neutral | 0.006436 | 0.175766 | -0.004829 | -0.009502 | 0.028979 | 0.137544 |
| positive | 0.129260 | 0.406868 | 0.012089 | 0.012776 | 0.035936 | 0.047345 |
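Before reading too much into these averages, it is worth checking how many stories actually sit in each bucket, and how many of those survived the market-hours filter - the negative row shows NaN returns above precisely because none of its stories had usable price data:

```python
print(df['Score'].value_counts())             # headlines per sentiment bucket
print(df.groupby('Score')['twoM'].count())    # per bucket, how many had usable price data
```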
### Observations

From our initial results, it would appear that there might be some small directional differences in returns between the positive and neutral groups over shorter time frames (twoM and fiveM) after news broke. This is a reasonable basis for further investigation. So where could we go from here?

We have a relatively small *n* here, so we might want to increase the size of the study.

We might also want to try to separate out more positive or negative news - i.e. change the thresholds of the buckets - to try to identify articles with more prominent sentiment; maybe those could have more of an impact on performance.

In terms of capturing news impact, we threw out a lot of news articles because they happened outside of market hours, where it is more complex to ascertain impact. We might try to find a way of including some of these in our analysis - I mentioned looking at overseas listings, **GDRs/ADRs** or surrogates above. Alternatively, we could use **EXACTLY** the same process to look at all news for an index future - say the **S&P 500 e-mini** - as this trades on Globex pretty much round the clock, so we would be throwing out far fewer news articles. Great, I hear you cry - but would each news article be able to influence a whole index? Are index futures more sensitive to some types of articles than others? Is there a temporal element to this? These are all excellent questions. Or what about cryptocurrencies, which trade 24/7? And so on.

We could also investigate what is going on with our sentiment engine. We might be able to generate more meaningful results by tinkering with the underlying processes and parameters. Using a different, more domain-specific corpus might help us to generate more relevant scores.

You will see there is plenty of scope to get much more involved here.

This article was intended as an introduction to this most interesting of areas. I hope to have demystified it for you somewhat and shown how it is possible to get started with this type of complex analysis - using only a few lines of code, a simple, easy-to-use yet powerful API and some really fantastic packages - to generate some meaningful results.

--------------------------------------------------------------------------------
/Untitled14.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Introduction to News Sentiment Analysis with Eikon Data APIs - a Python example\n",
8 |     "\n",
9 |     "This article will demonstrate how we can conduct a simple sentiment analysis of news delivered via our new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis). Natural Language Processing (NLP) is a big area of interest for those looking to gain insight and new sources of value from the vast quantities of unstructured data out there. The area is quite complex and there are many resources online that can help you familiarise yourself with this very interesting area. There are also many different packages that can help you as well as many different approaches to this problem.
Whilst these are beyond the scope of this article - I will go through a simple implementation which will give you a swift enough introduction and practical codebase for further exploration and learning.\n",
10 |     "\n",
11 |     "**Pre-requisites:** \n",
12 |     "\n",
13 |     "**Thomson Reuters Eikon** with access to new [Eikon Data APIs](https://developers.thomsonreuters.com/eikon-data-apis)\n",
14 |     "\n",
15 |     "**Python 2.x/3.x**\n",
16 |     "\n",
17 |     "**Required Python Packages:** eikon, pandas, numpy, beautifulsoup4, textblob, datetime \n",
18 |     "\n",
19 |     "**Required corpora download:** >>>python -m textblob.download_corpora (this is required by the sentiment engine to generate sentiment)\n",
20 |     "\n",
21 |     "### Introduction\n",
22 |     "\n",
23 |     "NLP is a field which enables computers to understand human language (voice or text). This is quite a big area of research and a little enquiry on your part will furnish you with the complexities of this problem set. Here we will be focussing on one application of this called *Sentiment Analysis*. In our case we will be taking news articles (unstructured text) for a particular company, **IBM**, and we will attempt to grade this news to see how positive, negative or neutral it is. We will then try to see if this news has had an impact on the share price of **IBM**. \n",
24 |     "\n",
25 |     "To do this really well is a non-trivial task, and most universities and financial companies will have departments and teams looking at this. We ourselves provide machine readable news products with News Analytics (such as sentiment) over our **Elektron** platform in realtime at very low latency - these products are essentially consumed by *algorithmic applications* as opposed to *humans*. \n",
26 |     "\n",
27 |     "We will try to do a similar thing as simply as possible to illustrate the key elements - our task is significantly eased by not having to do this in a low latency environment. We will be abstracting most of the complexities to do with the mechanics of actually analysing the text to various packages. You can then easily replace the modules such as the sentiment engine etc to improve your results as your understanding increases. \n",
28 |     "\n",
29 |     "So let's get started. First let's load the packages that we will need to use and set our app key. "
30 |    ]
31 |   },
32 |   {
33 |    "cell_type": "code",
34 |    "execution_count": 1,
35 |    "metadata": {
36 |     "collapsed": true
37 |    },
38 |    "outputs": [],
39 |    "source": [
40 |     "import eikon as ek\n",
41 |     "import pandas as pd\n",
42 |     "import numpy as np\n",
43 |     "from bs4 import BeautifulSoup\n",
44 |     "from textblob import TextBlob\n",
45 |     "import datetime\n",
46 |     "from datetime import time\n",
47 |     "import warnings\n",
48 |     "warnings.filterwarnings(\"ignore\")\n",
49 |     "ek.set_app_key('YOUR APP KEY HERE')"
50 |    ]
51 |   },
52 |   {
53 |    "cell_type": "markdown",
54 |    "metadata": {},
55 |    "source": [
56 |     "There are two API calls for news:\n",
57 |     "\n",
58 |     "**get_news_headlines** : returns a list of news headlines satisfying a query\n",
59 |     "\n",
60 |     "**get_news_story** : returns an HTML representation of the full news article\n",
61 |     "\n",
62 |     "We will need to use both - thankfully they are really straightforward to use. We will need to use the **get_news_headlines** API call to request a list of headlines. The first parameter for this call is a query. You don't really need to know this query language as you can generate it using the **News Monitor App** (type **NEWS** into Eikon search bar) in **Eikon**.
\n", 63 | "\n", 64 | "You can see here I have just typed in 2 search terms, **IBM**, for the company, and, **English**, for the language I am interested in (in our example we will only be able to analyse English language text - though there are corpora, packages, methods you can employ to target other languages - though these are beyond the scope of this article). You can of course use any search terms you wish.\n", 65 | "\n", 66 | "![News App 1](Article 14 One.jpg)\n", 67 | "\n", 68 | "After you have typed in what you want to search for - we can simply click in the search box and this will then generate the query text which we can then copy and paste into the API call below. Its easy for us to change logical operations such as **AND** to **OR**, **NOT** to suit our query. \n", 69 | "\n", 70 | "![News App 2](Article 14 Two.png)\n", 71 | "\n", 72 | "So the line of code below gets us 100 news headlines for **IBM** in english prior to 4th Dec 2017, and stores them in a dataframe, df for us." 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 2, 78 | "metadata": { 79 | "collapsed": false 80 | }, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/html": [ 85 | "
\n", 86 | "\n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | "
versionCreatedtextstoryIdsourceCode
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU
2017-12-01 18:12:41.1432017-12-01 18:12:41.143INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL6JJfRT:1NS:EDG
2017-12-01 18:12:41.0192017-12-01 18:12:41.019INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL3YHcVY:1NS:EDG
2017-12-01 18:06:03.6332017-12-01 18:06:03.633Moody's Affirms Seven Classes of GSMS 2016-GS4urn:newsml:reuters.com:20171201:nMDY7wGNTP:1NS:RTRS
\n", 134 | "
" 135 | ], 136 | "text/plain": [ 137 | " versionCreated \\\n", 138 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 139 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 140 | "2017-12-01 18:12:41.143 2017-12-01 18:12:41.143 \n", 141 | "2017-12-01 18:12:41.019 2017-12-01 18:12:41.019 \n", 142 | "2017-12-01 18:06:03.633 2017-12-01 18:06:03.633 \n", 143 | "\n", 144 | " text \\\n", 145 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 146 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 147 | "2017-12-01 18:12:41.143 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 148 | "2017-12-01 18:12:41.019 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 149 | "2017-12-01 18:06:03.633 Moody's Affirms Seven Classes of GSMS 2016-GS4 \n", 150 | "\n", 151 | " storyId \\\n", 152 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 153 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 154 | "2017-12-01 18:12:41.143 urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 \n", 155 | "2017-12-01 18:12:41.019 urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 \n", 156 | "2017-12-01 18:06:03.633 urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 \n", 157 | "\n", 158 | " sourceCode \n", 159 | "2017-12-01 23:11:47.374 NS:CNBC \n", 160 | "2017-12-01 19:19:20.279 NS:GURU \n", 161 | "2017-12-01 18:12:41.143 NS:EDG \n", 162 | "2017-12-01 18:12:41.019 NS:EDG \n", 163 | "2017-12-01 18:06:03.633 NS:RTRS " 164 | ] 165 | }, 166 | "execution_count": 2, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "df = ek.get_news_headlines('R:IBM.N AND Language:LEN', date_to = \"2017-12-04\", count=100)\n", 173 | "df.head()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "I will just add 3 new columns which we will need to store some variables in later." 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 3, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "df['Polarity'] = np.nan\n", 192 | "df['Subjectivity'] = np.nan\n", 193 | "df['Score'] = np.nan" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "metadata": {}, 199 | "source": [ 200 | "So we have our frame with the most recent 100 news headline items. The headline is stored in the **text** column and the storyID which we will now use to pull down the actual articles themselves, is stored in the **storyID** column. \n", 201 | "\n", 202 | "We will now iterate through the headline dataframe and pull down the news articles using the second of our news API calls, get_news_story. We simply pass the **storyID** to this API call and we are returned a HTML representation of the article - which allows you to render them nicely etc - however for our purposes we want to strip the HTML tags etc out and just be left with the plain text - as we dont want to analyse HTML tags for sentiment. We will do this using the excellent **BeautifulSoup** package.\n", 203 | "\n", 204 | "Once we have the text of these articles we can pass them to our sentiment engine which will give us a sentiment score for each article. So what is our sentiment engine? We will be using the simple **TextBlob** package to demo a rudimentary process to show you how things work. 
**TextBlob** is a higher level abstraction package that sits on top of **NLTK** (Natural Language Toolkit) which is a widely used package for this type of task. \n",
205 |     "\n",
206 |     "**NLTK** is quite a complex package which gives you a lot of control over the whole analytical process - but the cost of that is complexity and required knowledge of the steps involved. **TextBlob** shields us from this complexity, but we should at some stage understand what is going on under the hood. Thankfully there is plenty of information to guide us in this. We will be implementing the default **PatternAnalyzer**, which is based on the popular **Pattern** library, though there is also a **NaiveBayesAnalyzer** which is an **NLTK** classifier based on a movie review corpus. \n",
207 |     "\n",
208 |     "All of this can be achieved in just a few lines of code. This is quite a dense codeblock - so I have commented the key steps. "
209 |    ]
210 |   },
211 |   {
212 |    "cell_type": "code",
213 |    "execution_count": 4,
214 |    "metadata": {
215 |     "collapsed": false
216 |    },
217 |    "outputs": [
218 |     {
219 |      "data": {
220 |       "text/html": [
221 |       "
\n", 222 | "\n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | "
versionCreatedtextstoryIdsourceCodePolaritySubjectivityScore
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC0.0666670.566667positive
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU0.0552600.320844positive
2017-12-01 18:12:41.1432017-12-01 18:12:41.143INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL6JJfRT:1NS:EDG0.0000000.000000neutral
2017-12-01 18:12:41.0192017-12-01 18:12:41.019INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL3YHcVY:1NS:EDG0.0000000.000000neutral
2017-12-01 18:06:03.6332017-12-01 18:06:03.633Moody's Affirms Seven Classes of GSMS 2016-GS4urn:newsml:reuters.com:20171201:nMDY7wGNTP:1NS:RTRS0.1750000.325000positive
\n", 288 | "
" 289 | ], 290 | "text/plain": [ 291 | " versionCreated \\\n", 292 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 293 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 294 | "2017-12-01 18:12:41.143 2017-12-01 18:12:41.143 \n", 295 | "2017-12-01 18:12:41.019 2017-12-01 18:12:41.019 \n", 296 | "2017-12-01 18:06:03.633 2017-12-01 18:06:03.633 \n", 297 | "\n", 298 | " text \\\n", 299 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 300 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 301 | "2017-12-01 18:12:41.143 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 302 | "2017-12-01 18:12:41.019 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 303 | "2017-12-01 18:06:03.633 Moody's Affirms Seven Classes of GSMS 2016-GS4 \n", 304 | "\n", 305 | " storyId \\\n", 306 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 307 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 308 | "2017-12-01 18:12:41.143 urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 \n", 309 | "2017-12-01 18:12:41.019 urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 \n", 310 | "2017-12-01 18:06:03.633 urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 \n", 311 | "\n", 312 | " sourceCode Polarity Subjectivity Score \n", 313 | "2017-12-01 23:11:47.374 NS:CNBC 0.066667 0.566667 positive \n", 314 | "2017-12-01 19:19:20.279 NS:GURU 0.055260 0.320844 positive \n", 315 | "2017-12-01 18:12:41.143 NS:EDG 0.000000 0.000000 neutral \n", 316 | "2017-12-01 18:12:41.019 NS:EDG 0.000000 0.000000 neutral \n", 317 | "2017-12-01 18:06:03.633 NS:RTRS 0.175000 0.325000 positive " 318 | ] 319 | }, 320 | "execution_count": 4, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "for idx, storyId in enumerate(df['storyId'].values): #for each row in our df dataframe\n", 327 | " newsText = ek.get_news_story(storyId) #get the news story\n", 328 | " if newsText:\n", 329 | " soup = BeautifulSoup(newsText,\"lxml\") #create a BeautifulSoup object from our HTML news article\n", 330 | " sentA = TextBlob(soup.get_text()) #pass the text only article to TextBlob to anaylse\n", 331 | " df['Polarity'].iloc[idx] = sentA.sentiment.polarity #write sentiment polarity back to df\n", 332 | " df['Subjectivity'].iloc[idx] = sentA.sentiment.subjectivity #write sentiment subjectivity score back to df\n", 333 | " if sentA.sentiment.polarity >= 0.05: # attribute bucket to sentiment polartiy\n", 334 | " score = 'positive'\n", 335 | " elif -.05 < sentA.sentiment.polarity < 0.05:\n", 336 | " score = 'neutral'\n", 337 | " else:\n", 338 | " score = 'negative'\n", 339 | " df['Score'].iloc[idx] = score #write score back to df\n", 340 | "df.head()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "markdown", 345 | "metadata": {}, 346 | "source": [ 347 | "Looking at our dataframe we can now see 3 new columns on the right, *Polarity*, *Subjectivity* and *Score*. As we have seen *Polarity* is the actual sentiment polarity returned from **TextBlob** (ranging from -1(negative) to +1(positive), *Subjectivity* is a measure (ranging from 0 to 1) where 0 is very objective and 1 is very subjective, and *Score* is simply a Positive, Negative or Neutral rating based on the strength of the polarities. \n", 348 | "\n", 349 | "We would now like to see what, if any, impact this news has had on the shareprice of **IBM**. 
There are many ways of doing this - but to make things simple, I would like to see what the average return is at various points in time **AFTER** the news has broken. I want to check if there are *aggregate differences* in the *average returns* from the Positive, Neutral and Negative buckets we created earlier." 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 5, 355 | "metadata": { 356 | "collapsed": false 357 | }, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/html": [ 362 | "
\n", 363 | "\n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | "
IBM.NHIGHLOWOPENCLOSECOUNTVOLUME
Date
2018-01-05 15:21:00162.32162.18162.22162.3123.03073.0
2018-01-05 15:22:00162.42162.29162.31162.4223.02442.0
2018-01-05 15:23:00162.46162.43162.45162.4611.0960.0
2018-01-05 15:24:00162.46162.40162.46162.405.0505.0
2018-01-05 15:25:00162.39162.31162.36162.3312.01060.0
\n", 432 | "
" 433 | ], 434 | "text/plain": [ 435 | "IBM.N HIGH LOW OPEN CLOSE COUNT VOLUME\n", 436 | "Date \n", 437 | "2018-01-05 15:21:00 162.32 162.18 162.22 162.31 23.0 3073.0\n", 438 | "2018-01-05 15:22:00 162.42 162.29 162.31 162.42 23.0 2442.0\n", 439 | "2018-01-05 15:23:00 162.46 162.43 162.45 162.46 11.0 960.0\n", 440 | "2018-01-05 15:24:00 162.46 162.40 162.46 162.40 5.0 505.0\n", 441 | "2018-01-05 15:25:00 162.39 162.31 162.36 162.33 12.0 1060.0" 442 | ] 443 | }, 444 | "execution_count": 5, 445 | "metadata": {}, 446 | "output_type": "execute_result" 447 | } 448 | ], 449 | "source": [ 450 | "start = df['versionCreated'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')\n", 451 | "end = df['versionCreated'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')\n", 452 | "Minute = ek.get_timeseries([\"IBM.N\"], start_date=start, interval=\"minute\")\n", 453 | "Minute.tail()" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "We will need to create some new columns for the next part of this analysis." 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 6, 466 | "metadata": { 467 | "collapsed": false 468 | }, 469 | "outputs": [ 470 | { 471 | "data": { 472 | "text/html": [ 473 | "
\n", 474 | "\n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | "
versionCreatedtextstoryIdsourceCodePolaritySubjectivityScoretwoMfiveMtenMthirtyM
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC0.0666670.566667positiveNaNNaNNaNNaN
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU0.0552600.320844positiveNaNNaNNaNNaN
\n", 522 | "
" 523 | ], 524 | "text/plain": [ 525 | " versionCreated \\\n", 526 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 527 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 528 | "\n", 529 | " text \\\n", 530 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 531 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 532 | "\n", 533 | " storyId \\\n", 534 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 535 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 536 | "\n", 537 | " sourceCode Polarity Subjectivity Score twoM \\\n", 538 | "2017-12-01 23:11:47.374 NS:CNBC 0.066667 0.566667 positive NaN \n", 539 | "2017-12-01 19:19:20.279 NS:GURU 0.055260 0.320844 positive NaN \n", 540 | "\n", 541 | " fiveM tenM thirtyM \n", 542 | "2017-12-01 23:11:47.374 NaN NaN NaN \n", 543 | "2017-12-01 19:19:20.279 NaN NaN NaN " 544 | ] 545 | }, 546 | "execution_count": 6, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "df['twoM'] = np.nan\n", 553 | "df['fiveM'] = np.nan\n", 554 | "df['tenM'] = np.nan\n", 555 | "df['thirtyM'] = np.nan\n", 556 | "df.head(2)" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "OK so I now just need to get the timestamp of each news item, truncate it to minute data (ie remove second and microsecond components) and get the base shareprice of **IBM** at that time, and at several itervals after that time, in our case *t+2 mins,t+5 mins, t+10 mins, t+30 mins*, calculating the % change for each interval. \n", 564 | "\n", 565 | "An important point to bear in mind here is that news can be generated at anytime - 24 hours a day - outside of normal market hours. So for news generated outside normal market hours for **IBM** in our case, we would have to wait until the next market opening to conduct our calculations. Of course there are a number of issues here concerning our ability to attribute price movement to our news item in isolation (basically we cannot). That said, there might be other ways of doing this - for example looking at **GDRs/ADRs** or surrogates etc - these are beyond the scope of this introductory article. In our example, these news items are simply discarded. \n", 566 | "\n", 567 | "We will now loop through each news item in the dataframe, calculate (where possible) and store the derived performance numbers in the columns we created earlier: twoM...thirtyM." 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 7, 573 | "metadata": { 574 | "collapsed": false 575 | }, 576 | "outputs": [ 577 | { 578 | "data": { 579 | "text/html": [ 580 | "
\n", 581 | "\n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | "
versionCreatedtextstoryIdsourceCodePolaritySubjectivityScoretwoMfiveMtenMthirtyM
2017-12-01 23:11:47.3742017-12-01 23:11:47.374Reuters Insider - FM Final Trade: HAL, TWTR & ...urn:newsml:reuters.com:20171201:nRTV8KBb1N:1NS:CNBC0.0666670.566667positiveNaNNaNNaNNaN
2017-12-01 19:19:20.2792017-12-01 19:19:20.279IBM ST: the upside prevails as long as 150.2 i...urn:newsml:reuters.com:20171201:nGUR2R6xQ:1NS:GURU0.0552600.320844positive0.0711190.0840500.000000-0.109911
2017-12-01 18:12:41.1432017-12-01 18:12:41.143INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL6JJfRT:1NS:EDG0.0000000.000000neutral0.012944-0.090609-0.0323600.148858
2017-12-01 18:12:41.0192017-12-01 18:12:41.019INTERNATIONAL BUSINESS MACHINES CORP SEC Filin...urn:newsml:reuters.com:20171201:nEOL3YHcVY:1NS:EDG0.0000000.000000neutral0.012944-0.090609-0.0323600.148858
2017-12-01 18:06:03.6332017-12-01 18:06:03.633Moody's Affirms Seven Classes of GSMS 2016-GS4urn:newsml:reuters.com:20171201:nMDY7wGNTP:1NS:RTRS0.1750000.325000positive0.0972380.1555810.0972380.246337
\n", 671 | "
" 672 | ], 673 | "text/plain": [ 674 | " versionCreated \\\n", 675 | "2017-12-01 23:11:47.374 2017-12-01 23:11:47.374 \n", 676 | "2017-12-01 19:19:20.279 2017-12-01 19:19:20.279 \n", 677 | "2017-12-01 18:12:41.143 2017-12-01 18:12:41.143 \n", 678 | "2017-12-01 18:12:41.019 2017-12-01 18:12:41.019 \n", 679 | "2017-12-01 18:06:03.633 2017-12-01 18:06:03.633 \n", 680 | "\n", 681 | " text \\\n", 682 | "2017-12-01 23:11:47.374 Reuters Insider - FM Final Trade: HAL, TWTR & ... \n", 683 | "2017-12-01 19:19:20.279 IBM ST: the upside prevails as long as 150.2 i... \n", 684 | "2017-12-01 18:12:41.143 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 685 | "2017-12-01 18:12:41.019 INTERNATIONAL BUSINESS MACHINES CORP SEC Filin... \n", 686 | "2017-12-01 18:06:03.633 Moody's Affirms Seven Classes of GSMS 2016-GS4 \n", 687 | "\n", 688 | " storyId \\\n", 689 | "2017-12-01 23:11:47.374 urn:newsml:reuters.com:20171201:nRTV8KBb1N:1 \n", 690 | "2017-12-01 19:19:20.279 urn:newsml:reuters.com:20171201:nGUR2R6xQ:1 \n", 691 | "2017-12-01 18:12:41.143 urn:newsml:reuters.com:20171201:nEOL6JJfRT:1 \n", 692 | "2017-12-01 18:12:41.019 urn:newsml:reuters.com:20171201:nEOL3YHcVY:1 \n", 693 | "2017-12-01 18:06:03.633 urn:newsml:reuters.com:20171201:nMDY7wGNTP:1 \n", 694 | "\n", 695 | " sourceCode Polarity Subjectivity Score \\\n", 696 | "2017-12-01 23:11:47.374 NS:CNBC 0.066667 0.566667 positive \n", 697 | "2017-12-01 19:19:20.279 NS:GURU 0.055260 0.320844 positive \n", 698 | "2017-12-01 18:12:41.143 NS:EDG 0.000000 0.000000 neutral \n", 699 | "2017-12-01 18:12:41.019 NS:EDG 0.000000 0.000000 neutral \n", 700 | "2017-12-01 18:06:03.633 NS:RTRS 0.175000 0.325000 positive \n", 701 | "\n", 702 | " twoM fiveM tenM thirtyM \n", 703 | "2017-12-01 23:11:47.374 NaN NaN NaN NaN \n", 704 | "2017-12-01 19:19:20.279 0.071119 0.084050 0.000000 -0.109911 \n", 705 | "2017-12-01 18:12:41.143 0.012944 -0.090609 -0.032360 0.148858 \n", 706 | "2017-12-01 18:12:41.019 0.012944 -0.090609 -0.032360 0.148858 \n", 707 | "2017-12-01 18:06:03.633 0.097238 0.155581 0.097238 0.246337 " 708 | ] 709 | }, 710 | "execution_count": 7, 711 | "metadata": {}, 712 | "output_type": "execute_result" 713 | } 714 | ], 715 | "source": [ 716 | "for idx, newsDate in enumerate(df['versionCreated'].values):\n", 717 | " sTime = df['versionCreated'][idx]\n", 718 | " sTime = sTime.replace(second=0,microsecond=0)\n", 719 | " try:\n", 720 | " t0 = Minute.iloc[Minute.index.get_loc(sTime),2]\n", 721 | " df['twoM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=2))),3]/(t0)-1)*100)\n", 722 | " df['fiveM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=5))),3]/(t0)-1)*100)\n", 723 | " df['tenM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100) \n", 724 | " df['thirtyM'][idx] = ((Minute.iloc[Minute.index.get_loc((sTime + datetime.timedelta(minutes=30))),3]/(t0)-1)*100)\n", 725 | " except:\n", 726 | " pass\n", 727 | "df.head()" 728 | ] 729 | }, 730 | { 731 | "cell_type": "markdown", 732 | "metadata": {}, 733 | "source": [ 734 | "Fantastic - we have now completed the analytical part of our study. Finally, we just need to aggregate our results by *Score* bucket in order to draw some conclusions. " 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 8, 740 | "metadata": { 741 | "collapsed": false 742 | }, 743 | "outputs": [ 744 | { 745 | "data": { 746 | "text/html": [ 747 | "
\n", 748 | "\n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | "
PolaritySubjectivitytwoMfiveMtenMthirtyM
Score
negative-0.1465080.316746NaNNaNNaNNaN
neutral0.0064360.175766-0.004829-0.0095020.0289790.137544
positive0.1292600.4068680.0120890.0127760.0359360.047345
\n", 799 | "
" 800 | ], 801 | "text/plain": [ 802 | " Polarity Subjectivity twoM fiveM tenM thirtyM\n", 803 | "Score \n", 804 | "negative -0.146508 0.316746 NaN NaN NaN NaN\n", 805 | "neutral 0.006436 0.175766 -0.004829 -0.009502 0.028979 0.137544\n", 806 | "positive 0.129260 0.406868 0.012089 0.012776 0.035936 0.047345" 807 | ] 808 | }, 809 | "execution_count": 8, 810 | "metadata": {}, 811 | "output_type": "execute_result" 812 | } 813 | ], 814 | "source": [ 815 | "grouped = df.groupby(['Score']).mean()\n", 816 | "grouped" 817 | ] 818 | }, 819 | { 820 | "cell_type": "markdown", 821 | "metadata": {}, 822 | "source": [ 823 | "### Observations\n", 824 | "\n", 825 | "From our initial results - it would appear that there might be some small directional differences in returns between the positive and neutral groups over shorter time frames (twoM and fiveM) after news broke. This is a pretty good basis for further investigation. So where could we go from here?\n", 826 | "\n", 827 | "We have a relatively small *n* here so we might want to increase the size of the study. \n", 828 | "\n", 829 | "We might also want to try to seperate out more positive or negative news - ie change the threshold of the buckets to try to identify more prominent sentiment articles - maybe that could have more of an impact on performance. \n", 830 | "\n", 831 | "In terms of capturing news impact - we have thrown a lot of news articles out as they happened outside of market hours - as it is more complex to ascertain impact - we might try to find a way of including some of this in our analysis - I mentioned looking at overseas listings **GDR/ADRs** or surrogates above. Alternatively, we could using **EXACTLY** the same process looking at all news for an index future - say the **S&P500 emini** - as this trades on Globex pretty much round the clock - so we would be throwing out a lot less of the news articles? Great I hear you cry - but would each news article be able to influence a whole index? Are index futures more sensitive to some types of articles than others? Is there a temporal element to this? These are all excellent questions. Or what about cryptocrurrencies? They trade 24/7? and so on.\n", 832 | "\n", 833 | "We could also investigate what is going on with our sentiment engine. We might be able to generate more meaningful results by tinkering with the underlyng processes and parameters. Using a different, more domain-specific corpora might help us to generate more relevant scores. \n", 834 | "\n", 835 | "You will see there is plenty of scope to get much more involved here. \n", 836 | "\n", 837 | "This article was intended as an introduction to this most interesting of areas. I hope to have de-mystified this area for you somewhat and shown how it is possible to get started with this type of complex analysis using only a few lines of code, a simple easy to use yet powerfull API and some really fantastic packages, to generate some meaningful results." 
838 | ] 839 | } 840 | ], 841 | "metadata": { 842 | "kernelspec": { 843 | "display_name": "Python 2", 844 | "language": "python", 845 | "name": "python2" 846 | }, 847 | "language_info": { 848 | "codemirror_mode": { 849 | "name": "ipython", 850 | "version": 2 851 | }, 852 | "file_extension": ".py", 853 | "mimetype": "text/x-python", 854 | "name": "python", 855 | "nbconvert_exporter": "python", 856 | "pygments_lexer": "ipython2", 857 | "version": "2.7.11" 858 | } 859 | }, 860 | "nbformat": 4, 861 | "nbformat_minor": 0 862 | } 863 | --------------------------------------------------------------------------------