├── requirements.txt ├── sentiment_analyzer.py ├── LICENSE ├── .gitignore ├── config.py ├── news_fetcher.py ├── topic_modeler.py ├── main.py └── README.md /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DanielPaulDsouza/ibkr-news-analyzer/HEAD/requirements.txt -------------------------------------------------------------------------------- /sentiment_analyzer.py: -------------------------------------------------------------------------------- 1 | from textblob import TextBlob 2 | 3 | def analyze_sentiment(text: str) -> tuple[str, float]: 4 | """ 5 | Analyzes the sentiment of a given text. 6 | 7 | Args: 8 | text: The text (headline or article) to analyze. 9 | 10 | Returns: 11 | A tuple containing the sentiment label ('Positive', 'Negative', 'Neutral') 12 | and the polarity score (from -1.0 to 1.0). 13 | """ 14 | if not text: 15 | return 'Neutral', 0.0 16 | 17 | # Create a TextBlob object 18 | analysis = TextBlob(text) 19 | 20 | # Get the polarity score 21 | polarity = analysis.sentiment.polarity 22 | 23 | # Classify the sentiment based on the polarity score 24 | if polarity > 0.1: 25 | sentiment = 'Positive' 26 | elif polarity < -0.1: 27 | sentiment = 'Negative' 28 | else: 29 | sentiment = 'Neutral' 30 | 31 | return sentiment, polarity -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2025 Daniel Paul Dsouza 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .nox/ 42 | .coverage 43 | .coverage.* 44 | .cache 45 | nosetests.xml 46 | coverage.xml 47 | *.cover 48 | .hypothesis/ 49 | .pytest_cache/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | local_settings.py 58 | db.sqlite3 59 | 60 | # Flask stuff: 61 | instance/ 62 | .webassets-cache 63 | 64 | # Scrapy stuff: 65 | .scrapy 66 | 67 | # Sphinx documentation 68 | docs/_build/ 69 | 70 | # PyBuilder 71 | target/ 72 | 73 | # Jupyter Notebook 74 | .ipynb_checkpoints 75 | 76 | # IPython 77 | profile_default/ 78 | ipython_config.py 79 | 80 | # pyenv 81 | .python-version 82 | 83 | # celery 84 | celerybeat-schedule 85 | celerybeat.pid 86 | 87 | # SageMath parsed files 88 | *.sage.py 89 | 90 | # Environments 91 | .env 92 | .venv 93 | env/ 94 | venv/ 95 | ENV/ 96 | env.bak/ 97 | venv.bak/ 98 | 99 | # Spyder project settings 100 | .spyderproject 101 | .spyderworkspace 102 | 103 | # VSCode settings 104 | .vscode/ 105 | 106 | # Our project specific ignores 107 | reports/ 108 | *.csv -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | # ============================================================================= 4 | # USER CONFIGURATION 5 | # ============================================================================= 6 | 7 | # --- Connection Settings --- 8 | # Make sure TWS or IB Gateway is running and API connections are enabled. 9 | IB_HOST = '127.0.0.1' 10 | IB_PORT = 7497 # 7497 for TWS Paper, 7496 for TWS Live, 4002 for IB Gateway Paper, 4001 for IB Gateway Live 11 | CLIENT_ID = 123 # Use a unique client ID for each running script 12 | 13 | # --- Contract To Analyze --- 14 | # The stock/ETF symbol you want to fetch news for. 15 | # In the CONTRACT_SYMBOLS you can add ONLY ONE or MULTIPLE contracts. 16 | CONTRACT_SYMBOLS = ['SPY', 'QQQ', 'DIA'] # Add 1 or many symbols, you want. 17 | CONTRACT_TYPE = 'STK' # 'STK' for stock/ETF, 'FUT' for future, etc. 18 | EXCHANGE = 'SMART' 19 | CURRENCY = 'USD' 20 | 21 | # --- News Search Parameters --- 22 | # If you leave this list empty, then the Matches_Keywords column in the CSV 23 | # will return false. 24 | KEYWORDS_TO_SEARCH = [ 25 | 'earnings', 'fed', 'inflation', 'rate cut', 'geopolitical', 'supply chain', 26 | 'buyback', 'guidance', 'downgrade', 'upgrade' 27 | ] 28 | 29 | # --- Topic Modeling Settings --- 30 | # This defines how many distinct topics the LDA model will try to discover 31 | # in the collection of news articles. There is no single "correct" number. 32 | # - A small number (e.g., 5) will result in very broad, high-level topics. 33 | # - A large number (e.g., 20-30) will result in more specific, granular topics. 34 | NUM_TOPICS = 20 35 | 36 | # --- Time Frame for News Search --- 37 | # This script uses naive local datetimes 38 | 39 | # --- CHOOSE YOUR END DATE --- 40 | # This will get the date and time right now in your local timezone. 41 | # We subtract timedelta(days=1) to look at news ending yesterday. 42 | # You can change the number of days to go back further or keep it 43 | # at 0 to have the end date at current time. 44 | END_DATE = datetime.datetime.now() - datetime.timedelta(days=0) 45 | 46 | # Set the START_DATE relative to your chosen END_DATE. 
47 | # Change this number to adjust to the period you want. 48 | START_DATE = END_DATE - datetime.timedelta(days=100) 49 | 50 | # --- Output File --- 51 | # The script will create a filename based on the contract, start date and end date, 52 | # and start time and end time. 53 | OUTPUT_DIRECTORY = 'reports' # A subfolder to keep reports organized 54 | 55 | 56 | #NOTES 57 | 58 | """ 59 | You can search between specific hours and specific minutes too and aren't limited 60 | to searching between days. See example below for the code modification - 61 | 62 | END_DATE = datetime.datetime(2025, 7, 5, 16, 0, 0) # July 5, 4:00 PM 63 | # Search for the previous 2 hours 64 | START_DATE = END_DATE - datetime.timedelta(hours=2) 65 | 66 | """ -------------------------------------------------------------------------------- /news_fetcher.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from ib_insync import IB, Contract 3 | 4 | 5 | def fetch_historical_news(ib: IB, contract_details: dict, start_date: datetime, end_date: datetime) -> list: 6 | """ 7 | Makes a single request to fetch the most recent batch of historical news 8 | within a given date range. 9 | 10 | NOTE: The IBKR API limits this request to approximately 300 of the most 11 | recent articles within the specified timeframe. But I have kept a limit of 12 | 100000 13 | 14 | Args: 15 | ib: An active and connected ib_insync IB instance. 16 | contract_details: A dictionary with contract info (symbol, type, etc.). 17 | start_date: The start date for the news search window. 18 | end_date: The end date for the news search window. 19 | 20 | Returns: 21 | A list of news headline objects found. 22 | """ 23 | # 1. Get available news providers for the account 24 | print("Fetching available news providers...") 25 | providers = ib.reqNewsProviders() 26 | if not providers: 27 | print("Error: No news providers found for this account.") 28 | return [] 29 | provider_codes = '+'.join([p.code for p in providers]) 30 | print(f"Found providers: {[p.code for p in providers]}") 31 | 32 | # 2. Qualify the contract to get its conId 33 | contract = Contract( 34 | symbol=contract_details['symbol'], 35 | secType=contract_details['secType'], 36 | exchange=contract_details['exchange'], 37 | currency=contract_details['currency'] 38 | ) 39 | ib.qualifyContracts(contract) 40 | if not contract.conId: 41 | print(f"Error: Could not resolve contract for {contract_details['symbol']}.") 42 | return [] 43 | print(f"Successfully qualified contract for {contract_details['symbol']} (conId: {contract.conId})") 44 | 45 | # 3. Make a single, direct request for historical news 46 | start_str = start_date.strftime('%Y%m%d %H:%M:%S') 47 | end_str = end_date.strftime('%Y%m%d %H:%M:%S') 48 | print(f"\nRequesting news from {start_str} to {end_str}...") 49 | print("(Note: API is limited to the ~300 most recent articles in this range)") 50 | 51 | try: 52 | news_headlines = ib.reqHistoricalNews( 53 | conId=contract.conId, 54 | providerCodes=provider_codes, 55 | startDateTime=start_str, 56 | endDateTime=end_str, 57 | totalResults=100000 58 | ) 59 | except Exception as e: 60 | print(f" -> API Error fetching headlines: {e}") 61 | return [] 62 | 63 | if not news_headlines: 64 | news_headlines = [] 65 | 66 | print(f"\nTotal headlines received from API: {len(news_headlines)}") 67 | return news_headlines 68 | 69 | 70 | def get_full_article(ib: IB, headline) -> str: 71 | """ 72 | Fetches the full text of a single news article, with error handling. 
73 | 74 | Args: 75 | ib: An active and connected ib_insync IB instance. 76 | headline: The news headline object. 77 | 78 | Returns: 79 | The full text of the article as a string. 80 | """ 81 | try: 82 | news_article_body = ib.reqNewsArticle( 83 | providerCode=headline.providerCode, 84 | articleId=headline.articleId 85 | ) 86 | return news_article_body.articleText if news_article_body else "" 87 | except Exception as e: 88 | # This will catch the "Not allowed" error and handle it gracefully 89 | # print(f"\nWarning: Could not fetch article {headline.articleId}. Reason: {e}") 90 | return "Full article text not available (subscription may be required)." -------------------------------------------------------------------------------- /topic_modeler.py: -------------------------------------------------------------------------------- 1 | # topic_modeler.py 2 | 3 | import re 4 | import nltk 5 | from collections import Counter 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from sklearn.decomposition import LatentDirichletAllocation 8 | 9 | def download_nltk_packages(): 10 | """Checks if necessary NLTK packages are downloaded and gets them if not.""" 11 | required_packages = ['stopwords', 'punkt', 'wordnet', 'omw-1.4'] 12 | for package in required_packages: 13 | try: 14 | if package in ['punkt', 'wordnet', 'omw-1.4']: 15 | nltk.data.find(f'tokenizers/{package}' if package == 'punkt' else f'corpora/{package}') 16 | else: 17 | nltk.data.find(f'corpora/{package}') 18 | except LookupError: 19 | print(f"Downloading NLTK package: {package}...") 20 | nltk.download(package) 21 | 22 | def lemmatize_and_tokenize(text: str, lemmatizer): 23 | """Tokenizes and lemmatizes a string of text, returning a list of tokens.""" 24 | tokens = nltk.word_tokenize(text.lower()) 25 | return [lemmatizer.lemmatize(token) for token in tokens] 26 | 27 | def find_common_phrases(texts: list, n: int = 5, top_k: int = 10) -> list: 28 | """Finds the most common n-grams (phrases) to identify boilerplate.""" 29 | print(f"Identifying top {top_k} common {n}-word phrases to treat as stopwords...") 30 | all_ngrams = [] 31 | for text in texts: 32 | # Simple tokenization for n-gram finding 33 | tokens = re.findall(r'\b[a-zA-Z]{3,}\b', text.lower()) 34 | if len(tokens) >= n: 35 | for i in range(len(tokens) - n + 1): 36 | all_ngrams.append(tuple(tokens[i:i+n])) 37 | 38 | most_common = [ngram for ngram, count in Counter(all_ngrams).most_common(top_k)] 39 | 40 | # Add individual words from the common phrases to the stopword list 41 | boilerplate_stopwords = set() 42 | for phrase in most_common: 43 | print(f" -> Identified common phrase: {' '.join(phrase)}") 44 | for word in phrase: 45 | boilerplate_stopwords.add(word) 46 | 47 | return list(boilerplate_stopwords) 48 | 49 | def perform_topic_modeling(texts: list, num_topics: int) -> tuple[list, list]: 50 | """ 51 | Performs LDA Topic Modeling using advanced pre-processing, including 52 | lemmatization, bigrams, and automated boilerplate removal. 53 | 54 | Args: 55 | texts: A list of strings, where each string is an article's content. 56 | num_topics: The number of topics to discover. 57 | 58 | Returns: 59 | A tuple containing: 60 | - A list of topic IDs for each text. 61 | - A list of the top words for each discovered topic. 62 | """ 63 | if not texts: 64 | return [], [] 65 | 66 | download_nltk_packages() 67 | 68 | # --- 1. 
Advanced Text Cleaning --- 69 | print("\nCleaning and pre-processing article text...") 70 | 71 | # Automatically find and add common boilerplate phrases to the stopword list 72 | boilerplate_words = find_common_phrases(texts) 73 | 74 | lemmatizer = nltk.stem.WordNetLemmatizer() 75 | nltk_stop_words = list(nltk.corpus.stopwords.words('english')) 76 | 77 | # --- Create a custom list of words to ignore --- 78 | # Add any other meaningless words you find to this list. 79 | # This list WILL always be never-ending .... :p 80 | CUSTOM_STOP_WORDS = [ 81 | 'com', 'story', 'news', 'fly', 'edt', 'theflyonthewall', '00', 'yet', 82 | 'copyright', 'free', '30', 'br', 'apos', 'www', 'wsj', 'writes', 'take', 83 | 'likely', 'wants', 'et', 'according', 'would', 'basis', 'due', 'god', 84 | 'bless', 'https', 'states', 'starting', 'sent', 'instead', 'see', 'thefly', 85 | 'go', 'rest', 'permalinks', 'entry', 'php', 'another', 'event', 'events', 86 | 'like', 'well', 'may', 'us', 'final', 'noted', 'read', 'minutes', 'finished', 87 | 'last', '000', 'years', 'year', 'plans', 'set', 'weeks', 'reference', 'href', 88 | 'called', 'effects', 'near', 'says', 'say', 'make', '000', 'said', 'remains', 'also', 89 | 'seen', 'get', 'time', 'generally', 'looking', 'nice', 'post', 'yesterday', 90 | 'working', 'worked', 'works', 'made', 'great', 'gov', 'briefing', 'authors', 91 | 'div', 'following', 'told', 'made', 'tell', 'comments', 'good', 'speaking', 92 | 'http', 'able', 'place', 'many', 'slipped', 'shed', 'rose', 'higher', 'lower', 93 | 'gains', 'falls', 'rising', 'falling', 'snapped', 'climbs', 'declines', 'closes', 94 | 'fox', 'reuters', 'tells', 'interview', 'bring', 'reporter', 'work', 'long', 95 | 'effect', 'previously', 'move', 'going', 'mod', 'link', 'avoiding', 'new', 96 | 'old', 'done', 'want', 'along', 'accept', 'could', 'stance', 'announces', 97 | 'meanwhile', 'marginally', 'fresh', 'buzz', 'dow', 'jones', 'trading', 'share', 98 | 'pre', 'believed', 'method', 'expected', 'several', 'suggested', 'observed', 99 | 'saying', 'give', 'really', 'earlier', 'think', 'live', 'know', 'held', 100 | 'familiar', 'include', 'citing', 'keep', 'know', 'opted', 'among', 'known', 101 | 'slightly', 'stated', 'shame', 'amp' 102 | ] 103 | 104 | stop_words = nltk_stop_words + CUSTOM_STOP_WORDS + boilerplate_words 105 | 106 | # --- 2. Lemmatization --- 107 | # Lemmatize AFTER finding boilerplate to catch the original phrases 108 | lemmatized_texts = [' '.join(lemmatize_and_tokenize(text, lemmatizer)) for text in texts] 109 | 110 | # --- 3. Vectorization using TF-IDF and Bigrams --- 111 | 112 | print(f"Performing Topic Modeling to discover {num_topics} topics...") 113 | 114 | # TF-IDF is more advanced than a simple count. It weighs words based on 115 | # how important they are to a specific document, not just how frequent they are. 116 | # Also, we ignore words that appear in less than 2 documents or more than 85% of documents. 117 | # The token_pattern considers words of 3+ letters. 118 | vectorizer = TfidfVectorizer( 119 | max_df=0.85, min_df=2, stop_words=stop_words, 120 | token_pattern=r'\b[a-zA-Z]{3,}\b' 121 | ) 122 | 123 | dtm = vectorizer.fit_transform(lemmatized_texts) 124 | 125 | # 4. Build and fit the LDA model 126 | lda = LatentDirichletAllocation(n_components=num_topics, random_state=42) 127 | lda.fit(dtm) 128 | 129 | # 5. Get the dominant topic for each document 130 | topic_results = lda.transform(dtm) 131 | dominant_topic_per_document = topic_results.argmax(axis=1) 132 | 133 | # 6. 
Get the top words/phrases for each topic for display 134 | feature_names = vectorizer.get_feature_names_out() 135 | top_words_per_topic = [] 136 | for topic_idx, topic in enumerate(lda.components_): 137 | # Get the top 10 words/phrases for this topic 138 | top_words = [feature_names[i].replace(' ', '_') for i in topic.argsort()[:-10 - 1:-1]] 139 | top_words_per_topic.append(top_words) 140 | print(f" -> Discovered Topic #{topic_idx}: {', '.join(top_words)}") 141 | 142 | return dominant_topic_per_document.tolist(), top_words_per_topic -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import datetime 3 | import pandas as pd 4 | from ib_insync import IB, util 5 | 6 | # Import our custom modules and configuration 7 | import config 8 | from news_fetcher import fetch_historical_news, get_full_article 9 | from sentiment_analyzer import analyze_sentiment 10 | from topic_modeler import perform_topic_modeling 11 | 12 | def main(): 13 | """ 14 | Main function to orchestrate the news fetching and analysis process. 15 | """ 16 | ib = IB() 17 | all_symbols_results = [] # Master list to hold results from all symbols 18 | 19 | try: 20 | # --- Connect to IBKR --- 21 | print(f"Connecting to IBKR on {config.IB_HOST}:{config.IB_PORT}...") 22 | ib.connect(config.IB_HOST, config.IB_PORT, clientId=config.CLIENT_ID) 23 | print("Connection successful.") 24 | 25 | # Loop through each symbol specified in the config file 26 | for symbol in config.CONTRACT_SYMBOLS: 27 | print(f"\n{'='*20} Processing: {symbol} {'='*20}") 28 | 29 | # --- Fetch News Headlines --- 30 | contract_details = { 31 | 'symbol': symbol, # Use the symbol from the loop 32 | 'secType': config.CONTRACT_TYPE, 33 | 'exchange': config.EXCHANGE, 34 | 'currency': config.CURRENCY 35 | } 36 | all_headlines = fetch_historical_news(ib, contract_details, config.START_DATE, config.END_DATE) 37 | 38 | if not all_headlines: 39 | print(f"No headlines found for {symbol}. 
Skipping.") 40 | continue 41 | 42 | # --- Analyze ALL articles and flag keyword matches --- 43 | print(f"\nAnalyzing all {len(all_headlines)} articles for {symbol}...") 44 | 45 | # Sort headlines by time, newest first 46 | sorted_headlines = sorted(all_headlines, key=lambda h: h.time, reverse=True) 47 | 48 | # --- Batch processing logic to avoid rate limiting --- 49 | BATCH_SIZE = 200 # Process 200 articles at a time 50 | BATCH_PAUSE = 2 # Pause for 2 seconds between batches 51 | 52 | for i in range(0, len(sorted_headlines), BATCH_SIZE): 53 | batch = sorted_headlines[i:i + BATCH_SIZE] 54 | print(f"\n--- Processing Batch {i//BATCH_SIZE + 1}/{len(sorted_headlines)//BATCH_SIZE + 1} ---") 55 | 56 | for headline in batch: 57 | if not (config.START_DATE <= headline.time <= config.END_DATE): 58 | continue 59 | 60 | # Get full article text 61 | article_text = get_full_article(ib, headline) 62 | content_to_search = (headline.headline + ' ' + article_text).lower() 63 | 64 | # Determine if the article matches the keywords 65 | matches_keywords = any(keyword.lower() in content_to_search for keyword in config.KEYWORDS_TO_SEARCH) 66 | 67 | # Perform sentiment analysis on every article 68 | sentiment, polarity = analyze_sentiment(content_to_search) 69 | 70 | # Add to results 71 | ts = headline.time 72 | all_symbols_results.append({ 73 | 'Symbol': symbol, 74 | 'Date': ts.strftime('%Y-%m-%d'), 75 | 'Time': ts.strftime('%H:%M:%S'), 76 | 'Provider': headline.providerCode, 77 | 'Matches_Keywords': matches_keywords, 78 | 'Sentiment': sentiment, 79 | 'Polarity': round(polarity, 4), 80 | 'Headline': headline.headline, 81 | 'Article': article_text.replace('\n', ' ').strip() 82 | }) 83 | 84 | # progress indicator for batches 85 | print(f" -> Processed: {headline.headline[:60]}...", end='\r') 86 | ib.sleep(0.1) # Small pause between each article 87 | 88 | # After a batch is done, check if it's not the very last one 89 | if i + BATCH_SIZE < len(sorted_headlines): 90 | print(f"\n--- Batch complete. Pausing for {BATCH_PAUSE} seconds to respect API limits... 
---") 91 | ib.sleep(BATCH_PAUSE) 92 | 93 | print(f"\n\n{'='*20} Finished processing: {symbol} {'='*20}") 94 | 95 | # Don't pause after the last symbol 96 | if symbol != config.CONTRACT_SYMBOLS[-1]: 97 | print("Pausing before next symbol to respect API rate limits...") 98 | ib.sleep(5) #Longer pause between symbols 99 | 100 | # --- Perform Topic Modeling on ALL collected articles --- 101 | if all_symbols_results: 102 | # Create a list of just the article texts to feed into the model 103 | all_article_texts = [result['Article'] for result in all_symbols_results] 104 | 105 | # --- Define date strings once for use in all filenames --- 106 | start_str = config.START_DATE.strftime('%Y%m%d-%H%M%S') 107 | end_str = config.END_DATE.strftime('%Y%m%d-%H%M%S') 108 | 109 | # Perform the topic modeling 110 | topic_ids, topics = perform_topic_modeling(all_article_texts, num_topics=config.NUM_TOPICS) 111 | 112 | # Add the discovered topic ID to each result 113 | for i, result in enumerate(all_symbols_results): 114 | result['Topic_ID'] = topic_ids[i] 115 | 116 | # --- Save the topic summary to a text file --- 117 | # Create a formatted string with the topic details 118 | topic_summary_lines = ["--- Discovered Topic Summary ---"] 119 | for topic_id, top_words in enumerate(topics): 120 | topic_summary_lines.append(f"Topic #{topic_id}: {', '.join(top_words)}") 121 | 122 | # Define the path for the summary file 123 | summary_filename = f"topic_summary_from_{start_str}_to_{end_str}.txt" 124 | summary_filepath = os.path.join(config.OUTPUT_DIRECTORY, summary_filename) 125 | 126 | # Write the summary to the file 127 | with open(summary_filepath, 'w') as f: 128 | f.write('\n'.join(topic_summary_lines)) 129 | print(f"\nSaved topic summary to '{summary_filepath}'") 130 | 131 | # --- Save Combined Report to CSV --- 132 | if all_symbols_results: 133 | print("\nSaving combined report to CSV...") 134 | if not os.path.exists(config.OUTPUT_DIRECTORY): 135 | os.makedirs(config.OUTPUT_DIRECTORY) 136 | 137 | filename = f"news_report_combined_from_{start_str}_to_{end_str}.csv" 138 | filepath = os.path.join(config.OUTPUT_DIRECTORY, filename) 139 | 140 | # Create and save DataFrame 141 | df = pd.DataFrame(all_symbols_results) 142 | # Reorder columns to put Symbol first 143 | df = df[['Symbol', 'Date', 'Time', 'Provider', 'Matches_Keywords', 'Topic_ID', 'Sentiment', 'Polarity', 'Headline', 'Article']] 144 | df.to_csv(filepath, index=False, encoding='utf-8') 145 | print(f"\nSuccessfully saved the report to '{os.path.abspath(filepath)}'") 146 | else: 147 | print("\nNo articles found across all symbols. No CSV file was generated.") 148 | 149 | except ConnectionRefusedError: 150 | print(f"\nError: Connection refused. Is TWS or IB Gateway running on {config.IB_HOST}:{config.IB_PORT}?") 151 | except Exception as e: 152 | print(f"\nAn unexpected error occurred: {e}") 153 | finally: 154 | if ib.isConnected(): 155 | print("Disconnecting from IBKR.") 156 | ib.disconnect() 157 | 158 | if __name__ == "__main__": 159 | # ib_insync requires an asyncio event loop to run. 160 | # util.startLoop() is a helper for running it in scripts. 161 | util.startLoop() 162 | main() -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # IBKR Historical News Analyzer 2 | 3 | A powerful and robust Python tool to fetch, analyze, and perform advanced topic modeling & sentiment analysis on historical news data from Interactive Brokers. 
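Everything about a run, from the symbols and date range to the number of LDA topics, is controlled from `config.py`. As a quick orientation, a typical setup looks something like the sketch below (the values mirror the defaults shipped in this repository and are purely illustrative; the keyword list is shortened):

```python
# config.py (excerpt): illustrative values only, see the full file for all options.
import datetime

IB_HOST = '127.0.0.1'
IB_PORT = 7497                               # 7497 = TWS paper trading
CLIENT_ID = 123                              # any unique client ID

CONTRACT_SYMBOLS = ['SPY', 'QQQ', 'DIA']     # one or many symbols
KEYWORDS_TO_SEARCH = ['earnings', 'fed', 'inflation', 'rate cut']   # shortened here
NUM_TOPICS = 20                              # how many LDA topics to discover

END_DATE = datetime.datetime.now()
START_DATE = END_DATE - datetime.timedelta(days=100)
OUTPUT_DIRECTORY = 'reports'
```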
4 | 5 | This repository contains the full development history of the project. The latest stable version is **v1.1**. 6 | 7 | --- 8 | 9 | ### Official Releases 10 | 11 | You can browse the code, documentation, and download the source for each official version by clicking the links below. 12 | 13 | | Version | Key Feature | Browse Files & README | View Release Notes & Downloads | 14 | | :------ | :-------------------------- | :------------------------------------------------------------------------- | :------------------------------------------------------------------------- | 15 | | **V1.1** | **Advanced Topic Modeling** | [**Browse V1.1 Files**](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/tree/v1.1) | [**V1.1 Release Notes**](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/releases/tag/v1.1) | 16 | | **V1.0** | **Stable Harvester** | [Browse V1.0 Files](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/tree/v1.0) | [V1.0 Release Notes](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/releases/tag/v1.0) | 17 | 18 | --- 19 | 20 | ## About the Latest Version (V1.1) 21 | 22 | This project has evolved from a simple data harvester into a sophisticated analysis engine. It connects to the IBKR API, downloads news for multiple symbols over a specified date range, and then applies a professional-grade Natural Language Processing (NLP) pipeline to each article. It also analyzes every article for sentiment and flags articles that match your keywords. The final output is a single, rich CSV file containing sentiment scores, keyword flags, and a discovered "Topic ID" for each article, enabling deep thematic analysis, further analysis or visualization. 23 | 24 | ## Features 25 | 26 | - **Multi-Contract Support:** Fetch news for multiple symbols (e.g., 'SPY', 'QQQ', 'AAPL') in a single, automated run. 27 | - **Robust API Rate-Limit Handling:** Politely handles API limits by processing articles in configurable batches with pauses, ensuring reliable data collection without being blocked. 28 | - **✨ New in V1.1: Advanced NLP Pre-processing:** Utilizes a professional pipeline including boilerplate removal, lemmatization (reducing words to their root form), and bigram detection to produce cleaner data for analysis. 29 | - **✨ New in V1.1: Advanced Topic Modeling:** Implements Latent Dirichlet Allocation (LDA) with TF-IDF vectorization to automatically discover and categorize the underlying themes in the news articles. 30 | - **Sentiment Scoring:** Uses `TextBlob` to perform sentiment analysis on every article, providing `Sentiment` (Positive, Negative, Neutral) and `Polarity` score columns. 31 | - **Keyword Flagging:** Includes a `Matches_Keywords` (True/False) column. This allows you to either analyze all news for a symbol or easily filter for articles relevant to your specific interests in a downstream tool like Pandas or Excel. 32 | - **Fully Configurable:** Easily change all parameters (dates, keywords, contract symbols, batch sizes, number of topics for LDA, etc.) in a simple `config.py` file. 33 | - **Combined Outputs:** Saves all results from all symbols into a single, detailed CSV file and creates a separate `topic_summary.txt` file that describes the discovered topics for each run. 34 | 35 | ## What's New in v1.1: Advanced Topic Modeling 36 | 37 | This version introduces a powerful topic modeling feature using Latent Dirichlet Allocation (LDA) with an advanced pre-processing pipeline. 
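Condensed to its essentials, that pipeline looks roughly like the sketch below. This is a simplified, illustrative version of what `topic_modeler.py` does: the automated boilerplate detection and the large custom stopword list are left out, and `quick_topics` is just a placeholder name. The points that follow describe each stage in more detail.

```python
# Simplified, illustrative sketch of the topic modeling pipeline.
# Assumes the NLTK 'punkt' and 'wordnet' data are already downloaded
# (the real pipeline in topic_modeler.py fetches them automatically).
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def quick_topics(texts, num_topics=20):
    # Lemmatize so that e.g. "rates" and "rate" collapse to the same token
    lemmatizer = nltk.stem.WordNetLemmatizer()
    cleaned = [
        ' '.join(lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(doc.lower()))
        for doc in texts
    ]

    # TF-IDF weighting; drop terms that are too common or too rare
    vectorizer = TfidfVectorizer(
        max_df=0.85, min_df=2,
        stop_words='english',                 # the real pipeline adds many custom stopwords
        token_pattern=r'\b[a-zA-Z]{3,}\b',    # keep words of 3+ letters only
    )
    dtm = vectorizer.fit_transform(cleaned)

    # LDA discovers num_topics clusters of co-occurring terms
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    doc_topics = lda.fit_transform(dtm)       # one topic distribution per article
    return doc_topics.argmax(axis=1)          # dominant Topic_ID for each article
```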
38 |
39 | - **Automated Boilerplate Detection:** Automatically finds and removes common, repetitive phrases (e.g., news provider disclaimers) to reduce noise.
40 |
41 | - **Advanced Text Cleaning:** Uses Lemmatization and Bigram detection to treat words like "rates" and "rate" as the same concept, and to understand that "rate_hike" is a single, important idea.
42 |
43 | - **TF-IDF Vectorization:** Employs TF-IDF to intelligently weigh words, giving more importance to terms that are significant to a specific document rather than just frequent overall.
44 |
45 | - **Tunable Model:** Allows the user to easily configure the number of topics to discover, enabling both high-level and granular thematic analysis.
46 |
47 | ## How v1.1 Works
48 |
49 | The script connects to a running instance of Interactive Brokers' Trader Workstation (TWS) or IB Gateway. For each symbol in the configuration, it makes a request for the ~300 most recent historical news headlines within the specified date range. It then processes these headlines in batches, requesting the full article text for each one while pausing to respect API rate limits. Each article is then cleaned using advanced NLP techniques, and these cleaned articles are analyzed with LDA and TF-IDF to discover and categorize the underlying themes.
50 |
51 | The articles are further analyzed for sentiment and checked against the keyword list before being added to a master results list, which is then saved to a single CSV file.
52 |
53 | ## How Topic Modeling Works & Tuning Guide
54 |
55 | The script first collects all news articles. Then, it cleans the text by removing boilerplate, common "stopwords," and noise like numbers. It then uses TF-IDF to represent the importance of words in each document. Finally, the LDA algorithm analyzes these representations to discover a set number of underlying "topics" (i.e., clusters of words that frequently appear together). Each article in the final CSV is assigned a `Topic_ID` corresponding to its dominant theme.
56 |
57 | ### Tuning Your Topic Model
58 |
59 | The quality of these topics is highly dependent on a few key settings which you can tune in your project:
60 |
61 | * **`NUM_TOPICS`** (in `config.py`): This is the most important setting. It defines how many distinct themes the model should look for. A smaller number (~10-15) will produce broad themes. A larger number (25+) will produce more specific, granular themes. It is recommended to start with a smaller number and increase it as you refine your data cleaning.
62 | * **`CUSTOM_STOP_WORDS`** (in `topic_modeler.py`): This is a powerful list where you can add domain-specific words you want the model to ignore. This is the best place to add generic market commentary verbs (e.g., `rose`, `fell`, `climbed`) or other noise you discover.
63 | * **`max_df` / `min_df`** (in `topic_modeler.py`): These parameters in the `TfidfVectorizer` are powerful filters.
64 |     * `max_df=0.85` tells the model to ignore words that appear in more than 85% of all articles. Lowering this value is an effective way to remove overly common words and force the model to find more nuanced themes.
65 |     * `min_df=2` tells the model to ignore words that appear in fewer than 2 documents, which helps remove rare words and potential typos.
66 |
67 | ## Architectural Choice: `ib_insync` vs. Native `ibapi`
68 |
69 | This project is built using the `ib_insync` library rather than the native `ibapi` for a specific architectural reason. The short sketch below shows the request/response style this enables; the two points that follow it spell out the difference.
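The snippet is only an illustration of that style, condensed from what `news_fetcher.py` actually does; the port, client ID, contract, and look-back window are example values.

```python
# Illustrative sketch of the request/response flow used throughout this project
# (condensed from news_fetcher.py). Port, client ID and contract are example values.
import datetime
from ib_insync import IB, Contract

ib = IB()
ib.connect('127.0.0.1', 7497, clientId=42)         # blocks until the connection is up

contract = Contract(symbol='SPY', secType='STK', exchange='SMART', currency='USD')
ib.qualifyContracts(contract)                       # resolves the contract's conId in place

providers = ib.reqNewsProviders()                   # a plain list, no callbacks to wire up
end = datetime.datetime.now()
start = end - datetime.timedelta(days=5)

headlines = ib.reqHistoricalNews(
    conId=contract.conId,
    providerCodes='+'.join(p.code for p in providers),
    startDateTime=start.strftime('%Y%m%d %H:%M:%S'),
    endDateTime=end.strftime('%Y%m%d %H:%M:%S'),
    totalResults=10,
)
for headline in headlines or []:                    # the request may return nothing
    print(headline.time, headline.headline)

ib.disconnect()
```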
70 | 71 | * The **native `ibapi`** is a low-level, purely **event-driven** system. It is powerful for real-time applications that need to react instantly to live data streams pushed from the server. 72 | * **`ib_insync`** is a higher-level library that provides a clean, **synchronous-style (request/response)** interface. It handles the complex asynchronous background work, making it the ideal tool for tasks that follow a linear logic, such as "request a list of data, then process each item." 73 | 74 | For this project, which focuses on harvesting and analyzing historical data, the request/response model of `ib_insync` is the superior choice. It allows the code to be simpler, more readable, and more focused on the core tasks of data processing and analysis, rather than on managing complex event loops and callbacks. 75 | 76 | ## Prerequisites 77 | 78 | - Python 3.8+ 79 | - An Interactive Brokers account (live or paper) 80 | - Trader Workstation (TWS) or IB Gateway installed and running. 81 | 82 | ## Setup & Installation 83 | 84 | 1. **Clone the repository (or download the ZIP):** 85 | ```bash 86 | git clone https://github.com/DanielPaulDsouza/ibkr-news-analyzer.git 87 | cd ibkr-news-analyzer 88 | ``` 89 | 90 | 2. **Create a virtual environment (recommended):** 91 | ```bash 92 | python -m venv venv 93 | source venv/bin/activate # On Windows use `venv\Scripts\activate` 94 | ``` 95 | 96 | 3. **Install the required libraries:** 97 | ```bash 98 | pip install -r requirements.txt 99 | ``` 100 | 101 | ## How to Use 102 | 103 | 1. **Log in to TWS or IB Gateway.** Make sure the API connection is enabled. 104 | - In TWS: Go to `File -> Global Configuration -> API -> Settings` and check "Enable ActiveX and Socket Clients". 105 | 106 | 2. **Edit the configuration file.** Open `config.py` and adjust the settings to your needs. You can change the date range, the keywords, and the list of stock symbols. 107 | 108 | 3. **Run the application.** From your terminal, simply run: 109 | ```bash 110 | python main.py 111 | ``` 112 | 113 | 4. **Check the output.** The script will print its progress in the terminal. Once finished, you will find a CSV file & a .txt file in the `reports` folder with the combined, analyzed news data. 114 | 115 | ## Output CSV Columns 116 | 117 | | Column | Description | 118 | | ---------------- | ------------------------------------------------------------------------ | 119 | | `Symbol` | The contract symbol (e.g., 'SPY') the news is for. | 120 | | `Date` | The publication date of the article. | 121 | | `Time` | The publication time of the article. | 122 | | `Provider` | The news provider code (e.g., 'FLY', 'BRFG'). | 123 | | `Matches_Keywords` | `True` if the article contains any of your keywords, otherwise `False`. | 124 | | **`Topic_ID`** | **(New in V1.1)** The ID of the topic cluster the article belongs to. | 125 | | `Sentiment` | The sentiment classification: 'Positive', 'Negative', or 'Neutral'. | 126 | | `Polarity` | The sentiment polarity score (from -1.0 for negative to 1.0 for positive). | 127 | | `Headline` | The headline of the news article. | 128 | | `Article` | The full text of the news article (or an error message if unavailable). | 129 | 130 | ## Project Roadmap (Future Development) 131 | 132 | This project is under active development. 
The following major features are planned for future versions: 133 | 134 | ### V2.0: The Advanced Analyzer 135 | 136 | - **Feature:** Upgrade the sentiment analysis engine from `TextBlob` to a finance-specific transformer model like `FinBERT`. Merge the Topic Modeling and FinBERT features into a single, powerful pipeline. 137 | - **Goal:** To create a comprehensive analysis tool that can not only determine the sentiment of news with high accuracy but also identify the specific economic or financial themes driving that sentiment. 138 | 139 | ## Disclaimer 140 | 141 | This tool is for educational and informational purposes only. Financial markets are complex and risky. Past performance is not indicative of future results. The author is not responsible for any financial losses incurred as a result of using this software. Always do your own research. --------------------------------------------------------------------------------