Structuring News Flow for Sentiment and Impact Analysis

Introduction

To make informed investment decisions, it is crucial to stay updated on the latest economic trends, market movements, individual companies, industries, and more. Political events, company announcements, regulatory changes, and other significant news must all be considered.

Large banks and hedge funds employ numerous people to process news streams and respond appropriately to events. Smaller companies and independent traders lack the resources to cover such a vast amount of news, forcing them to spend a significant portion of their time on this task.

Everyone invests a lot of time in this process, yet coverage remains limited, since it is impossible to review everything. The ability of large language models (LLMs) to effectively process textual information opens up new possibilities for significantly enhancing this process.

In this article, we discuss how we tackled this challenge for a private hedge fund.


Data Sources

When collecting data and preparing it for subsequent analysis, it is crucial to differentiate original sources from reprints, interpretations, opinions, etc. Data from original sources should carry more weight during subsequent processing and analysis.
  • Major News Providers
    Primary data sources with broad coverage, high quality, and fast delivery.
    Examples: Bloomberg, Reuters, etc.
  • News Wire Services
    Publish official press releases, providing primary source information.
    Examples: BusinessWire, PR Newswire, GlobeNewswire, etc.
  • Specialized Financial Publications
    Sources of analytical information, research reports, and opinions from influential figures.
    Examples: Wall Street Journal, The Economist, Financial Times, etc.
  • Financial Media Portals and Aggregators
    Offer editorial content and materials, gather data from various sources, and provide collaborative analytics and opinions.
    Examples: Yahoo Finance, Barron's, The Motley Fool, Investing.com, etc.
  • General News Aggregators
    Collect news from multiple sources and present it in one place for easy access.
    Examples: Google News, Flipboard, etc.
  • General News Services
    Broad coverage of news across various sectors, often including finance.
    Examples: CNN, BBC, ABC News, etc.
  • Social Media
    Platforms where news spreads quickly and investors can gauge public sentiment.
    Examples: Twitter, LinkedIn, Reddit, etc.

Information Collection Methods

  • API

    A primary method for working with paid services, which can use either push or pull models.

    For time-critical data processing, push might be preferred, depending on the service's implementation.
  • RSS Feeds

    A widespread method supported by most websites, allowing for quick reception of headlines and descriptions.

    It's very versatile and easy to implement, though it doesn't provide access to the full content (see the polling sketch after this list).
  • Social Networks APIs

    Most social networks provide programmatic interfaces for accessing content.

    These often have strict rate limits that prevent gathering large volumes of data, but they are excellent for targeted monitoring of specific news streams.
  • Web Crawling

    The least reliable and most time-consuming method. It is crucial to consider terms and conditions, as not all websites allow content scraping.

    Despite its challenges and limitations, this method is indispensable for achieving broad coverage, especially when finding rare information not covered by major news agencies.

    This is particularly important when dealing with news about small-cap companies.
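
As a sketch of the RSS approach, the snippet below polls feeds with the third-party feedparser library and yields only unseen entries. The feed URL is a hypothetical placeholder.

```python
# A minimal RSS polling sketch using the third-party feedparser library
# (pip install feedparser); the feed URL is a hypothetical placeholder.
import time
import feedparser

FEED_URLS = ["https://example.com/news/rss"]  # hypothetical feed
seen_ids: set[str] = set()

def poll_feeds():
    """Fetch each feed and yield entries not seen before."""
    for url in FEED_URLS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            uid = entry.get("id") or entry.get("link")
            if uid and uid not in seen_ids:
                seen_ids.add(uid)
                yield {
                    "title": entry.get("title", ""),
                    "summary": entry.get("summary", ""),  # often just a teaser
                    "published": entry.get("published", ""),
                    "link": entry.get("link", ""),
                }

while True:
    for item in poll_feeds():
        print(item["published"], item["title"])
    time.sleep(60)  # RSS is pull-based: poll at a modest interval
```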

Main Objectives of Information Collection and Processing

In the project described in this article, the client's primary objectives were:
  • Automation of News Sentiment Analysis for Individual Companies

    While there are existing solutions for sentiment analysis, using LLMs for this task offers new possibilities compared to traditional NLP algorithms.

    Beyond language tone (negative/neutral/positive), LLMs' semantic analysis capabilities allow for determining a news item's impact and significance and which specific business segment it affects (see the sketch after this list).
  • Automation of News Flow Analysis by Specific Segments
    Many specialized resources allow for effective filtering of news and events by broad categories, such as market, economy, and company. However, there's no convenient means for monitoring news by more specific segments.

    General news aggregators (like Google) offer broader filtering capabilities, but this forces users to sift through vast amounts of information irrelevant for investment purposes.
  • Development of an Interactive Analytical Tool
    To aid analysts in tracking major news events through custom semantic filters.
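
To illustrate the first objective, here is a minimal sketch of an LLM prompt that returns impact, significance, and the affected segment alongside tone. The schema is illustrative, and call_llm stands for any prompt-to-text helper around whatever chat-completion client is in use.

```python
# Sketch: LLM-based sentiment that goes beyond tone, returning impact,
# significance, and the affected business segment. call_llm is a hypothetical
# prompt -> text helper; the JSON schema is illustrative.
import json

def analyze_news(ticker: str, article: str, call_llm) -> dict:
    prompt = (
        f"Analyze this news about {ticker}. Reply with JSON keys: "
        '"sentiment" (negative/neutral/positive), "impact" (-5..5), '
        '"significance" (0..10), "segment" (affected business segment).\n'
        f"Article:\n{article}"
    )
    return json.loads(call_llm(prompt))
```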

Results of news flow analysis can be used for:
  1. Developing independent trading strategies
  2. Enhancing risk management systems for both automated and manual trading
  3. Accelerating the work of analysts, traders, and portfolio managers in finding new ideas and updating their understanding of the current situation


Technological Implementation Details


The news flow analysis process consists of the following main steps:
  1. Information Search and Retrieval
  2. Pre-processing of Individual News
  3. Analysis of News Groups: deduplication, filtering, identifying key events, determining interconnections
  4. Processing User Requests

Information Search and Retrieval

This step utilizes methods discussed in the "Information Collection Methods" section.

Key tasks include:

1. Prioritizing Information for Processing
Resource constraints make it impractical and economically inefficient to process all available information. Prioritization is essential due to the time factor, as urgent news must be processed first.

Effective strategies include:
  • Scoring potential significance using LLMs based on headlines and brief descriptions, which is faster than processing the full text and can save resources (see the sketch after this list).
  • Deduplication through semantic search (if the original news has already been processed, interpretations/reprints can be postponed).
  • Using market information to manage processing priorities. Significant price changes should prioritize information processing for those companies.
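
A minimal sketch of the headline-scoring strategy, assuming the OpenAI Python SDK; the model name, prompt, and threshold are illustrative choices, not the project's actual configuration.

```python
# Sketch: score headline significance with an LLM before deciding whether to
# fetch and process the full article. Requires OPENAI_API_KEY in the
# environment; model, prompt, and threshold are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def score_headline(headline: str, description: str) -> dict:
    prompt = (
        "You are a financial news triage assistant. Given a headline and a "
        "short description, reply with JSON containing the keys "
        '"significance" (integer 0-10), "tickers" (list of symbols), and '
        '"urgent" (boolean).\n'
        f"Headline: {headline}\nDescription: {description}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# Only items scoring above a threshold get full-text processing.
verdict = score_headline(
    "Acme Corp receives FDA approval for lead drug candidate",  # hypothetical
    "Approval covers the US market; launch expected in Q3.",
)
if verdict["urgent"] or verdict["significance"] >= 7:
    pass  # enqueue for full processing
```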

2. Discovery of Additional, Unique Information
When potentially interesting events are detected, LLMs can generate additional search engine queries to expand the analysis context and improve processing results.
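
A sketch of such query expansion; call_llm again stands for any prompt-to-text helper.

```python
# Sketch: when a potentially interesting event is detected, ask an LLM for
# follow-up search-engine queries to widen the analysis context. call_llm is
# a hypothetical prompt -> text helper.
def expand_event_context(event_summary: str, call_llm) -> list[str]:
    prompt = (
        "An event of potential market significance was detected:\n"
        f"{event_summary}\n"
        "Propose up to 5 short search-engine queries that would surface "
        "background, prior coverage, or related filings. One query per line."
    )
    return [q.strip() for q in call_llm(prompt).splitlines() if q.strip()]
```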

Pre-processing Individual News

  • Determine the type of received information (e.g., official company press release, analysts' opinion, lawsuit, political news, economic news)
  • Identify main topics and keywords
  • Highlight significant content: dates, numerical indicators, key entities (companies, people, institutions, etc.)
  • Generate embeddings for subsequent semantic search based on the highlights

It's crucial to use appropriate embedding methods, as different models are optimized for different tasks. In this case, multiple embedding models were used: one for clustering and another for semantic search.
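
For illustration, a dual-model setup with sentence-transformers might look like this; the model choices are ours, not necessarily those used in the project.

```python
# Sketch of the dual-model setup described above: one embedding model for
# clustering, a different one for semantic search. Model names are
# illustrative choices (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

cluster_model = SentenceTransformer("all-MiniLM-L6-v2")           # compact, fast
search_model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # retrieval-tuned

highlights = [
    "Acme Corp Q2 revenue up 18% YoY; raises full-year guidance",
    "Regulator opens antitrust probe into Acme Corp cloud unit",
]

cluster_vecs = cluster_model.encode(highlights, normalize_embeddings=True)
search_vecs = search_model.encode(highlights, normalize_embeddings=True)
```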

The information is stored in vector storage. Choosing the right vector storage implementation is vital, considering the need to store a large amount of metadata alongside the embeddings.

Effective handling of metadata is essential for search and filtering tasks.

Based on our experience, many popular vector databases struggle with this task, and performance drops significantly with relatively small data volumes (~50GB). While we don't publicly share our benchmarks, we are open to discussing our experiences upon request.
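
The article doesn't name the vector store used in the project; the sketch below uses Chroma purely to illustrate storing rich metadata alongside embeddings and filtering on it at query time.

```python
# Sketch of embeddings stored with metadata, plus a metadata-filtered query.
# Chroma (pip install chromadb) stands in for the actual store used.
import random
import chromadb

def fake_vec(dim: int = 384) -> list[float]:
    return [random.random() for _ in range(dim)]  # stand-in for a real embedding

client = chromadb.Client()
news = client.create_collection("news")

news.add(
    ids=["n1"],
    embeddings=[fake_vec()],
    documents=["Acme Corp Q2 revenue up 18% YoY"],
    metadatas=[{"ticker": "ACME", "type": "press_release",
                "published": "2024-05-01T13:30:00Z", "source": "example.com"}],
)

# Semantic search restricted by metadata: only press releases about ACME.
hits = news.query(
    query_embeddings=[fake_vec()],
    n_results=10,
    where={"$and": [{"ticker": "ACME"}, {"type": "press_release"}]},
)
```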

Analyzing News Groups

Topic Modeling

Necessary for structuring the collected information, which is crucial for subsequent analysis and for identifying the most actively discussed topics.

There are many algorithms for this task, such as LDA (Latent Dirichlet Allocation) and LDA + LLM combinations.

Our experience shows that iterative clustering of embeddings followed by LLM analysis for topic formation works best: it is more efficient and produces higher-quality topics.
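
A single-pass simplification of this approach is sketched below (the real pipeline clusters iteratively); KMeans and the prompt are illustrative, and call_llm is a hypothetical prompt-to-text helper.

```python
# Sketch: group embeddings with KMeans, then ask an LLM to name each cluster
# from sample headlines. A single-pass simplification of the iterative
# clustering described above.
import numpy as np
from sklearn.cluster import KMeans

def label_topics(embeddings: np.ndarray, headlines: list[str],
                 call_llm, n_topics: int = 20) -> dict[int, str]:
    labels = KMeans(n_clusters=n_topics, n_init=10,
                    random_state=0).fit_predict(embeddings)
    topics = {}
    for k in range(n_topics):
        sample = [h for h, lab in zip(headlines, labels) if lab == k][:10]
        topics[k] = call_llm(
            "Give a short topic name (max 6 words) for these headlines:\n"
            + "\n".join(sample)
        ).strip()
    return topics
```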

Deduplication

Essential for performance, as there's no point in processing the same information multiple times.

However, it's crucial not to lose information: different articles may present the same facts but draw opposite conclusions, which matters for subsequent analysis. Tracking publication timestamps helps identify the original source of information.

Deduplicated information should retain all of its sources so that analysts can trace everything back. The number and quality of sources also serve as an additional criterion of an event's significance, analogous to the PageRank used by search engines.
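
A minimal sketch of source-preserving deduplication over normalized embeddings; the similarity threshold is illustrative and needs tuning.

```python
# Sketch: semantic deduplication that never discards a source. Items whose
# embeddings exceed a cosine-similarity threshold merge into one event whose
# source list keeps growing; the earliest timestamp marks the likely original.
import numpy as np

THRESHOLD = 0.92  # illustrative
events: list[dict] = []  # each: {"vec", "sources", "first_seen"}

def add_item(vec: np.ndarray, source: str, published: str) -> dict:
    vec = vec / np.linalg.norm(vec)
    for ev in events:
        if float(ev["vec"] @ vec) >= THRESHOLD:   # near-duplicate of known event
            ev["sources"].append(source)          # retain every source
            ev["first_seen"] = min(ev["first_seen"], published)  # ISO-8601 sorts lexically
            return ev
    ev = {"vec": vec, "sources": [source], "first_seen": published}
    events.append(ev)
    return ev
```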

Updating the Financial Knowledge Graph

This graph models the interconnections of all collected information. It improves the algorithmic efficiency of search and information processing and allows LLMs to automatically seek out the information needed to analyze an event and its potential impact.
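
A sketch of such an update, with networkx standing in for a production graph store; the node kinds and relation names are an illustrative schema, not the project's.

```python
# Sketch of a knowledge-graph update. Entities become nodes, LLM-extracted
# relations become typed edges, and each event links to what it mentions.
import networkx as nx

G = nx.MultiDiGraph()

def add_event(event_id: str, entities: list[str],
              relations: list[tuple[str, str, str]]) -> None:
    """relations: (subject, predicate, object) triples extracted by the LLM."""
    G.add_node(event_id, kind="event")
    for e in entities:
        G.add_node(e, kind="entity")
        G.add_edge(event_id, e, rel="mentions")
    for subj, pred, obj in relations:
        G.add_edge(subj, obj, rel=pred, evidence=event_id)

add_event("evt-001", ["Acme Corp", "FDA"],
          [("FDA", "approved_product_of", "Acme Corp")])
# Traversing a company's neighborhood then surfaces related events and the
# channels through which impact may propagate.
```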

Processing User Requests

Request Analysis

Using LLMs to determine the request type, normalize it from free-text form, and identify key entities (e.g., whether the request pertains to a specific company or to multiple companies).
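
A sketch of request normalization; the request types and JSON schema are illustrative, and call_llm is a hypothetical prompt-to-text helper.

```python
# Sketch: normalize a free-text analyst request into a typed structure that
# downstream steps can act on. Schema and request types are illustrative.
import json

def analyze_request(text: str, call_llm) -> dict:
    prompt = (
        "Classify this analyst request. Reply with JSON keys: "
        '"request_type" (company_news | segment_monitor | comparison | '
        'event_impact), "tickers" (list), "segments" (list), '
        '"time_range" (ISO interval or null).\n'
        f"Request: {text}"
    )
    return json.loads(call_llm(prompt))

# e.g. "What moved mid-cap lithium miners this week?" might normalize to
# {"request_type": "segment_monitor", "tickers": [], "segments": ["lithium mining"], ...}
```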

Analysis Plan Creation

Identify the necessary information, comparisons, and conclusions, and how the result will be presented. The analysis plan is formed based on the request type using LLMs.

Searching Relevant Information and Filling the Context

Generate queries to the knowledge graph based on the analysis plan using LLMs. The LLM is provided with context about the types of entities and relations in the knowledge graph.
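
A sketch of schema-aware query generation; the query language depends on the chosen store, and Cypher appears here purely as an example.

```python
# Sketch: let the LLM translate an analysis-plan step into a graph query,
# given a schema description. Schema and query language are illustrative;
# call_llm is a hypothetical prompt -> text helper.
SCHEMA = """Nodes: Company(ticker), Event(id, date, type), Sector(name)
Edges: (Event)-[:MENTIONS]->(Company), (Company)-[:IN_SECTOR]->(Sector)"""

def plan_step_to_query(step: str, call_llm) -> str:
    return call_llm(
        f"Graph schema:\n{SCHEMA}\n"
        f"Write a single Cypher query for this analysis step:\n{step}"
    )
```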

Information Analysis

Since the volume of information extracted for analysis can exceed the LLM's context size, an iterative analysis process is necessary; this should be accounted for during plan formation. The process can be two-pass: first compress the information to get a broad view, then conduct a detailed analysis of the most significant parts.
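
A sketch of this two-pass, map-reduce-style process; call_llm is a hypothetical prompt-to-text helper.

```python
# Sketch of the two-pass process: compress every document for a broad view,
# rank the summaries, then analyze only the most significant originals.
def two_pass_analysis(documents: list[str], question: str, call_llm,
                      top_k: int = 5) -> str:
    # Pass 1: compress each document so the whole set fits in context.
    summaries = [
        call_llm(f"Summarize in 3 sentences; keep figures and dates:\n{d}")
        for d in documents
    ]
    ranking = call_llm(
        f"Question: {question}\n"
        f"Return the indices of the {top_k} most relevant items, "
        "comma-separated:\n"
        + "\n".join(f"{i}: {s}" for i, s in enumerate(summaries))
    )
    idxs = [int(i) for i in ranking.replace(" ", "").split(",") if i.isdigit()]
    top = [i for i in idxs if i < len(documents)][:top_k]
    # Pass 2: detailed analysis over the originals that matter most.
    detail = "\n\n".join(documents[i] for i in top)
    return call_llm(f"Question: {question}\nAnalyze in detail:\n{detail}")
```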

Generating a User Response

The LLM generates a response for the user based on the analysis results. The response format and content should align with the user's goals identified during request analysis.

Key Results

To meet the objective, we needed to cover approximately 4,000 different tickers traded on NYSE and NASDAQ. The system processes information from over 11,000 unique sources to achieve this.

The primary economic benefit of implementing such a system is the accelerated work of analysts and portfolio managers. The cost of implementing and maintaining this system is significantly lower than analysts' salaries, and its implementation drastically improves their work speed and quality. Unfortunately, we lack sufficient data for a statistically significant study, so we rely solely on subjective assessments here.