Twitter / X

🎯 This data dump is based on my sentiment & stock market volatility project at CMU. The objective is to find features relevant for predictive modeling.

Data Overview

Twitter, now called X, is a fast-paced microblogging platform where users, including retail traders, analysts, journalists, and company executives, post real-time messages (“tweets”) about financial markets, earnings news, product updates, and macroeconomic events. This makes it a valuable source of short-form, high-frequency sentiment signals.

A now-infamous tweet by Elon Musk on May 1, 2020, stating “Tesla stock price is too high imo”, caused Tesla’s stock (TSLA) to drop over 10% intraday, wiping billions off the company’s market cap. It’s a clear example of how social media, especially when used by influential figures, can immediately impact equity prices. Tweets like this highlight the importance of monitoring real-time sentiment and executive commentary as part of any market-aware trading strategy.

During events like the COVID-19 pandemic, Twitter discussions significantly influenced stock price movements, especially for small-cap and high-volatility stocks like AMC and Zoom, as shown in our analysis.

In the context of market prediction, Twitter sentiment captures crowd reactions, rumors, and breaking news, often faster than traditional news outlets report them. In my CMU project, I found that models trained on Twitter data, especially when combined with news headlines, provided the best predictive performance for next-day stock price differences. The platform’s real-time nature, diversity of perspectives, and direct expression of opinion enable trading systems to exploit temporary mispricings before markets fully absorb new information. However, extracting useful insights from Twitter requires specialized NLP tools (e.g., VADER, FinBERT), noise filtering, and models that can handle sarcasm, slang, and contextual shifts.

A typical tweet record includes:

Meta-data: tweet_id, timestamp, user info, location (if available)
Content: raw text, hashtags, ticker mentions, cashtags (e.g., $AAPL)
Engagement: retweets, likes, replies, user follower count
Derived fields: sentiment score, entity recognition (e.g., company names), topics

Data is typically collected using Twitter’s API or third-party aggregators and stored as JSON or parquet for downstream analytics.
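As a rough illustration of the record layout above, here is a minimal sketch of a tweet record with cashtag extraction and JSON serialization. The class name TweetRecord, the helper names, and the cashtag pattern ("$" followed by 1 to 5 letters) are my own assumptions for illustration, not part of Twitter's API.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# Assumed cashtag convention: "$" followed by 1-5 letters (e.g., $AAPL).
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,5})\b")


@dataclass
class TweetRecord:
    """Hypothetical record mirroring the field groups listed above."""
    tweet_id: str
    timestamp: str  # ISO 8601, UTC
    user_id: str
    follower_count: int
    text: str
    retweets: int = 0
    likes: int = 0
    replies: int = 0
    cashtags: list = field(default_factory=list)


def extract_cashtags(text):
    """Return unique upper-cased cashtags in order of first appearance."""
    seen = []
    for match in CASHTAG_RE.finditer(text):
        tag = match.group(1).upper()
        if tag not in seen:
            seen.append(tag)
    return seen


def to_json_line(record):
    """Serialize one record as a single JSON line for downstream storage."""
    return json.dumps(asdict(record), sort_keys=True)
```

Amounts such as "$100" are not matched because the pattern requires letters directly after the dollar sign.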

Latency

Near Real-Time: Twitter data can be accessed seconds after posting, allowing sub-minute granularity for algorithmic trading.

Historical Backfill: Providers offer delayed or complete archives for backtesting.

Low latency matters here because it lets a system capture breaking news or sentiment shifts before markets fully react.

Data Processing Pipeline

This is an overview of what the pipeline could look like as part of a first-draft requirements sheet. Teams should refine based on tech stack and custom needs.

Preprocessing Steps:

Cleaning: Remove spam, deduplicate, and filter irrelevant tweets.
Entity Recognition: Map cashtags/mentions to valid tickers using NLP.
Language Detection & Filtering: Focus on English or relevant market-language tweets.
Sentiment & Emotion Analysis: Use lexicons, ML models, or large language models (LLMs) for rating sentiment (positive/negative/neutral) and extracting emotion factors.
Manual/ML Validation: Outlier and bot detection; remove coordinated manipulation.
Temporal Aggregation: Summarize activity at minute or hour intervals per ticker.
Feature Engineering: Aggregate tweet volume, average sentiment, engagement, etc.
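The cleaning and temporal aggregation steps above could be sketched as follows. This is a simplified illustration, not a production pipeline: it assumes each tweet is a dict with user_id, text, timestamp, tickers, and sentiment keys (hypothetical names), and that the window length in minutes divides 60.

```python
from collections import defaultdict
from datetime import datetime, timezone


def deduplicate(tweets):
    """Keep the first tweet per (user_id, normalized text); drops verbatim reposts."""
    seen = set()
    unique = []
    for t in tweets:
        key = (t["user_id"], " ".join(t["text"].lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def window_start(ts, minutes=10):
    """Floor an ISO 8601 timestamp to the start of its window (minutes must divide 60)."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    floored = dt.replace(minute=dt.minute - dt.minute % minutes, second=0, microsecond=0)
    return floored.strftime("%Y-%m-%dT%H:%M:%SZ")


def aggregate(tweets, minutes=10):
    """Summarize volume, unique users, and mean sentiment per (ticker, window)."""
    buckets = defaultdict(lambda: {"count": 0, "users": set(), "sentiment_sum": 0.0})
    for t in tweets:
        start = window_start(t["timestamp"], minutes)
        for ticker in t["tickers"]:
            b = buckets[(ticker, start)]
            b["count"] += 1
            b["users"].add(t["user_id"])
            b["sentiment_sum"] += t["sentiment"]
    return {
        key: {
            "tweet_volume": b["count"],
            "unique_users": len(b["users"]),
            "average_sentiment": b["sentiment_sum"] / b["count"],
        }
        for key, b in buckets.items()
    }
```

Spam and bot filtering are left out on purpose; they would slot in between deduplicate and aggregate.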

ML/LLM Use Cases:

LLMs: Fine-tuned models can provide context-aware sentiment classification, topic modeling, and anomaly detection.
Clustering: Segment tweets by theme or account type (e.g., financial influencers, news bots, retail traders).

Features for Predictive Modeling

{
  "date": "2025-07-15",
  "ticker": "TSLA",
  "window_start": "2025-07-15T13:50:00Z",
  "window_end": "2025-07-15T14:00:00Z",
  "tweet_volume": 3248,
  "unique_users": 2112,
  "average_sentiment": 0.61,
  "positive_tweet_ratio": 0.74,
  "negative_tweet_ratio": 0.08,
  "neutral_tweet_ratio": 0.18,
  "engagement": {
    "total_likes": 15192,
    "total_retweets": 5640,
    "total_replies": 1748
  },
  "top_topics": ["earnings", "delivery", "regulation"],
  "emotion_distribution": {
    "confidence": 0.20,
    "excitement": 0.17,
    "optimism": 0.11,
    "positive_surprise": 0.09,
    "skepticism": 0.07,
    "curiosity": 0.05,
    "concern": 0.04,
    "doubt": 0.03,
    "confusion": 0.03,
    "frustration": 0.03,
    "negative_surprise": 0.02,
    "neutral": 0.13,
    "acknowledgement": 0.03
  },
  "influencer_tweet_ratio": 0.11,
  "bot_tweet_ratio": 0.21,
  "sentiment_volatility": 0.14,
  "top_influencer_sentiment": 0.72,
  "spike_flag": true,
  "lagged_return_5m": 0.6,
  "lead_return_5m": 1.3
}
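A record like the one above can be sanity-checked before it reaches a model. The sketch below assumes that the three sentiment ratios and the emotion distribution are each meant to sum to 1 (which holds for the example values) and that unique_users cannot exceed tweet_volume; the function name and the tolerance are my own choices.

```python
from datetime import datetime


def parse_utc(ts):
    """Parse an ISO 8601 timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def validate_feature_record(rec, tol=1e-6):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    if parse_utc(rec["window_end"]) <= parse_utc(rec["window_start"]):
        problems.append("window_end must be after window_start")
    if rec["unique_users"] > rec["tweet_volume"]:
        problems.append("unique_users exceeds tweet_volume")
    ratio_sum = (
        rec["positive_tweet_ratio"]
        + rec["negative_tweet_ratio"]
        + rec["neutral_tweet_ratio"]
    )
    if abs(ratio_sum - 1.0) > tol:
        problems.append("sentiment ratio fields do not sum to 1")
    emotion_sum = sum(rec["emotion_distribution"].values())
    if abs(emotion_sum - 1.0) > tol:
        problems.append("emotion_distribution does not sum to 1")
    return problems
```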

Feature Explanation by Group

Meta & Windowing: date, window_start, window_end, ticker
Volume & User Diversity: tweet_volume, unique_users
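Sentiment & Distribution: average_sentiment, positive_tweet_ratio, negative_tweet_ratio, neutral_tweet_ratio
Engagement Metrics: total_likes, total_retweets, total_replies
Topic/Theme Analysis: top_topics
Emotions: emotion_distribution, a detailed per-emotion breakdown (e.g., confidence, concern, positive/negative surprise, excitement)
Source Quality: influencer_tweet_ratio (share from verified/high-follower users), bot_tweet_ratio, top_influencer_sentiment
Volatility/Change: sentiment_volatility (intra-window), spike_flag (anomaly marker)
Return Alignment: lagged_return_5m (return over the 5 minutes before the window) and lead_return_5m (return over the 5 minutes after it) for event study modeling. lead_return_5m is forward-looking, so it can serve only as a prediction target, never as an input feature.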
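To make the sentiment fields concrete, here is one way they could be computed from per-tweet scores in [-1, 1]. This is a sketch under my own assumptions: the 0.05 cutoff for positive/negative follows the convention commonly used with VADER compound scores, and sentiment_volatility is taken to be the population standard deviation of scores within the window. The project may have defined these differently.

```python
from statistics import fmean, pstdev


def sentiment_features(scores, threshold=0.05):
    """Compute window-level sentiment features from per-tweet scores in [-1, 1]."""
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    positive = sum(s > threshold for s in scores)
    negative = sum(s < -threshold for s in scores)
    return {
        "average_sentiment": fmean(scores),
        "positive_tweet_ratio": positive / n,
        "negative_tweet_ratio": negative / n,
        "neutral_tweet_ratio": (n - positive - negative) / n,
        # Assumed definition: dispersion of scores inside the window.
        "sentiment_volatility": pstdev(scores),
    }
```

By construction the three ratios always sum to 1, which makes a cheap consistency check downstream.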

Alpha Hypotheses

News Diffusion Speed: Twitter often surfaces events (earnings leaks, product updates, M&A rumors) before mainstream media, enabling early trade entry.

Sentiment Shocks: Rapid sentiment or volume spikes can predict short-term volatility or drift in returns.
Emotion-Informed Trading: Dominance of emotions like confidence, skepticism, or surprise can reflect market expectations or breaking narrative shifts, offering intraday alpha signals.
Source Signals: Influencer and official account activity correlates with sharp, broad-based price moves.
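The sentiment-shock hypothesis needs some operational definition of a "spike". One simple option, sketched here, is a trailing z-score on per-window tweet volume. The lookback of 12 windows and the threshold of 3 standard deviations are illustrative assumptions, not values from the project.

```python
from statistics import fmean, pstdev


def flag_spikes(volumes, lookback=12, z_threshold=3.0):
    """Flag windows whose volume sits far above the trailing baseline.

    Only the previous `lookback` windows are used, so the flag for window i
    never depends on data from window i or later (no look-ahead).
    """
    flags = []
    for i, volume in enumerate(volumes):
        history = volumes[max(0, i - lookback):i]
        if len(history) < lookback:
            flags.append(False)  # not enough history yet
            continue
        mu = fmean(history)
        sigma = pstdev(history)
        if sigma == 0:
            flags.append(False)  # flat baseline: z-score is undefined
            continue
        flags.append((volume - mu) / sigma > z_threshold)
    return flags
```

Note that a spike inflates the baseline for the following windows, so back-to-back spikes are harder to flag; a median-based baseline would be more robust.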

Risks and Mitigation

Noisy Data & Manipulation: Significant bot activity, spam, and coordinated campaigns can obscure true market signals.
Short-lived Opportunity: Arbitrage windows are narrow; signals decay rapidly as the market absorbs new information.
Mapping Challenges: False positives in ticker mapping (e.g., $CAT vs. actual conversations about cats).
Sample Bias: Not all investor segments use Twitter equally; retail narratives may not reflect institutional flows.
Regulatory/Ethical Concerns: Market manipulation or misinformation can proliferate, raising compliance risks.
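For the mapping challenge, a cheap first-pass mitigation is to require financial context before accepting a ticker that doubles as an everyday word. The sketch below is a heuristic only; the ambiguous-ticker set and the finance vocabulary are small illustrative lists I made up, and a real system would more likely use a trained classifier.

```python
import re

# Illustrative lists only; both would need to be curated for real use.
AMBIGUOUS_TICKERS = {"CAT", "ALL", "NOW"}
FINANCE_TERMS = {
    "stock", "shares", "earnings", "calls", "puts",
    "price", "buy", "buying", "sell", "selling", "dividend",
}


def keep_ticker_mention(ticker, text):
    """Accept unambiguous tickers; ambiguous ones need a finance term in the tweet."""
    if ticker.upper() not in AMBIGUOUS_TICKERS:
        return True
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & FINANCE_TERMS)
```

For example, "My $CAT knocked over the lamp again" is rejected, while "$CAT earnings beat, buying shares" is kept.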
