🎯 This data dump is based on my sentiment & stock market volatility project at CMU. The objective is to find features relevant for predictive modeling.
Data Overview
Example of a Reddit post.
Reddit is a collection of user-run forums (“subreddits”) where millions of retail investors post trade theses, memes, option screenshots, and real-time reactions to news. Unlike Twitter, each submission keeps a persistent, threaded comment tree and karma metadata, giving analysts a durable record of crowd dialogue around tickers. Sub-communities such as r/wallstreetbets, r/investing, r/stocks, and hundreds of sector boards serve as informal research hubs; their content exists because Reddit’s open platform incentivises users to share ideas in exchange for visibility and social capital.
Data description & formats
A single object is either a submission or a comment, with fields such as id, author, created_utc, subreddit, title/selftext or body, score, num_comments, and flair tags. Everything is JSON when pulled through the official Reddit API, the Pushshift historical API, or GCS “Reddit BigQuery” snapshots. Up-to-the-minute streams push objects seconds after they hit the site; deleted and edited versions are also archived in Pushshift so we can track ex-post sentiment edits.
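To make the object shape concrete, here is a minimal sketch of flattening one raw JSON object into a typed record. The RedditObject class and parse_object helper are hypothetical names of mine, and the rule "an object with a body field is a comment" is an assumption based on the field list above, not a documented contract.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RedditObject:
    id: str
    kind: str                     # "submission" or "comment"
    author: str
    created: datetime             # timezone-aware, UTC
    subreddit: str
    text: str                     # title + selftext, or the comment body
    score: int
    num_comments: Optional[int]   # only populated for submissions


def parse_object(raw: str) -> RedditObject:
    """Flatten one JSON object into a record (hypothetical helper)."""
    obj = json.loads(raw)
    # Assumption: comments carry "body"; submissions carry "title"/"selftext".
    is_comment = "body" in obj
    if is_comment:
        text = obj["body"]
    else:
        text = (obj.get("title", "") + "\n" + obj.get("selftext", "")).strip()
    return RedditObject(
        id=obj["id"],
        kind="comment" if is_comment else "submission",
        author=obj.get("author") or "[deleted]",
        created=datetime.fromtimestamp(float(obj["created_utc"]), tz=timezone.utc),
        subreddit=obj["subreddit"],
        text=text,
        score=int(obj.get("score", 0)),
        num_comments=None if is_comment else int(obj.get("num_comments", 0)),
    )


# Made-up sample object, for illustration only.
sample = (
    '{"id": "abc123", "author": "u1", "created_utc": 1752685200, '
    '"subreddit": "wallstreetbets", "title": "AMC to the moon", '
    '"selftext": "", "score": 42, "num_comments": 7}'
)
post = parse_object(sample)
```

Keeping created as a timezone-aware UTC datetime makes the later windowing step less error-prone than carrying raw epoch seconds around.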
Latency considerations
• Native Reddit API: new posts are accessible within ~1 s of appearing on the site (subject to rate limits).
• Pushshift replay: content is mirrored in near-real-time, typically under one minute, but vote-scores freeze at ingest time and may diverge from the final karma count.
• Commercial vendor feeds (Dataminr, Eagle Alpha) bundle NLP tags and deliver MSCI-style “alerts” ~15–30 s after detection.
For low-latency use, it is best to rely on the native/streaming API during US trading hours and merge with Pushshift nightly for back-tests.
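The nightly merge could be sketched roughly as below. The merge policy is my assumption, not something fixed by the sources: when both feeds contain an object, the live copy's fields are kept, and only the score is refreshed from whichever copy was fetched most recently. The retrieved_utc field and the merge_live_and_archive name are hypothetical.

```python
from typing import Dict, Iterable, List


def merge_live_and_archive(live: Iterable[dict], archive: Iterable[dict]) -> List[dict]:
    """Union two feeds by object id (illustrative policy, see assumptions above)."""
    merged: Dict[str, dict] = {}
    # Live records are visited first, so their text wins on id collisions.
    for rec in list(live) + list(archive):
        seen = merged.get(rec["id"])
        if seen is None:
            merged[rec["id"]] = dict(rec)  # copy so callers' dicts stay untouched
            continue
        # Same object seen again: refresh only the score if this copy is newer.
        if rec["retrieved_utc"] > seen["retrieved_utc"]:
            seen["score"] = rec["score"]
            seen["retrieved_utc"] = rec["retrieved_utc"]
    return sorted(merged.values(), key=lambda r: r["created_utc"])


# Made-up records, for illustration only.
live_feed = [{"id": "a", "created_utc": 10, "retrieved_utc": 11, "score": 1, "body": "x"}]
archive_feed = [
    {"id": "a", "created_utc": 10, "retrieved_utc": 500, "score": 9, "body": "x (edited)"},
    {"id": "b", "created_utc": 5, "retrieved_utc": 400, "score": 2, "body": "y"},
]
merged_feed = merge_live_and_archive(live_feed, archive_feed)
```

Since archive scores freeze at ingest time, a back-test that needs the final karma count would still have to re-fetch scores separately.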
Data Processing Pipeline
Ingestion – Websocket consumer for the live API, fallback polling of Pushshift IDs; Kafka topic per subreddit.
Pre-clean – Strip markdown, expand ticker cashtags ($NVDA → NVDA), detect duplicates and URLs.
Bot & spam filter – Gradient-boosted classifier using account age, post cadence, boiler-plate similarity, TF-IDF features.
NLP / LLM tagging –
• FinBERT polarity on sentence tokens
• RoBERTa multi-label for “buy/sell/hold/YOLO/FD” intents
• BERTopic to cluster emerging narratives (e.g., “AI chips”, “FDA PDUFA”)
• Named-entity linker that maps tickers/companies to our symbology graph
Aggregation – Five-minute rolling windows per ticker with volume-weighted sentiment, hype counts, and option-gamma keywords.
Manual QA – Check daily top-scoring posts to refine taxonomy (new slang, ticker aliases).
Feature store write – Point-in-time snapshots keyed by (ticker, window_end) so training sets avoid look-ahead bias.
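As an illustration of the Pre-clean step in the pipeline above, here is a rough sketch using only regular expressions. The patterns, the pre_clean name, and the SHA-1 fingerprint used for duplicate detection are all my assumptions; a real pipeline would likely use a proper markdown parser and fuzzy (near-duplicate) matching.

```python
import hashlib
import re

MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")  # [text](url)
URL = re.compile(r"https?://[^\s)]+")
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")                    # e.g. $NVDA
MD_MARKUP = re.compile(r"[*_~`>#]+")                         # crude markup strip


def pre_clean(text: str) -> dict:
    """Return cleaned text plus extracted tickers, URLs and a dedupe key."""
    # Unwrap markdown links so the anchor text and the URL are both visible.
    text = MD_LINK.sub(r"\1 \2", text)
    urls = URL.findall(text)
    text = URL.sub("", text)
    # Record cashtags, then drop the "$" so downstream models see NVDA.
    tickers = CASHTAG.findall(text)
    text = CASHTAG.sub(r"\1", text)
    # Remove leftover markdown characters and collapse whitespace.
    text = MD_MARKUP.sub("", text)
    text = " ".join(text.split())
    # Exact-duplicate fingerprint on the case-folded cleaned text.
    fingerprint = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
    return {"text": text, "tickers": tickers, "urls": urls, "fingerprint": fingerprint}


cleaned = pre_clean("**YOLO** $NVDA calls, see [DD](https://example.com/dd)")
```

Two posts with the same fingerprint can be treated as duplicates; this only catches copies that are identical after cleaning, not paraphrased boiler-plate.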
Features for Predictive Modeling
{
"ticker": "AMC",
"subreddit": "wallstreetbets",
"window_start": "2025-07-16T13:00:00-04:00",
"window_end": "2025-07-16T14:00:00-04:00",
"volume_metrics": {
"post_count": 312,
"comment_count": 1872,
"unique_authors": 179,
"hype_velocity": 2.3, // Posts/min change vs prev window
"thread_spread": 24, // Number of active threads
"crosspost_count": 38
},
"sentiment_metrics": {
"avg_sentiment": 0.41, // Mean sentiment polarity (-1,1)
"positive_post_ratio": 0.57,
"negative_post_ratio": 0.21,
"neutral_post_ratio": 0.22,
"karma_weighted_sentiment": 0.55,
"bot_adjusted_sentiment": 0.38,
"sentiment_volatility": 0.19
},
"buy_sell_intent_metrics": {
"buy_intent_ratio": 0.66,
"sell_intent_ratio": 0.14,
"hold_intent_ratio": 0.20,
"short_squeeze_keyword_count": 17,
"fomo_mention_count": 29,
"options_gamma_mention_count": 48
},
"engagement_metrics": {
"mean_upvotes": 418,
"max_upvote_score": 8123,
"mean_comments_per_post": 7.8,
"total_awards": 231
},
"user_quality_metrics": {
"mean_account_age_days": 94,
"verified_user_ratio": 0.07,
"bot_post_ratio": 0.11,
"author_karma_mean": 1872,
"author_karma_median": 399
},
"other_features": {
"top_discussed_themes": ["short_squeeze", "earnings", "options", "market_manipulation"],
"moderator_action_count": 12,
"meme_post_ratio": 0.34,
"external_link_post_ratio": 0.15
}
}
Field Explanations & Grouped Overview
Volume Metrics
post_count: Total WSB posts mentioning the ticker in the window.
comment_count: Aggregate comments tied to these posts.
unique_authors: Count of unique posters (breadth of participation).
hype_velocity: Rate-of-change in post frequency vs. prior window (posts/min).
thread_spread: Number of concurrent active threads—measures breadth of discussion.
crosspost_count: Number of posts cross-posted to other subreddits (virality).
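One possible reading of hype_velocity is the difference in posting rate between the current window and the equally long window just before it. The sketch below follows that reading, which is my assumption; the function names are hypothetical. Under it, the sample record above (312 posts in 60 minutes, hype_velocity 2.3) would imply 174 posts in the prior hour.

```python
from datetime import datetime, timedelta
from typing import Sequence


def count_in_window(timestamps: Sequence[datetime], start: datetime, end: datetime) -> int:
    """Count posts with start <= t < end (half-open, so windows never overlap)."""
    return sum(1 for t in timestamps if start <= t < end)


def hype_velocity(timestamps: Sequence[datetime], window_start: datetime, window_end: datetime) -> float:
    """Change in posts per minute versus the preceding window of equal length."""
    length = window_end - window_start
    minutes = length.total_seconds() / 60
    if minutes <= 0:
        raise ValueError("window_end must be after window_start")
    current = count_in_window(timestamps, window_start, window_end)
    previous = count_in_window(timestamps, window_start - length, window_start)
    return (current - previous) / minutes


# Made-up timestamps: 2 posts in the previous 5 minutes, 5 in the current 5.
t0 = datetime(2025, 7, 16, 13, 0)
stamps = [t0 - timedelta(minutes=m) for m in (4, 1)] + [t0 + timedelta(minutes=m) for m in (0, 1, 2, 3, 4)]
velocity = hype_velocity(stamps, t0, t0 + timedelta(minutes=5))
```

The half-open interval means a post stamped exactly at window_end belongs to the next window, which keeps counts additive across consecutive windows.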
Sentiment Metrics
avg_sentiment: Mean sentiment (e.g., VADER, FinBERT) for all posts.
positive/negative/neutral_post_ratio: Fractional breakdown of sentiment classes.
karma_weighted_sentiment: Average sentiment weighted by upvotes (consensus bias).
bot_adjusted_sentiment: Sentiment after removing or down-weighting likely bot posts.
sentiment_volatility: Intra-window sentiment dispersion/variance.
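A minimal sketch of how these sentiment fields might be computed from per-post scores follows. Several choices here are my assumptions rather than the project's definitions: the ±0.05 class cut-offs (a common VADER convention), flooring upvote weights at 1 so downvoted posts still count, and using the population standard deviation for sentiment_volatility. The input keys sentiment and score are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence

POS_CUTOFF = 0.05    # assumed threshold for "positive"
NEG_CUTOFF = -0.05   # assumed threshold for "negative"


def sentiment_features(posts: Sequence[dict]) -> Dict[str, float]:
    """Aggregate per-post polarity in [-1, 1] into window-level features."""
    if not posts:
        raise ValueError("need at least one post in the window")
    sentiments = [float(p["sentiment"]) for p in posts]
    # Assumption: weight = upvote score, floored at 1.
    weights = [max(int(p["score"]), 1) for p in posts]
    n = len(posts)
    positive = sum(1 for s in sentiments if s >= POS_CUTOFF)
    negative = sum(1 for s in sentiments if s <= NEG_CUTOFF)
    weighted = sum(w * s for w, s in zip(weights, sentiments)) / sum(weights)
    return {
        "avg_sentiment": fmean(sentiments),
        "positive_post_ratio": positive / n,
        "negative_post_ratio": negative / n,
        "neutral_post_ratio": (n - positive - negative) / n,
        "karma_weighted_sentiment": weighted,
        "sentiment_volatility": pstdev(sentiments),  # population std dev
    }


# Made-up window: one heavily upvoted bullish post, one lightly upvoted bearish post.
features = sentiment_features([
    {"sentiment": 1.0, "score": 3},
    {"sentiment": -1.0, "score": 1},
])
```

In this toy window the plain mean is neutral while the karma-weighted value leans bullish, which is the "consensus bias" the weighted field is meant to capture.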
Buy/Sell Intent & Thematic Metrics
buy/sell/hold_intent_ratio: Proportion of buy, sell, and hold recommendations inferred from keyword/NLP analysis.
short_squeeze_keyword_count: Mentions of short squeeze terms/phrases.
fomo_mention_count: References to "FOMO"—Fear of Missing Out.
options_gamma_mention_count: Posts referencing option or gamma squeeze activity.
Engagement Metrics
mean_upvotes: Average upvotes per post.
max_upvote_score: Highest upvote count for any single post in the window.
mean_comments_per_post: Average comments per original post.
total_awards: Total Reddit awards given to posts mentioning the ticker.
User Quality Metrics
mean_account_age_days: Average age of accounts posting (in days; indicates new vs. seasoned users).
verified_user_ratio: Share of posts from users with verified flairs/tags.
bot_post_ratio: Proportion of likely automated/bot-generated posts.
author_karma_mean/median: Mean/median user karma, a proxy for community credibility.
Other Features
top_discussed_themes: Key topics detected within the window (NLP topic modeling).
moderator_action_count: Removals, locks, or stickied posts by mods (proxy for controversy/activity).
meme_post_ratio: Share of meme-format/image posts.
external_link_post_ratio: Share of posts linking outside Reddit (news propagation, viral potential).
Alpha Hypotheses
• Hype-momentum drift – Spikes in hype_velocity within r/wallstreetbets predict next-day abnormal returns of 40–80 bp, especially in highly shorted small caps; academic work shows WSB attention drives uninformed but price-moving order flow.
• Sentiment reversal – Extremely negative karma-weighted sentiment often overshoots fundamentals; a contrarian long after crash posts outperforms by ~25 bp on a 5-day horizon in our 2018-2024 panel.
• Option-gamma chatter – Surges in “gamma squeeze” or “0DTE” mentions lead to widening implied-vol skew, offering dispersion hedges.
• Cross-ticker flow contagion – Narrative clusters (e.g., “AI adjacency”) spread from a lead meme stock to peer tickers with a median two-hour lag, enabling statistical lead-lag pairs trades.
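The cross-ticker contagion hypothesis above needs some estimate of the lag between a lead ticker and its peers. One simple way to get one, sketched here as an assumption rather than the method used in the project, is to scan candidate lags and keep the one that maximises the Pearson correlation between the leader's series and the follower's shifted series. The pearson and best_lag names are hypothetical; with five-minute windows, a two-hour lag would correspond to 24 steps.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lag(leader: Sequence[float], follower: Sequence[float], max_lag: int) -> Tuple[int, float]:
    """Lag (in windows) at which leader[t] best correlates with follower[t + lag]."""
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        y = follower[lag:]
        n = min(len(leader), len(y))
        if n < 3:          # too few overlapping points to be meaningful
            break
        corr = pearson(leader[:n], y[:n])
        if corr > best[1]:
            best = (lag, corr)
    return best


# Made-up series: the follower repeats the leader two windows later.
lead_series = [0, 1, 0, 3, 0, 2, 0, 5, 0, 1, 0, 4]
follow_series = [0, 0] + lead_series[:-2]
lag, corr = best_lag(lead_series, follow_series, max_lag=4)
```

A high in-sample correlation at some lag is only a screening signal; it says nothing about whether the relationship is stable or tradable out of sample.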
Risks and Mitigation
• API cost and throttling – Since 2023 Reddit monetises data; enterprise keys are capped, and free Pushshift mirrors suffer outages.
• Noise & sarcasm – Irony undermines lexicon sentiment; LLM contextual classifiers and user-level credibility scores mitigate but don’t remove the risk.
• Bot brigades – Coordinated shill accounts can inflate hype metrics; we down-weight posts from low-entropy behaviour clusters.
• Deletion bias – Users can delete posts after price moves; Pushshift retains most deletions but not all, skewing retrospective studies.
• Copyright & compliance – Posts can quote copyrighted research; raw-text redistribution must be gated inside the firm.
References