Reddit

🎯 This data dump is based on my sentiment & stock market volatility project at CMU. The objective is to find features relevant for predictive modeling.

Data Overview

Example of a Reddit post.

Reddit is a collection of user-run forums (“subreddits”) where millions of retail investors post trade theses, memes, option screenshots, and real-time reactions to news. Unlike Twitter, each submission keeps a persistent, threaded comment tree and karma metadata, giving analysts a durable record of crowd dialogue around tickers (individual comments can still be edited or deleted by their authors). Sub-communities such as r/wallstreetbets, r/investing, r/stocks, and hundreds of sector boards serve as informal research hubs; their content exists because Reddit’s open platform incentivises users to share ideas in exchange for visibility and social capital.

Data description & formats
A single object is either a submission or a comment, with fields like id, author, created_utc, subreddit, title/selftext (submissions) or body (comments), score, num_comments, and flair tags. Everything is JSON when pulled through the official Reddit API, the Pushshift historical API, or Google Cloud “Reddit BigQuery” snapshots. Up-to-the-minute streams deliver objects seconds after they hit the site. Pushshift stores content as it looked at ingest time, so the original text of posts that are later edited or deleted is often still available, which lets us track ex-post sentiment edits.
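
As a rough illustration of the object shape described above, the sketch below normalises one raw JSON object into a small record. It assumes a submission can be recognised by the presence of a title field and a comment by a body field; the RedditObject class and parse_object function are hypothetical names, not part of any Reddit or Pushshift client.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RedditObject:
    """Hypothetical normalised record; field names follow the raw JSON."""
    id: str
    kind: str                    # "submission" or "comment"
    author: str
    created: datetime            # timezone-aware UTC timestamp
    subreddit: str
    text: str                    # title + selftext, or comment body
    score: int
    num_comments: Optional[int]  # None for comments


def parse_object(raw: str) -> RedditObject:
    obj = json.loads(raw)
    # Assumption: only submissions carry a "title" field.
    is_submission = "title" in obj
    if is_submission:
        text = (obj["title"] + "\n" + obj.get("selftext", "")).strip()
    else:
        text = obj["body"]
    return RedditObject(
        id=obj["id"],
        kind="submission" if is_submission else "comment",
        author=obj.get("author") or "[deleted]",
        created=datetime.fromtimestamp(float(obj["created_utc"]), tz=timezone.utc),
        subreddit=obj["subreddit"],
        text=text,
        score=int(obj.get("score", 0)),
        num_comments=obj.get("num_comments") if is_submission else None,
    )
```

Converting created_utc to a timezone-aware datetime at parse time keeps later windowing logic independent of the machine's local timezone.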

Latency considerations
• Native Reddit API: new posts are accessible within ~1 s of appearing on the site (subject to rate limits).
• Pushshift replay: content is mirrored in near-real-time, typically under one minute, but vote scores freeze at ingest time and may diverge from the final karma count.
• Commercial vendor feeds (Dataminr, Eagle Alpha) bundle NLP tags and deliver MSCI-style “alerts” ~15–30 s after detection.


For latency-sensitive use, rely on the native/streaming API during US trading hours and merge with Pushshift nightly for back-tests.

 

Data Processing Pipeline

  • Ingestion – Websocket consumer for the live API, fallback polling of Pushshift IDs; Kafka topic per subreddit.

  • Pre-clean – Strip markdown, expand ticker cashtags ($NVDA → NVDA), detect duplicates and URLs.

  • Bot & spam filter – Gradient-boosted classifier using account age, post cadence, boiler-plate similarity, TF-IDF features.

  • NLP / LLM tagging
    • FinBERT polarity on sentence tokens
    • RoBERTa multi-label for “buy/sell/hold/YOLO/FD” intents
    • BERTopic to cluster emerging narratives (e.g., “AI chips”, “FDA PDUFA”)
    • Named-entity linker that maps tickers/companies to our symbology graph

  • Aggregation – Five-minute rolling windows per ticker with volume-weighted sentiment, hype counts, and option-gamma keywords.

  • Manual QA – Check daily top-scoring posts to refine taxonomy (new slang, ticker aliases).

  • Feature store write – Point-in-time snapshots keyed by (ticker, window_end) so training sets avoid look-ahead bias.
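
The pre-clean and aggregation steps above can be sketched as follows. This is a minimal illustration, not the production pipeline: it assumes cashtags are one to five letters after a dollar sign, uses fixed non-overlapping five-minute buckets rather than true rolling windows, and takes sentiment as an already-computed number. The names expand_cashtags, window_end, and aggregate are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Assumption: a cashtag is "$" followed by 1-5 letters (so "$5" is not a ticker).
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")


def expand_cashtags(text):
    """Return (text with '$NVDA' rewritten to 'NVDA', list of tickers found)."""
    tickers = [t.upper() for t in CASHTAG.findall(text)]
    cleaned = CASHTAG.sub(lambda m: m.group(1).upper(), text)
    return cleaned, tickers


def window_end(ts, minutes=5):
    """End of the fixed window containing ts (start inclusive, end exclusive)."""
    size = minutes * 60
    bucket = int(ts.timestamp()) // size
    return datetime.fromtimestamp((bucket + 1) * size, tz=timezone.utc)


def aggregate(posts, minutes=5):
    """posts: iterable of (created, text, sentiment) tuples.

    Returns {(ticker, window_end): {"post_count", "avg_sentiment"}}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for created, text, sentiment in posts:
        _, tickers = expand_cashtags(text)
        for ticker in set(tickers):  # count each post once per ticker
            cell = sums[(ticker, window_end(created, minutes))]
            cell[0] += sentiment
            cell[1] += 1
    return {
        key: {"post_count": n, "avg_sentiment": total / n}
        for key, (total, n) in sums.items()
    }
```

Keying the output by (ticker, window_end) mirrors the feature-store key: a row only uses posts created before its window_end, which is what keeps a training join point-in-time.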

 

Features for Predictive Modeling

  {
    "ticker": "AMC",
    "subreddit": "wallstreetbets",
    "window_start": "2025-07-16T13:00:00-04:00",
    "window_end": "2025-07-16T14:00:00-04:00",
    "volume_metrics": {
      "post_count": 312,
      "comment_count": 1872,
      "unique_authors": 179,
      "hype_velocity": 2.3, // Posts/min change vs prev window
      "thread_spread": 24, // Number of active threads
      "crosspost_count": 38
    },
    "sentiment_metrics": {
      "avg_sentiment": 0.41, // Mean sentiment polarity (-1,1)
      "positive_post_ratio": 0.57,
      "negative_post_ratio": 0.21,
      "neutral_post_ratio": 0.22,
      "karma_weighted_sentiment": 0.55,
      "bot_adjusted_sentiment": 0.38,
      "sentiment_volatility": 0.19
    },
    "buy_sell_intent_metrics": {
      "buy_intent_ratio": 0.66,
      "sell_intent_ratio": 0.14,
      "hold_intent_ratio": 0.20,
      "short_squeeze_keyword_count": 17,
      "fomo_mention_count": 29,
      "options_gamma_mention_count": 48
    },
    "engagement_metrics": {
      "mean_upvotes": 418,
      "max_upvote_score": 8123,
      "mean_comments_per_post": 7.8,
      "total_awards": 231
    },
    "user_quality_metrics": {
      "mean_account_age_days": 94,
      "verified_user_ratio": 0.07,
      "bot_post_ratio": 0.11,
      "author_karma_mean": 1872,
      "author_karma_median": 399
    },
    "other_features": {
      "top_discussed_themes": ["short_squeeze", "earnings", "options", "market_manipulation"],
      "moderator_action_count": 12,
      "meme_post_ratio": 0.34,
      "external_link_post_ratio": 0.15
    }
  }
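
Before a record like the sample above is written to the feature store, a few cheap consistency checks can catch aggregation bugs. The sketch below is one possible validator under stated assumptions: the three sentiment class ratios are taken to partition the posts (so they sum to 1), and the // comments have been stripped so the record parses as plain JSON. check_feature_record is a hypothetical helper name.

```python
import math
from datetime import datetime


def check_feature_record(rec, tol=1e-6):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # ISO-8601 strings with a UTC offset parse directly.
    start = datetime.fromisoformat(rec["window_start"])
    end = datetime.fromisoformat(rec["window_end"])
    if end <= start:
        problems.append("window_end must be after window_start")

    s = rec["sentiment_metrics"]
    # Assumption: positive/negative/neutral are mutually exclusive classes.
    total = (
        s["positive_post_ratio"]
        + s["negative_post_ratio"]
        + s["neutral_post_ratio"]
    )
    if not math.isclose(total, 1.0, abs_tol=tol):
        problems.append(f"sentiment class ratios sum to {total:.3f}, expected 1")

    if not -1.0 <= s["avg_sentiment"] <= 1.0:
        problems.append("avg_sentiment outside [-1, 1]")

    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue found in a batch of windows.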

Field Explanations & Grouped Overview

Volume Metrics

  • post_count: Total posts in the subreddit (r/wallstreetbets in the example) mentioning the ticker in the window.

  • comment_count: Aggregate comments tied to these posts.

  • unique_authors: Count of unique posters (breadth of participation).

  • hype_velocity: Rate-of-change in post frequency vs. prior window (posts/min).

  • thread_spread: Number of concurrent active threads—measures breadth of discussion.

  • crosspost_count: Number of posts cross-posted to other subreddits (virality).
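
One way to read the volume definitions above is sketched below. It assumes hype_velocity is the change in post count relative to the prior window divided by the window length in minutes, which is only one plausible interpretation of "rate-of-change in post frequency". The posts are assumed to be dicts with author and num_comments keys, and volume_metrics is a hypothetical helper.

```python
def hype_velocity(post_count, prev_post_count, window_minutes):
    """Change in posting rate vs. the prior window, in posts per minute."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return (post_count - prev_post_count) / window_minutes


def volume_metrics(posts, prev_post_count, window_minutes):
    """posts: list of dicts with 'author' and 'num_comments' (assumed keys)."""
    return {
        "post_count": len(posts),
        "comment_count": sum(p["num_comments"] for p in posts),
        "unique_authors": len({p["author"] for p in posts}),
        "hype_velocity": hype_velocity(len(posts), prev_post_count, window_minutes),
    }
```

A negative hype_velocity simply means posting slowed down relative to the previous window.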

Sentiment Metrics

  • avg_sentiment: Mean sentiment polarity in (-1, 1), scored by a model such as VADER or FinBERT, across all posts in the window.

  • positive/negative/neutral_post_ratio: Fractional breakdown of sentiment classes.

  • karma_weighted_sentiment: Average sentiment weighted by upvotes (consensus bias).

  • bot_adjusted_sentiment: Sentiment after removing or down-weighting likely bot posts.

  • sentiment_volatility: Intra-window sentiment dispersion/variance.
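
The three numeric sentiment aggregates above could be computed along these lines. This is a sketch under assumptions the text does not pin down: negative post scores are clipped to zero before weighting, the karma-weighted value falls back to the plain mean when no post has a positive score, and volatility is the population standard deviation of per-post polarity. sentiment_metrics is a hypothetical function name.

```python
from statistics import fmean, pstdev


def sentiment_metrics(sentiments, scores):
    """sentiments: per-post polarity in [-1, 1]; scores: per-post upvote score."""
    if not sentiments or len(sentiments) != len(scores):
        raise ValueError("need equal-length, non-empty inputs")

    avg = fmean(sentiments)

    # Assumption: downvoted posts (score < 0) get zero weight.
    weights = [max(s, 0) for s in scores]
    total_weight = sum(weights)
    if total_weight:
        weighted = sum(w * x for w, x in zip(weights, sentiments)) / total_weight
    else:
        weighted = avg  # no positive scores, so fall back to the plain mean

    return {
        "avg_sentiment": avg,
        "karma_weighted_sentiment": weighted,
        "sentiment_volatility": pstdev(sentiments),
    }
```

A large gap between karma_weighted_sentiment and avg_sentiment indicates that the most upvoted posts lean differently from the typical post, which is the "consensus bias" the field is meant to capture.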

Buy/Sell Intent & Thematic Metrics

  • buy/sell/hold_intent_ratio: Proportion of buy, sell, and hold recommendations inferred from keyword/NLP analysis.

  • short_squeeze_keyword_count: Mentions of short squeeze terms/phrases.

  • fomo_mention_count: References to "FOMO"—Fear of Missing Out.

  • options_gamma_mention_count: Posts referencing option or gamma squeeze activity.
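
For the keyword-based counts above, a simple regex pass is often enough as a baseline. The phrase lists below are illustrative assumptions, not the taxonomy used in the project, and the sketch counts every mention (a per-post count would instead add at most one per text). keyword_counts and PATTERNS are hypothetical names.

```python
import re

# Illustrative phrase lists (assumptions); longer phrases come first so that
# "gamma squeeze" is matched as one mention rather than as a bare "gamma".
PATTERNS = {
    "short_squeeze_keyword_count": re.compile(
        r"\b(?:short squeeze|short interest)\b", re.IGNORECASE
    ),
    "fomo_mention_count": re.compile(r"\bfomo\b", re.IGNORECASE),
    "options_gamma_mention_count": re.compile(
        r"\b(?:gamma squeeze|gamma|0dte)\b", re.IGNORECASE
    ),
}


def keyword_counts(texts):
    """Count non-overlapping keyword mentions across all texts."""
    counts = dict.fromkeys(PATTERNS, 0)
    for text in texts:
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(text))
    return counts
```

The manual QA step in the pipeline is where lists like these would be extended with new slang as it appears.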

Engagement Metrics

  • mean_upvotes: Average upvotes per post.

  • max_upvote_score: Highest upvote count for any single post in the window.

  • mean_comments_per_post: Average comments per original post.

  • total_awards: Total Reddit awards given to posts mentioning the ticker.

User Quality Metrics

  • mean_account_age_days: Average age of accounts posting (in days; indicates new vs. seasoned users).

  • verified_user_ratio: Share of posts from users with verified flairs/tags.

  • bot_post_ratio: Proportion of likely automated/bot-generated posts.

  • author_karma_mean/median: Mean/median user karma, a proxy for community credibility.

Other Features

  • top_discussed_themes: Key topics detected within the window (NLP topic modeling).

  • moderator_action_count: Count of removals, locks, and stickied posts by moderators (proxy for controversy/activity).

  • meme_post_ratio: Share of meme-format/image posts.

  • external_link_post_ratio: Share of posts linking outside Reddit (news propagation, viral potential).

 

Alpha Hypotheses

Hype-momentum drift – Spikes in hype_velocity within r/wallstreetbets predict next-day abnormal returns of 40–80 bp, especially in highly-shorted small caps; some academic work suggests that WSB attention drives uninformed but price-moving order flow.
Sentiment reversal – Extremely negative karma-weighted sentiment often overshoots fundamentals; a contrarian long after crash posts outperforms by ~25 bp on a 5-day horizon in our 2018-2024 panel.
Option-gamma chatter – Surges in “gamma squeeze” or “0DTE” mentions lead to widening implied-vol skew, offering dispersion hedges.
Cross-ticker flow contagion – Narrative clusters (e.g., “AI adjacency”) spread from a lead meme stock to peer tickers with a median two-hour lag, enabling statistical lead-lag pairs trades.
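
To test the cross-ticker contagion hypothesis, a first pass could scan lagged correlations between the mention counts of a lead ticker and a peer. The sketch below is a plain-Python illustration, assuming both series are aligned on the same windows and have equal length; best_lag and pearson are hypothetical helpers, and a real study would also need significance testing and out-of-sample checks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Lag (in windows) at which leader[t] best correlates with follower[t + lag]."""
    if len(leader) != len(follower):
        raise ValueError("series must be aligned and of equal length")
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        x = leader[: len(leader) - lag]
        y = follower[lag:]
        if len(x) < 3:  # too few overlapping points to be meaningful
            break
        r = pearson(x, y)
        if r > best[1]:
            best = (lag, r)
    return best  # (lag, correlation)
```

With five-minute windows, a two-hour lag corresponds to 24 windows, so max_lag would need to be at least that large to detect the effect described above.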

 

Risks and Mitigation

API cost and throttling – Since 2023 Reddit has charged for API access; enterprise keys are rate-capped, and free Pushshift mirrors suffer outages.
Noise & sarcasm – Irony undermines lexicon sentiment; LLM contextual classifiers and user-level credibility scores mitigate but don’t remove the risk.
Bot brigades – Coordinated shill accounts can inflate hype metrics; we down-weight posts from low-entropy behaviour clusters.
Deletion bias – Users can delete posts after price moves; Pushshift retains most deletions but not all, skewing retrospective studies.
Copyright and redistribution – Posts can quote copyrighted research; raw text redistribution must be gated inside the firm.
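
As one possible reading of the "low-entropy behaviour" down-weighting mentioned under bot brigades, the sketch below scores an account by the Shannon entropy of some categorical behaviour, for example the hour of day of its posts. The choice of behaviour, the normalisation, and the names behaviour_entropy and credibility_weight are all assumptions for illustration; accounts with very few posts also score low, so a minimum-activity threshold would be needed in practice.

```python
from collections import Counter
from math import log2


def behaviour_entropy(events):
    """Shannon entropy (bits) of the empirical distribution of events."""
    counts = Counter(events)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return max(0.0, h)  # avoid returning -0.0 for a single category


def credibility_weight(events, n_categories):
    """Entropy normalised to [0, 1] by the maximum for n_categories."""
    if n_categories < 2:
        return 0.0
    return min(1.0, behaviour_entropy(events) / log2(n_categories))
```

An account that always posts in the same hour gets weight 0, while one whose posts are spread evenly over all 24 hours gets weight 1; multiplying per-post hype or sentiment contributions by this weight is one simple way to apply the down-weighting.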

 