News on Webull

🎯 This data dump is based on my sentiment & stock market volatility project at CMU. The objective is to find features relevant for predictive modeling.

Data Overview

Webull collates breaking headlines, analyst notes and wire stories for every listed ticker so that its 20 million retail users can react inside the trading app without opening a browser. Each news card is a mini-record (publisher, timestamp, blurb, URL, image) served in real time both on the site (e.g., /newslist/nasdaq-zm) and through its market-news push socket.

What the raw objects look like
Every item arrives as JSON with keys such as newsId, symbol, title, sourceName, publishTime, summary, url, imageUrl, tags and hasVideo. Attachments sometimes carry videoUrl, or the full text if Webull has redistribution rights. A single day for an actively covered megacap runs to 500–800 records (≈ 3 MB compressed); a year of the entire Russell 3000 exceeds 100 GB.
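To make the record shape concrete, here is a minimal sketch of parsing one item into a typed Python record. The key names come from the list above; the NewsItem class, the parse_news_item() helper and the assumption that publishTime is epoch milliseconds in UTC are illustrative only and should be checked against real payloads.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class NewsItem:
    """Typed view of one raw news object (hypothetical helper class)."""
    news_id: str
    symbol: str
    title: str
    source_name: str
    publish_time: datetime
    summary: str = ""
    url: str = ""
    image_url: str = ""
    tags: List[str] = field(default_factory=list)
    has_video: bool = False
    video_url: Optional[str] = None


def parse_news_item(raw: str) -> NewsItem:
    """Parse one JSON payload; optional keys fall back to defaults."""
    d = json.loads(raw)
    # Assumption: publishTime is epoch milliseconds in UTC.
    ts = datetime.fromtimestamp(d["publishTime"] / 1000, tz=timezone.utc)
    return NewsItem(
        news_id=str(d["newsId"]),
        symbol=d["symbol"],
        title=d["title"],
        source_name=d["sourceName"],
        publish_time=ts,
        summary=d.get("summary", ""),
        url=d.get("url", ""),
        image_url=d.get("imageUrl", ""),
        tags=list(d.get("tags", [])),
        has_video=bool(d.get("hasVideo", False)),
        video_url=d.get("videoUrl"),
    )
```

Keeping the required keys strict (a missing newsId or title raises KeyError) while defaulting the optional ones makes malformed items fail early at ingest rather than downstream.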

Latency profile
• Web interface scrape: headlines surface on the page within 1–3 s of wire release; DOM diff polling reaches us in ~5 s.
• Official push socket: usquotes-api.webullfintech.com broadcasts the same payload with sub-second lag once authenticated via the open-API keys.
• Vendor mirrors: third-party redistributors (Dataminr, Benzinga) forward annotated Webull items in 10–30 s.

Data Processing Pipeline

This is an overview of what the pipeline could look like as part of a first-draft requirements sheet. Teams should refine based on tech stack and custom needs.

Ingest – for my webscraping project this is what I did:
- A headless Chrome worker (the scrape_webull_headlines() routine in the screenshot) back-fills each ticker. It launches a maximised Chromium session, loads https://www.webull.com/newslist/nasdaq-<TICKER>, and repeatedly clicks the “More” button (//div[@class='csr19'][text()='More']). After every click it scrapes two parallel node lists, .csr14 for the headline text and .csr15 for the timestamp, then trims markup, splits the MM·DD·YY string, and stops once it reaches a user-defined start-phrase / end-phrase or a date older than date_tracker. The loop builds two Python lists (scraped_headers, scraped_dates), converts them to a pandas DataFrame, and writes the slice to CSV before publishing the records to a Kafka topic named webull.news.<ticker>. This crawler runs only for historical back-fills or when the primary Webull market-news WebSocket misses packets; otherwise, real-time headlines continue to stream through the push socket and are sharded into Kafka by ticker for downstream dedup & enrichment.
Dedup & normalise – Hash on title+publisher; collapse repeats across wires.
Ticker tagging – Regex for $AAPL, “Apple Inc.”, and knowledge-graph aliases; multi-ticker stories explode into one row per symbol.
NLP/LLM enrichment –
– FinBERT sentiment on headline and summary.
– Context classifier (“earnings-beat”, “options-flow”, “downgrade”) via domain-tuned RoBERTa.
– Hype score = sentiment × publisher-credibility × upvote momentum (if story propagates to Webull community).
Join with price tape – Point-in-time snapshot keyed on (ticker, eventTime) and written to the feature store; retained for four hours in Redis for execution micro-services and indefinitely in Iceberg for back-tests.
Manual QA – Daily spot review of top-scoring outliers to update stop-word and sarcasm lists.
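The dedup and ticker-tagging steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's actual code: the ALIASES table is a hypothetical stand-in for the knowledge-graph lookup, the cashtag pattern assumes one to five capital letters, and records are plain dicts using the raw key names title and sourceName.

```python
import hashlib
import re

# Hypothetical alias table standing in for the knowledge-graph lookup.
ALIASES = {"apple inc.": "AAPL", "zoom": "ZM"}
# Assumption: cashtags are "$" followed by 1 to 5 capital letters.
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")


def dedup_key(title: str, publisher: str) -> str:
    """Hash on normalised title + publisher, as in the dedup step."""
    norm = " ".join(title.lower().split()) + "|" + publisher.strip().lower()
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()


def dedup(records):
    """Keep the first record for each (title, publisher) hash."""
    seen = set()
    unique = []
    for rec in records:
        key = dedup_key(rec["title"], rec["sourceName"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def tag_tickers(title: str):
    """Collect tickers from cashtags and from alias substrings."""
    found = set(CASHTAG.findall(title))
    lowered = title.lower()
    for alias, ticker in ALIASES.items():
        if alias in lowered:
            found.add(ticker)
    return sorted(found)


def explode_by_ticker(record):
    """Emit one copy of the record per tagged symbol."""
    return [{**record, "symbol": t} for t in tag_tickers(record["title"])]
```

Normalising case and whitespace before hashing catches trivially re-formatted repeats; substring alias matching is deliberately naive and would need word-boundary handling before production use.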

Features for Predictive Modeling

{
"ticker": "ZM",
"event_time": "2025-07-16T09:32:07-04:00",
"headline_content": {
"headline_text": "Zoom signals strong growth, updates guidance higher",
"headline_length_tokens": 12,
"headline_length_chars": 54,
"headline_sentiment": 0.42,
"sentiment_label": "positive",
"headline_subjectivity": 0.28,
"headline_emb_vector": [0.013, -0.025, ...], // Text model embedding vector (truncated for display)
"volatility_keywords_flag": 0,
"macro_keywords_flag": 0,
"earnings_keywords_flag": 1
},
"event_annotations": {
"story_type": "guidance_update",
"topic_label": ["guidance", "earnings", "growth"],
"is_breaking_news": true,
"is_retraction": false,
"urgency_score": 0.87, // LLM-driven or publisher supplied
"similar_headline_count_5m": 3
},
"publisher_metrics": {
"publisher": "Reuters",
"publisher_rank": 0.93,
"publisher_type": "major_wire",
"historical_impact_score": 0.64,
"is_verified_source": true
},
"temporal_context": {
"after_hours_flag": false,
"market_session": "open",
"headline_to_first_trade_lag_sec": 44, // Time from headline to first high-volume trade (if available)
"headline_to_sentiment_spike_lag_sec": 29
},
"behavioral_signals": {
"hype_score": 78.6,
"headline_engagement_rate": 0.16, // Clicks/views per audience baseline if measured
"social_spread_score": 0.31, // Max share/retweet/Reddit thread ratio post-publication
"reddit_reaction_delay_sec": 37,
"twitter_reaction_delay_sec": 41
},
"interheadline_relations": {
"duplicate_headline_group_id": "grp_748926",
"similar_headline_count_60m": 12,
"first_appearance_time": "2025-07-16T09:31:58-04:00",
"is_sentiment_reversal": false // Headline inverts recent news sentiment trend
}
}

Field Explanations & Groups

Headline Content (headline_content)

headline_text: Raw news headline string.
headline_length_tokens/headline_length_chars: Token/character length; proxies for information density.
headline_sentiment: Polarity score (e.g., FinBERT, VADER, -1 to +1).
sentiment_label: Qualitative label (positive/negative/neutral).
headline_subjectivity: Degree of subjectivity (0–1, higher = more opinion/less fact) [4, 6].
headline_emb_vector: Precomputed text embedding for use with deep learning models [5].
volatility_keywords_flag: 1 if headline references "volatility", "panic", etc.
macro_keywords_flag: 1 if referencing macro events (Fed, inflation, etc.).
earnings_keywords_flag: 1 if referencing earnings, guidance, sales.
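As a rough sketch of how the length and keyword-flag fields could be derived, the function below uses a simple regex word split and small keyword sets. Both are assumptions: the keyword sets contain only the examples named above, and a model tokenizer (e.g., the one used for embeddings) would generally give a different token count than this word-level split.

```python
import re

# Illustrative keyword sets built from the examples in the field list;
# a real deployment would maintain longer, curated lists.
VOLATILITY_WORDS = {"volatility", "panic"}
MACRO_WORDS = {"fed", "inflation"}
EARNINGS_WORDS = {"earnings", "guidance", "sales"}


def headline_features(text: str) -> dict:
    """Compute length proxies and keyword flags for one headline."""
    # Assumption: word-level tokens; not the embedding model's tokenizer.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    vocab = set(tokens)
    return {
        "headline_text": text,
        "headline_length_tokens": len(tokens),
        "headline_length_chars": len(text),
        "volatility_keywords_flag": int(bool(vocab & VOLATILITY_WORDS)),
        "macro_keywords_flag": int(bool(vocab & MACRO_WORDS)),
        "earnings_keywords_flag": int(bool(vocab & EARNINGS_WORDS)),
    }
```

Matching on whole tokens rather than substrings avoids false hits such as "fed" inside "federal", at the cost of missing inflected forms unless they are added to the sets.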

Event Annotations (event_annotations)

story_type: Categorical event class (guidance_update, earning_report, mna, regulatory, etc.).
topic_label: NLP-modeled topics for headline context.
is_breaking_news: Flag if major event or "flash headline".
is_retraction: True if news is retracted/corrected after release (affects market trust).
urgency_score: Source/ML-computed relevance or urgency (0–1).
similar_headline_count_5m: Peer news stories in the last 5 minutes (“news burst” signaling) [1, 2, 7].

Publisher Metrics (publisher_metrics)

publisher: Outlet name.
publisher_rank: Source credibility/impact (0–1, e.g. based on historical effect on asset price).
publisher_type: "major_wire", "financial_blog", "local_press".
historical_impact_score: Empirical alpha correlation of this source [1, 2].
is_verified_source: Boolean; true for rigorously fact-checked or regulated outlets.

Temporal Context (temporal_context)

after_hours_flag: Whether headline published outside standard market hours.
market_session: "pre", "open", "after", "close".
headline_to_first_trade_lag_sec: Time from publication to first notable stock trade.
headline_to_sentiment_spike_lag_sec: Latency to abnormal sentiment shift (Twitter/Reddit/other).
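One possible way to derive market_session and after_hours_flag from a timestamp is sketched below. The session boundaries are assumptions (pre-market 04:00–09:30, regular 09:30–16:00, after-hours 16:00–20:00, all Eastern Time), weekends map to "close", and exchange holidays and half-days are ignored.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def market_session(ts: datetime) -> str:
    """Map a timezone-aware timestamp to "pre", "open", "after" or "close"."""
    local = ts.astimezone(NY)
    if local.weekday() >= 5:  # Saturday or Sunday
        return "close"
    t = local.time()
    # Assumed session boundaries in Eastern Time; holidays are not handled.
    if time(4, 0) <= t < time(9, 30):
        return "pre"
    if time(9, 30) <= t < time(16, 0):
        return "open"
    if time(16, 0) <= t < time(20, 0):
        return "after"
    return "close"


def after_hours_flag(ts: datetime) -> bool:
    """True when the headline falls outside the regular session."""
    return market_session(ts) != "open"
```

Converting to America/New_York before comparing keeps the logic correct across daylight-saving changes and for feeds that deliver timestamps in UTC.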

Behavioral Signals (behavioral_signals)

hype_score: Aggregated measure from social/other platforms; strong attention signal.
headline_engagement_rate: Clicks, shares, or views normalized by audience.
social_spread_score: Normalized spread score across social platforms post-publication [1].
reddit_reaction_delay_sec / twitter_reaction_delay_sec: Time until the headline triggers the first reaction post on each platform [1].

Interheadline Relations (interheadline_relations)

duplicate_headline_group_id: ID for clustered headline event (multi-source, deduped).
similar_headline_count_60m: Total similar headlines in the last hour (intensity of news burst).
first_appearance_time: Timestamp of first appearance for this headline cluster.
is_sentiment_reversal: Headline direction opposes the recent rolling sentiment trend.
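A minimal sketch of two of these relational fields follows. It assumes each cluster's timestamps are kept sorted, that the count excludes the current headline, and that "recent rolling sentiment trend" means the mean of recent scores; the min_abs threshold is a hypothetical guard against treating near-neutral scores as a direction.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def similar_headline_count(cluster_times, event_time, window):
    """Count cluster headlines in [event_time - window, event_time).

    cluster_times must be sorted ascending; the event itself is excluded.
    """
    lo = bisect_left(cluster_times, event_time - window)
    hi = bisect_left(cluster_times, event_time)
    return hi - lo


def is_sentiment_reversal(recent_scores, new_score, min_abs=0.1):
    """True when the new score opposes the mean of recent scores.

    min_abs is an assumed threshold below which a score has no direction.
    """
    if not recent_scores:
        return False
    trend = sum(recent_scores) / len(recent_scores)
    if abs(trend) < min_abs or abs(new_score) < min_abs:
        return False
    return trend * new_score < 0
```

With sorted timestamps, first_appearance_time is simply cluster_times[0], and the same counting function serves both the 5-minute and 60-minute variants by changing window.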

Alpha Hypotheses

• Sentiment-momentum: Positive FinBERT polarity coupled with ≥ +2 σ hype_score foreshadows a 20–40 bp intraday drift in the direction of sentiment, strongest in mid-caps with high retail ownership.
• Source-tier dispersion: Disagreement between Tier-1 (Reuters, Bloomberg) and retail-blog publishers creates mean-reversion opportunities when retail mood over-reacts.
• Temporal clustering: Bursts of ≥ 5 Webull items within two minutes often precede option-flow spikes; capturing the first item lets us anticipate IV smile changes.
• After-hours asymmetry: Headlines posted 16:00–20:00 ET correlate with wider next-day open-close ranges, enabling overnight straddles.
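The temporal-clustering hypothesis needs a burst detector before it can be tested. The sketch below flags the first item of every run in which at least five items fall within two minutes, using a sliding window over sorted timestamps; the function name and the choice to report one start time per contiguous burst are assumptions.

```python
from datetime import datetime, timedelta


def find_bursts(times, min_items=5, window=timedelta(minutes=2)):
    """Return the first timestamp of each burst of >= min_items within window."""
    times = sorted(times)
    bursts = []
    start = 0
    in_burst = False
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= min_items:
            if not in_burst:
                bursts.append(times[start])
                in_burst = True
        else:
            in_burst = False
    return bursts
```

In a back-test, each returned timestamp would be joined to the option-flow and implied-volatility series to check whether the burst actually leads the move; a live version would emit the signal as soon as the fifth item arrives rather than after the fact.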

Risks and Mitigation

• HTML scrape stability – Site redesigns break selectors; keep the official push socket as primary.
• Duplicate headlines – Same wire story syndicated under multiple publishers inflates sentiment—dedupe rigorously.
• Pay-wall redirects – URLs sometimes 302 to subscription pages; headline text may omit negative nuance—use secondary crawls if sentiment weight > threshold.
• Publisher bias – Low-credibility blogs exaggerate price targets; maintain a decay-weighted credibility score.
• Rate-limit & TOS – Scraping the site outside the API violates Webull’s service terms and can get IPs throttled; prioritise authenticated API traffic.
• Alpha decay – As retail desks adopt similar headline-sentiment feeds, expect compression; focus on micro-structure interactions (e.g., option IV lead) rather than raw direction.
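For the publisher-bias mitigation, one simple reading of a "decay-weighted credibility score" is an exponentially weighted average of per-story outcomes. The sketch below assumes each story is scored in [0, 1] after the fact (for example, whether its claim held up) and that older stories lose half their weight every half_life_events stories; both the scoring rule and the half-life are hypothetical.

```python
def update_credibility(prev: float, outcome: float,
                       half_life_events: float = 20.0) -> float:
    """Exponentially decayed credibility update.

    prev and outcome are in [0, 1]; the weight of past outcomes halves
    every `half_life_events` updates (an assumed parameter).
    """
    alpha = 1.0 - 0.5 ** (1.0 / half_life_events)
    return (1.0 - alpha) * prev + alpha * outcome
```

A publisher that starts at 0.8 and then produces twenty stories scored 0 in a row would drift to about 0.4, so a long track record is discounted gradually rather than erased by one bad call.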
