Company Press Release
🎯 This data dump is based on my autonomous vehicle research with Ford at UC San Diego. I was working as a research assistant on human-driver AI collaboration but was also interested in learning about how to use company press releases post-product launch for predictive modeling.
Data Overview
Unlike mandatory SEC filings, a press release is a voluntary, issuer-authored news bulletin, pushed simultaneously to Business Wire/PR Newswire, e-mail lists, and the firm’s IR page, to shape the narrative around product launches, earnings pre-announcements, M&A, or C-suite changes.
Because management controls tone and timing, releases often precede, or embellish, the corresponding 8-K items. One academic survey finds that firms issue a standalone press release for 37% of events that also require an 8-K, and that half of those releases go out before the regulatory filing, giving a tradable head start to anyone who ingests them instantly. Event-study work shows that positive-tone releases generate statistically significant cumulative abnormal returns, especially for small caps, and recent research links sentiment in earnings releases to both next-day volatility and 20-day drift.
Relevance for predictive modeling
Processed within sub-second latency and cross-validated against 8-K content, company press releases supply a high-frequency, management-voiced layer of information that complements regulatory filings and traditional news.
Raw structure and latency
A wire payload arrives as JSON or RSS with fields: headline, subhead, dateline, publish_time, body_html, tickers, industry_tags, and contact_info. The HTML page (see Ford’s hands-free-driving example) follows the same pattern: title, bullet highlights, management quotes, boilerplate safe-harbor, IR contacts, and publishing timestamp.
Business Wire guarantees < 300 ms delivery after issuer release; EDGAR can lag minutes for the companion 8-K. Treat press-release RSS/WebSocket feeds as “T-0” news and regulatory filings as confirmation.
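A minimal sketch of how a JSON payload with the fields above might be parsed into a typed record. The field types (ISO 8601 publish_time, list-valued tickers and industry_tags, plain-string contact_info) and the names WirePayload and parse_payload are assumptions for illustration, not a documented wire schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WirePayload:
    """Hypothetical typed view of one wire payload (field types are assumed)."""
    headline: str
    subhead: str
    dateline: str
    publish_time: datetime
    body_html: str
    tickers: list[str]
    industry_tags: list[str]
    contact_info: str


def parse_payload(raw: str) -> WirePayload:
    """Parse a JSON payload; only headline and publish_time are treated as required."""
    data = json.loads(raw)
    return WirePayload(
        headline=data["headline"].strip(),
        subhead=data.get("subhead", ""),
        dateline=data.get("dateline", ""),
        # Assumes an ISO 8601 timestamp with a UTC offset.
        publish_time=datetime.fromisoformat(data["publish_time"]),
        body_html=data.get("body_html", ""),
        tickers=[t.upper() for t in data.get("tickers", [])],
        industry_tags=list(data.get("industry_tags", [])),
        contact_info=data.get("contact_info", ""),
    )
```

Failing fast on a missing headline or publish_time keeps malformed payloads out of the downstream de-dup and join steps, which both depend on those two fields.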
Data Processing Pipeline
This is an overview of what the pipeline could look like as part of a first-draft requirements sheet. Teams should refine based on tech stack and custom needs.
Ingest – subscribe to Business Wire/PR Newswire tick-stream plus hourly crawls of IR pages; write raw JSON blobs to Kafka partitioned by ticker.
HTML clean & de-dup – strip tags, normalise Unicode; hash (headline, publish_time) to collapse multiple syndications.
Ticker & entity tagging – regex for $AAPL / canonical names; cross-check against a ticker-brand knowledge graph. Multi-company releases explode into one record per equity.
NLP/LLM enrichment – run FinBERT sentiment on headline & bullet list; classify release type (earnings-update, product, governance, litigation, ESG) with a RoBERTa model; compute novelty as cosine distance vs previous 90-day release corpus.
Point-in-time join – attach last trade, bid-ask, option-IV snapshot; write (ticker, event_time) features to Redis (4-h TTL) for execution engines and Iceberg for history.
Manual QA – daily review of top-scoring bullish/bearish outliers to refine safe-harbor filters and sarcasm edge-cases; verify headline/date mismatches.
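The de-dup and ticker-tagging steps can be sketched with the standard library alone. The helper names (dedup_key, extract_cashtags, explode_by_ticker), the choice of SHA-256, and the one-to-five-letter cashtag pattern are assumptions; the sketch also assumes syndicated copies carry the issuer's original publish_time string unchanged.

```python
import hashlib
import re
import unicodedata

# Assumed cashtag shape: "$" followed by 1-5 uppercase letters (e.g. $AAPL).
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")


def dedup_key(headline: str, publish_time: str) -> str:
    """Hash a normalised headline plus timestamp so syndicated copies collide."""
    norm = unicodedata.normalize("NFKC", headline)
    norm = " ".join(norm.split()).casefold()  # collapse whitespace, ignore case
    return hashlib.sha256(f"{norm}|{publish_time}".encode("utf-8")).hexdigest()


def extract_cashtags(text: str) -> list[str]:
    """Return unique cashtag symbols in order of first appearance."""
    return list(dict.fromkeys(CASHTAG.findall(text)))


def explode_by_ticker(record: dict, tickers: list[str]) -> list[dict]:
    """Emit one copy of the record per equity for multi-company releases."""
    return [{**record, "ticker": t} for t in tickers]
```

In practice the cashtag matches would still need the knowledge-graph cross-check described above, since a bare regex cannot tell a ticker from an acronym.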
Features for Predictive Modeling
{
  "ticker": "F",
  "press_release_id": "F-2020-06-18a", // Unique release identifier
  "event_time": "2020-06-18T10:30:02-04:00",
  "release_metadata": {
    "headline": "FORD CO-PILOT360 TECHNOLOGY ADDS HANDS-FREE DRIVING, NEW FEATURES TO MAKE DRIVING EASIER AND MORE ENJOYABLE",
    "release_url": "https://media.ford.com/content/fordmedia/fna/ca/en/news/2020/06/18/ford-co-pilot360-technology-adds-hands-free-driving.html",
    "release_type": "product_launch", // product_launch, earnings, guidance, mna, regulatory, ESG, recall, etc.
    "distribution_channel": ["corporate_site", "public_wire"],
    "after_hours_flag": false,
    "embargo_flag": false,
    "word_count": 653,
    "reading_time_minutes": 3.2
  },
  "nlp_features": {
    "headline_sentiment": 0.47, // -1 (negative) to 1 (positive)
    "body_sentiment": 0.32,
    "sentiment_confidence": 0.91, // LLM or model confidence in sentiment detection
    "novelty_score": 0.71, // Relative to trailing 90-day corpus
    "subjectivity_score": 0.15, // 0 = fact, 1 = highly subjective/opinion
    "forward_guidance_flag": true, // If language implies future outlook/guidance
    "event_keywords": ["ADAS", "hands-free", "driver assistance", "Mustang Mach-E"]
  },
  "structure_and_content": {
    "quotes_count": 5, // Count of attributed quotations/speaker turns
    "c_level_quote_count": 2, // CEO/CFO/President named
    "safe_harbor_ratio": 0.08, // Fraction of text that is boilerplate/risk language
    "body_bullet_presence": true, // If key facts appear as bullet points
    "mentions_competitors": ["GM", "Tesla"], // List of directly mentioned competitors
    "product_names_mentioned": ["Ford Co-Pilot360", "Mustang Mach-E"],
    "esg_statement_flag": false, // Explicit ESG language
    "recall_flag": false, // Explicit recall/defect disclosure
    "regulatory_notice_flag": false // Explicit regulatory or legal update
  },
  "release_impact_context": {
    "related_ticker_mentions": ["TSLA", "GM"],
    "sector": "Automotive",
    "region": ["US", "Canada"],
    "previous_release_novelty_score": 0.51,
    "days_since_last_material_release": 54,
    "media_pickup_estimate": 129, // Estimated or actual major media pickups
    "topic_embedding_vector": [0.012, -0.045, ...] // Text embedding for deep models
  }
}
Field Explanations & Grouped Overview
Release Metadata (release_metadata)
press_release_id: Unique identifier.
headline: Full release headline text.
release_url: Reference for traceability and compliance.
release_type: Encodes high-level event type for factor modeling.
distribution_channel: Sources where the release appears (site, wire, SEC, etc.).
after_hours_flag/embargo_flag: Timing context and information advantage.
word_count/reading_time_minutes: Measures of informativeness and breadth.
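As a sketch of how the timing and length fields might be derived: the session bounds (09:30 to 16:00), the 200 words-per-minute default, and both function names are assumptions. The code also assumes event_time already carries the exchange-local UTC offset, as in the sample record, and it ignores market holidays.

```python
from datetime import datetime, time

# Assumed regular-session bounds in exchange-local wall-clock time.
SESSION_OPEN = time(9, 30)
SESSION_CLOSE = time(16, 0)


def after_hours_flag(event_time: str) -> bool:
    """True when the release lands outside the assumed regular session."""
    ts = datetime.fromisoformat(event_time)  # offset assumed exchange-local
    if ts.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (SESSION_OPEN <= ts.time() < SESSION_CLOSE)


def reading_time_minutes(word_count: int, wpm: int = 200) -> float:
    """Estimated reading time; the words-per-minute rate is an assumption."""
    return round(word_count / wpm, 1)
```

For the sample record, after_hours_flag("2020-06-18T10:30:02-04:00") is False, matching the stored flag. The stored reading time of 3.2 minutes for 653 words implies a rate of roughly 204 words per minute rather than the 200 used as the default here.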
NLP Features (nlp_features)
headline_sentiment/body_sentiment: Polarity scores for headline and full text using domain-tuned models.
sentiment_confidence: Model-reported reliability of sentiment scores.
novelty_score: Cosine/spectral distance from recent releases, i.e. how dissimilar the text is to the trailing 90-day corpus (attention proxy; see FININ model).
subjectivity_score: Objective/factual vs. opinionated tone.
forward_guidance_flag: True if press release language provides future performance or product outlook.
event_keywords: Key phrases, products, or technology themes extracted from text.
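One way novelty_score could be computed is as one minus the highest cosine similarity between the new release and any release in the trailing window. The sketch below uses raw term-frequency vectors as a stand-in for real embeddings and takes the maximum over documents instead of comparing against a corpus centroid; both choices, and all function names, are assumptions. Filtering the corpus to the trailing 90 days is left to the caller.

```python
import math
import re
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Lower-cased term-frequency vector (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def novelty_score(new_text: str, trailing_corpus: list[str]) -> float:
    """1 - max cosine similarity to any prior release; 1.0 if no history."""
    if not trailing_corpus:
        return 1.0
    new_vec = tf_vector(new_text)
    best = max(cosine_similarity(new_vec, tf_vector(doc)) for doc in trailing_corpus)
    return 1.0 - best
```

A release that repeats an earlier one almost verbatim scores near 0, while one sharing no vocabulary with the window scores 1.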
Structure and Content Markup (structure_and_content)
quotes_count: Number of quoted speakers—higher counts often occur in product or M&A news for credibility.
c_level_quote_count: Number directly from CEO/CFO/President; indicates event importance.
safe_harbor_ratio: Fraction of safe-harbor (legal/pre-emptive risk) language—often inversely correlated with informativeness.
body_bullet_presence: Indicates use of bullets, often for key claims or data.
mentions_competitors: Which publicly traded competitors are referenced.
product_names_mentioned: List of covered products or systems.
esg_statement_flag/recall_flag/regulatory_notice_flag: ESG, recall, and regulatory triggers for anomaly or compliance-oriented modeling.
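A rough sketch of how safe_harbor_ratio might be measured: tag paragraphs that match known boilerplate patterns, then divide their word count by the total. The three regular expressions and the function names are illustrative assumptions, not a complete safe-harbor detector.

```python
import re

# Hypothetical boilerplate markers; a production list would be much longer.
BOILERPLATE_PATTERNS = [
    re.compile(r"forward[- ]looking statements?", re.IGNORECASE),
    re.compile(r"safe harbor", re.IGNORECASE),
    re.compile(r"private securities litigation reform act", re.IGNORECASE),
]


def split_boilerplate(paragraphs: list[str]) -> tuple[list[str], list[str]]:
    """Partition paragraphs into (body, boilerplate) by pattern match."""
    body, boilerplate = [], []
    for p in paragraphs:
        is_boilerplate = any(rx.search(p) for rx in BOILERPLATE_PATTERNS)
        (boilerplate if is_boilerplate else body).append(p)
    return body, boilerplate


def safe_harbor_ratio(paragraphs: list[str]) -> float:
    """Fraction of words that sit in boilerplate paragraphs (0.0 if empty)."""
    total = sum(len(p.split()) for p in paragraphs)
    if total == 0:
        return 0.0
    _, boilerplate = split_boilerplate(paragraphs)
    return sum(len(p.split()) for p in boilerplate) / total
```

The body list returned by split_boilerplate is also what a sentiment model would score if boilerplate is to be kept out of body_sentiment.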
Release Impact Context (release_impact_context)
related_ticker_mentions: Peer tickers referenced in text—proxy for competitive context.
sector/region: Aids grouping and universe construction for event studies.
previous_release_novelty_score/days_since_last_material_release: Materials for modeling information surprise or frequency.
media_pickup_estimate: Count of external articles or pickups, actual or estimated (proxy for news reach/virality).
topic_embedding_vector: Text embedding for use with neural models (e.g., BERT-style transformer encoders or GloVe word vectors).
Alpha Hypotheses
Positive-tone product launches (headline_sentiment > 0.4, novelty_score > 0.6) yield +0.5% abnormal return over the next two sessions in discretionary and tech sectors, backed by the Chalmers Univ. press-release CAR study.
Earnings “pre-releases” flagged as release_type=earnings_update move post-open volatility; sentiment-weighted surprises improve overnight straddle selection (Zhang 2024 earnings-PR study).
Governance releases announcing buy-backs or dividend hikes compress spreads and lead to 20-day momentum; conversely, leadership departures with negative sentiment show −1 % one-day drift.
Timing asymmetry—before-market vs after-close: morning bullish releases have stronger day-one continuation, while bearish after-close items see gap-down then partial reversal, exploitable with open-close intraday books.
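To make the first hypothesis testable, a screen over the feature record plus a simple abnormal-return measure is enough for a first pass. The sketch below assumes the record layout from the features example and a market-adjusted model (stock return minus benchmark return, summed over the window); that model choice and both function names are assumptions, not the method of the cited studies.

```python
def is_bullish_launch(
    record: dict, sentiment_min: float = 0.4, novelty_min: float = 0.6
) -> bool:
    """Screen for positive-tone, novel product launches (thresholds from the text)."""
    meta = record["release_metadata"]
    nlp = record["nlp_features"]
    return (
        meta["release_type"] == "product_launch"
        and nlp["headline_sentiment"] > sentiment_min
        and nlp["novelty_score"] > novelty_min
    )


def cumulative_abnormal_return(
    stock_returns: list[float], benchmark_returns: list[float]
) -> float:
    """Market-adjusted CAR: sum of (stock - benchmark) over the event window."""
    if len(stock_returns) != len(benchmark_returns):
        raise ValueError("return series must cover the same sessions")
    return sum(s - b for s, b in zip(stock_returns, benchmark_returns))
```

Running the screen over history and averaging cumulative_abnormal_return across the two sessions after each hit gives a direct check on the +0.5% claim before any capital is committed.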
Risks and Mitigation
Marketing language can distort sentiment
FinBERT sentiment scores may be overly positive due to promotional phrasing. Improve accuracy by combining with:
novelty_score (how new the language is)
Mentions of concrete terms like “orders” or “backlog”
Duplicate headlines are common
News syndication often floods feeds with the same story.
De-duplicate using a hash of the headline and timestamp
Or rely on official Business Wire GUIDs when available
Small-cap disclosures may be website-only
Some small public companies post press releases only on their own sites.
Maintain fallback web crawlers to catch these
Boilerplate sections dilute sentiment analysis
Standard disclaimers like “forward-looking statements” or ESG blurbs add noise.
Exclude sections that match known boilerplate patterns
Simultaneous 8-Ks can fragment event timestamps
Companies may file 8-Ks and issue press releases at different times.
Use the earliest press wire timestamp for market reaction modeling
Tag EDGAR confirmation time separately as a feature
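The last mitigation can be sketched as a small resolver that picks the earliest wire timestamp as the event time and records the EDGAR lag as a separate feature. The function name and output keys are hypothetical, and the sketch assumes every timestamp is ISO 8601 with a UTC offset (mixing offset-aware and naive timestamps would raise a TypeError).

```python
from datetime import datetime
from typing import Optional


def resolve_event_times(wire_times: list[str], edgar_time: Optional[str]) -> dict:
    """Use the earliest wire timestamp as event_time; keep EDGAR lag as a feature."""
    if not wire_times:
        raise ValueError("at least one wire timestamp is required")
    event_time = min(datetime.fromisoformat(t) for t in wire_times)
    lag_seconds = None
    if edgar_time is not None:
        # Positive lag means the 8-K confirmation arrived after the wire release.
        lag_seconds = (datetime.fromisoformat(edgar_time) - event_time).total_seconds()
    return {
        "event_time": event_time.isoformat(),
        "edgar_confirmation_time": edgar_time,
        "edgar_lag_seconds": lag_seconds,
    }
```

Keeping the lag as its own field lets a model learn whether releases that run well ahead of their 8-K behave differently from those filed at nearly the same moment.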