Company Press Release

🎯 This data dump is based on my autonomous vehicle research with Ford at UC San Diego. I was working as a research assistant on human-driver AI collaboration but was also interested in learning about how to use company press releases post-product launch for predictive modeling.

Data Overview

Unlike mandatory SEC filings, a press release is a voluntary, issuer-authored news bulletin, pushed simultaneously to Business Wire/PR Newswire, e-mail lists, and the firm’s IR page, to shape the narrative around product launches, earnings pre-announcements, M&A, or C-suite changes.

Because management controls tone and timing, releases often precede, or embellish, the corresponding 8-K items. One academic survey finds that firms issue a standalone press release for 37% of events that also require an 8-K, and that half of those go out before the regulatory filing, giving a tradable head start to anyone who ingests them instantly. Event-study work shows that positive-tone releases generate statistically significant cumulative abnormal returns, especially for small caps, and recent research links sentiment in earnings releases to both next-day volatility and 20-day drift.

Relevance for predictive modeling

Processed within sub-second latency and cross-validated against 8-K content, company press releases supply a high-frequency, management-voiced layer of information that complements regulatory filings and traditional news.

Raw structure and latency

A wire payload arrives as JSON or RSS with fields: headline, subhead, dateline, publish_time, body_html, tickers, industry_tags, and contact_info. The HTML page (see Ford’s hands-free-driving example) follows the same pattern: title, bullet highlights, management quotes, boilerplate safe-harbor language, IR contacts, and publishing timestamp.

Business Wire guarantees < 300 ms delivery after issuer release; EDGAR can lag by minutes for the companion 8-K. Treat press-release RSS/WebSocket feeds as “T-0” news and regulatory filings as confirmation.
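
As a rough illustration of handling the payload described above, the sketch below parses one JSON message into a typed record, strips the HTML body, and normalises publish_time to UTC. The field names come from the list above; the `WireRelease` class, the `parse_wire_payload` helper, the sample payload, and the assumption that publish_time is ISO 8601 with an explicit offset are all hypothetical, since real wire schemas vary by vendor.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


@dataclass
class WireRelease:
    headline: str
    publish_time: datetime  # normalised to UTC
    body_text: str
    tickers: list


def parse_wire_payload(raw: str) -> WireRelease:
    """Parse one JSON wire message (hypothetical schema) into a WireRelease."""
    payload = json.loads(raw)
    extractor = _TextExtractor()
    extractor.feed(payload["body_html"])
    extractor.close()
    # Join text nodes and collapse runs of whitespace.
    body_text = " ".join(" ".join(extractor.parts).split())
    # Assumption: publish_time is ISO 8601 with an explicit UTC offset.
    published = datetime.fromisoformat(payload["publish_time"]).astimezone(timezone.utc)
    return WireRelease(
        headline=payload["headline"],
        publish_time=published,
        body_text=body_text,
        tickers=list(payload.get("tickers", [])),
    )


# Made-up payload for illustration only.
sample = json.dumps({
    "headline": "Example Corp launches widget",
    "publish_time": "2020-06-18T10:30:02-04:00",
    "body_html": "<p>Example Corp today <b>launched</b> a widget.</p>",
    "tickers": ["EXM"],
})
release = parse_wire_payload(sample)
```

Normalising to UTC at parse time keeps later comparisons against EDGAR timestamps free of time-zone ambiguity.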

 

Data Processing Pipeline

This is an overview of what the pipeline could look like as part of a first-draft requirements sheet. Teams should refine based on tech stack and custom needs.

  • Ingest – subscribe to Business Wire/PR Newswire tick-stream plus hourly crawls of IR pages; write raw JSON blobs to Kafka partitioned by ticker.

  • HTML clean & de-dup – strip tags, normalise Unicode; hash (headline, publish_time) to collapse multiple syndications.

  • Ticker & entity tagging – regex for $AAPL / canonical names; cross-check against a ticker-brand knowledge graph. Multi-company releases explode into one record per equity.

  • NLP/LLM enrichment – run FinBERT sentiment on headline & bullet list; classify release type (earnings-update, product, governance, litigation, ESG) with a RoBERTa model; compute novelty as cosine distance vs previous 90-day release corpus.

  • Point-in-time join – attach last trade, bid-ask, option-IV snapshot; write (ticker,event_time) features to Redis (4-h TTL) for execution engines and Iceberg for history.

  • Manual QA – daily review of top-scoring bullish/bearish outliers to refine safe-harbor filters and sarcasm edge-cases; verify headline/date mismatches.
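
The de-dup and ticker-tagging steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `NAME_TO_TICKER` is a tiny hypothetical stand-in for the ticker-brand knowledge graph, the record keys (`headline`, `body`, `publish_time`) are assumed, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import re
import unicodedata

# Matches cashtags such as $AAPL (1-5 upper-case letters).
CASHTAG = re.compile(r"\$([A-Z]{1,5})\b")

# Hypothetical stand-in for the ticker-brand knowledge graph.
NAME_TO_TICKER = {"ford": "F", "general motors": "GM", "tesla": "TSLA"}


def dedup_key(headline: str, publish_time: str) -> str:
    """Hash the normalised (headline, publish_time) pair."""
    norm = unicodedata.normalize("NFKC", headline).casefold()
    norm = " ".join(norm.split())  # collapse whitespace differences
    return hashlib.sha256(f"{norm}|{publish_time}".encode("utf-8")).hexdigest()


def collapse_syndications(releases: list) -> list:
    """Keep the first copy of each (headline, publish_time) pair."""
    seen, unique = set(), []
    for r in releases:
        key = dedup_key(r["headline"], r["publish_time"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def tag_tickers(text: str) -> list:
    """Find cashtags plus canonical company names from the lookup table."""
    found = set(CASHTAG.findall(text))
    lowered = text.casefold()
    for name, ticker in NAME_TO_TICKER.items():
        if re.search(rf"\b{re.escape(name)}\b", lowered):
            found.add(ticker)
    return sorted(found)


def explode(release: dict) -> list:
    """Emit one record per equity mentioned in a multi-company release."""
    text = f"{release['headline']} {release['body']}"
    return [{**release, "ticker": t} for t in tag_tickers(text)]
```

Normalising case, Unicode form, and whitespace before hashing means trivially re-typeset syndications still collapse to one key; copies whose timestamps differ would need a looser key or the wire's own identifier.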

 

Features for Predictive Modeling

    {
      "ticker": "F",
      "press_release_id": "F-2020-06-18a", // Unique release identifier
      "event_time": "2020-06-18T10:30:02-04:00",
      "release_metadata": {
        "headline": "FORD CO-PILOT360 TECHNOLOGY ADDS HANDS-FREE DRIVING, NEW FEATURES TO MAKE DRIVING EASIER AND MORE ENJOYABLE",
        "release_url": "https://media.ford.com/content/fordmedia/fna/ca/en/news/2020/06/18/ford-co-pilot360-technology-adds-hands-free-driving.html",
        "release_type": "product_launch", // product_launch, earnings, guidance, mna, regulatory, ESG, recall, etc.
        "distribution_channel": ["corporate_site", "public_wire"],
        "after_hours_flag": false,
        "embargo_flag": false,
        "word_count": 653,
        "reading_time_minutes": 3.2
      },
      "nlp_features": {
        "headline_sentiment": 0.47, // -1 (negative) to 1 (positive)
        "body_sentiment": 0.32,
        "sentiment_confidence": 0.91, // LLM or model confidence in sentiment detection
        "novelty_score": 0.71, // Relative to trailing 90-day corpus
        "subjectivity_score": 0.15, // 0 = fact, 1 = highly subjective/opinion
        "forward_guidance_flag": true, // If language implies future outlook/guidance
        "event_keywords": ["ADAS", "hands-free", "driver assistance", "Mustang Mach-E"]
      },
      "structure_and_content": {
        "quotes_count": 5, // Count of attributed quotations/speaker turns
        "c_level_quote_count": 2, // CEO/CFO/President named
        "safe_harbor_ratio": 0.08, // Fraction of text that is boilerplate/risk language
        "body_bullet_presence": true, // If key facts appear as bullet points
        "mentions_competitors": ["GM", "Tesla"], // List of directly mentioned competitors
        "product_names_mentioned": ["Ford Co-Pilot360", "Mustang Mach-E"],
        "esg_statement_flag": false, // Explicit ESG language
        "recall_flag": false, // Explicit recall/defect disclosure
        "regulatory_notice_flag": false // Explicit regulatory or legal update
      },
      "release_impact_context": {
        "related_ticker_mentions": ["TSLA", "GM"],
        "sector": "Automotive",
        "region": ["US", "Canada"],
        "previous_release_novelty_score": 0.51,
        "days_since_last_material_release": 54,
        "media_pickup_estimate": 129, // Estimated or actual major media pickups
        "topic_embedding_vector": [0.012, -0.045, ...] // Text embedding for deep models
      }
    }
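
Before a nested record like this can feed a tabular model, it usually has to be flattened into named numeric columns. The sketch below shows one possible way to do that; the `flatten` helper and its conventions (booleans become 0/1, string lists become counts, free text and embedding vectors are skipped) are assumptions for illustration, not part of the schema.

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten a nested feature record into dotted, numeric columns.

    Illustrative conventions (assumptions, not part of the schema):
    booleans become 0.0/1.0, lists of strings become a count, and
    free-text strings and numeric vectors are skipped.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, (bool, int, float)):
            flat[name] = float(value)
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            flat[f"{name}.count"] = float(len(value))
    return flat


# A small subset of the record, with the comments and elided vector removed.
record = {
    "ticker": "F",
    "release_metadata": {"after_hours_flag": False, "word_count": 653},
    "nlp_features": {
        "headline_sentiment": 0.47,
        "event_keywords": ["ADAS", "hands-free"],
    },
}
features = flatten(record)
```

Embedding vectors are left out on purpose here; deep models would typically consume them as a separate tensor input rather than as individual columns.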

Field Explanations & Grouped Overview

Release Metadata (release_metadata)

  • press_release_id: Unique identifier (stored at the top level of the record, alongside ticker and event_time).

  • headline: Full release headline text.

  • release_url: Reference for traceability and compliance.

  • release_type: Encodes high-level event type for factor modeling.

  • distribution_channel: Sources where the release appears (site, wire, SEC, etc.).

  • after_hours_flag/embargo_flag: Timing context and information advantage.

  • word_count/reading_time_minutes: Measures of informativeness and breadth.

NLP Features (nlp_features)

  • headline_sentiment/body_sentiment: Polarity scores for headline and full text using domain-tuned models.

  • sentiment_confidence: Model-reported reliability of sentiment scores.

  • novelty_score: Cosine/spectral dissimilarity to recent releases, measured against the trailing 90-day corpus; higher means more novel (attention proxy; see FININ model).

  • subjectivity_score: Objective/factual vs. opinionated tone.

  • forward_guidance_flag: True if press release language provides future performance or product outlook.

  • event_keywords: Key phrases, products, or technology themes extracted from text.
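
To make novelty_score concrete, here is a minimal sketch that scores a new release as one minus its highest cosine similarity to any release in the trailing window. It uses plain bag-of-words counts so it runs with the standard library only; a production system would more likely use TF-IDF or sentence embeddings, and taking the maximum over the window (rather than, say, the similarity to a corpus centroid) is an assumption made here for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased token counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(new_text: str, trailing_corpus: list) -> float:
    """Cosine distance to the closest release in the trailing window.

    Returns 1.0 (fully novel) when the window is empty.
    """
    if not trailing_corpus:
        return 1.0
    new_vec = bag_of_words(new_text)
    best = max(cosine_similarity(new_vec, bag_of_words(t)) for t in trailing_corpus)
    return 1.0 - best
```

With this definition a verbatim re-issue scores close to 0 and a release sharing no vocabulary with the window scores 1.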

Structure and Content Markup (structure_and_content)

  • quotes_count: Number of quoted speakers—higher counts often occur in product or M&A news for credibility.

  • c_level_quote_count: Number directly from CEO/CFO/President; indicates event importance.

  • safe_harbor_ratio: Fraction of safe-harbor (legal/pre-emptive risk) language—often inversely correlated with informativeness.

  • body_bullet_presence: Indicates use of bullets, often for key claims or data.

  • mentions_competitors: Which publicly traded competitors are referenced.

  • product_names_mentioned: List of covered products or systems.

  • flags: ESG, recall, regulatory triggers for anomaly or compliance-oriented modeling.
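
Several of the structural fields above can be approximated with simple pattern matching over the release's paragraphs. The sketch below is one hedged way to do it: the regular expressions are hypothetical cues rather than a vetted boilerplate list, and quotes are counted per paragraph, which only approximates attributed speaker turns.

```python
import re

# Hypothetical boilerplate cues; a production list would be tuned on real releases.
SAFE_HARBOR_PATTERNS = [
    re.compile(r"forward-looking statements?", re.IGNORECASE),
    re.compile(r"private securities litigation reform act", re.IGNORECASE),
    re.compile(r"actual results (may|could) differ materially", re.IGNORECASE),
]
C_LEVEL = re.compile(r"\b(CEO|CFO|President|Chief \w+ Officer)\b")
QUOTE = re.compile(r'["“]([^"”]+)["”]')


def structure_features(paragraphs: list) -> dict:
    """Approximate structure_and_content fields from a list of paragraphs."""
    total_words = sum(len(p.split()) for p in paragraphs)
    boiler_words = sum(
        len(p.split())
        for p in paragraphs
        if any(rx.search(p) for rx in SAFE_HARBOR_PATTERNS)
    )
    quoted = [p for p in paragraphs if QUOTE.search(p)]
    return {
        "safe_harbor_ratio": boiler_words / total_words if total_words else 0.0,
        "quotes_count": len(quoted),
        "c_level_quote_count": sum(1 for p in quoted if C_LEVEL.search(p)),
        "body_bullet_presence": any(
            p.lstrip().startswith(("•", "-", "*")) for p in paragraphs
        ),
    }


# Made-up paragraphs for illustration only.
paragraphs = [
    '"We are excited," said Jane Doe, CEO of Example Corp.',
    "• Range of 300 miles",
    "This release contains forward-looking statements and actual results may differ materially.",
    '"Customers asked for this," said John Roe, product manager.',
]
feats = structure_features(paragraphs)
```

The same boilerplate patterns can double as an exclusion filter before sentiment scoring, so that legal language does not dilute the signal.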

Release Impact Context (release_impact_context)

  • related_ticker_mentions: Peer tickers referenced in text—proxy for competitive context.

  • sector/region: Aids grouping and universe construction for event studies.

  • previous_release_novelty_score/days_since_last_material_release: Materials for modeling information surprise or frequency.

  • media_pickup_estimate: Count of external articles or pickups, actual or estimated (proxy for news reach/virality).

  • topic_embedding_vector: Text embedding of the release (e.g. from BERT or GloVe) for use as input to neural models.

 

Alpha Hypotheses

  • Positive-tone product launches (headline_sentiment > 0.4, novelty_score > 0.6) yield roughly +0.5% abnormal return over the next two sessions in discretionary and tech sectors (backed by the Chalmers Univ. press-release CAR study).

  • Earnings “pre-releases” flagged as release_type=earnings_update move post-open volatility; sentiment-weighted surprises improve overnight straddle selection (Zhang 2024 earnings-PR study).

  • Governance releases announcing buy-backs or dividend hikes compress spreads and lead to 20-day momentum; conversely, leadership departures with negative sentiment show −1 % one-day drift.

  • Timing asymmetry (before-market vs after-close): morning bullish releases have stronger day-one continuation, while bearish after-close items see a gap-down then partial reversal, exploitable with open-close intraday books.
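
To test a hypothesis like the first one, a release has to be screened on its features and then scored with a cumulative abnormal return (CAR). The sketch below uses the simplest market-adjusted definition, where each day's abnormal return is the stock return minus the benchmark return; many event studies instead estimate a market-model beta, so treat this as an assumption. The function names, thresholds as defaults, and the sample returns are illustrative only.

```python
def is_launch_signal(record: dict, sentiment_min: float = 0.4,
                     novelty_min: float = 0.6) -> bool:
    """Screen for positive-tone, novel product launches."""
    meta = record["release_metadata"]
    nlp = record["nlp_features"]
    return (
        meta["release_type"] == "product_launch"
        and nlp["headline_sentiment"] > sentiment_min
        and nlp["novelty_score"] > novelty_min
    )


def cumulative_abnormal_return(stock_returns: list, benchmark_returns: list) -> float:
    """Market-adjusted CAR: sum of (stock return - benchmark return) per session."""
    if len(stock_returns) != len(benchmark_returns):
        raise ValueError("return series must cover the same sessions")
    return sum(s - b for s, b in zip(stock_returns, benchmark_returns))


# Feature values taken from the sample record; the returns are made up.
record = {
    "release_metadata": {"release_type": "product_launch"},
    "nlp_features": {"headline_sentiment": 0.47, "novelty_score": 0.71},
}
signal = is_launch_signal(record)
car_two_sessions = cumulative_abnormal_return([0.012, 0.004], [0.005, 0.001])
```

Averaging this CAR across all releases that pass the screen, and comparing against releases that fail it, gives a first rough check of whether the effect exists in a given universe.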

 

Risks and Mitigation

  • Marketing language can distort sentiment
    FinBERT sentiment scores may be overly positive due to promotional phrasing. Improve accuracy by combining with:

    • novelty_score (how new the language is)

    • Mentions of concrete terms like “orders” or “backlog”

  • Duplicate headlines are common
    News syndication often floods feeds with the same story.

    • De-duplicate using a hash of the headline and timestamp

    • Or rely on official Business Wire GUIDs when available

  • Small-cap disclosures may be website-only
    Some small public companies post press releases only on their own sites.

    • Maintain fallback web crawlers to catch these

  • Boilerplate sections dilute sentiment analysis
    Standard disclaimers like “forward-looking statements” or ESG blurbs add noise.

    • Exclude sections that match known boilerplate patterns

  • Companion 8-Ks can fragment event timestamps
    Companies may file 8-Ks and issue press releases at different times.

    • Use the earliest press wire timestamp for market reaction modeling

    • Tag EDGAR confirmation time separately as a feature
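
The last mitigation can be sketched as a small helper that picks the earliest wire timestamp as the event time and records the EDGAR confirmation time and lag as separate features. The function name, the output keys `edgar_confirmation_time` and `edgar_lag_seconds`, and the sample timestamps are assumptions for illustration; timestamps are assumed to be ISO 8601 strings with explicit offsets.

```python
from datetime import datetime, timezone


def resolve_event_times(wire_times: list, edgar_time: str = None) -> dict:
    """Use the earliest wire timestamp as event_time; tag EDGAR separately."""
    if not wire_times:
        raise ValueError("at least one wire timestamp is required")
    parsed = [datetime.fromisoformat(t).astimezone(timezone.utc) for t in wire_times]
    event_time = min(parsed)
    features = {
        "event_time": event_time.isoformat(),
        "edgar_confirmation_time": None,
        "edgar_lag_seconds": None,
    }
    if edgar_time is not None:
        confirmed = datetime.fromisoformat(edgar_time).astimezone(timezone.utc)
        features["edgar_confirmation_time"] = confirmed.isoformat()
        # Negative values would mean the 8-K landed before the wire release.
        features["edgar_lag_seconds"] = (confirmed - event_time).total_seconds()
    return features


# Made-up timestamps: two syndications of one release plus a later 8-K.
times = resolve_event_times(
    ["2020-06-18T10:30:05-04:00", "2020-06-18T14:30:02+00:00"],
    edgar_time="2020-06-18T10:34:02-04:00",
)
```

Keeping the lag as its own feature lets a model learn whether releases that run far ahead of their 8-K behave differently from near-simultaneous ones.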

 