10-K (Annual Report, SEC Filing)
🎯 This data dump is based on my personal research while at BNY.
Data Overview


Figure: Example of the first two pages of a 10-K. Typical 10-Ks range from 80 to 300 pages, depending on the company’s size, complexity, and disclosure requirements.
A Form 10-K is the annual report mandated by the U.S. Securities and Exchange Commission (SEC), in which a domestic public issuer provides a comprehensive account of its business, audited financial condition, risk profile and governance for the just-ended fiscal year. Required under the Securities Exchange Act of 1934, the 10-K complements, but is far more detailed than, the glossy shareholder report, and it gives all market participants equal access to decision-critical information, in the same spirit as Regulation FD’s fair-disclosure regime. Properly parsed, version-controlled 10-K disclosures offer a slow-moving yet powerful fundamental overlay to a fast-paced stat-arb stack, capturing narrative shifts, hidden liabilities and management tone that price-only models miss.
Relevance for predictive modeling
The inline XBRL financial statements embedded in the filing are machine-readable fundamentals with no vendor lag. Ratio surprises (working-capital accruals, leverage, segment revenue volatility) help orthogonalise price-momentum factors.
Document anatomy & formats
Core Section (SEC Item) | Typical Analytics Value |
---|---|
Item 1 – Business | Narrative on segments, customers, suppliers, competition |
Item 1A – Risk Factors | Forward-looking risk disclosure; language shifts predict future under-performance |
Item 7 – MD&A | Management’s colour on results & outlook; lexical tone foreshadows guidance bias |
Financials (F-statements & notes) | Accruals, footnote contingencies, off-balance sheet items |
Exhibits | Debt covenants, material contracts, ESG data |
Formats:
EDGAR delivers filings as raw HTML and inline XBRL. Vendors like Capital IQ and PrivCo republish them as structured JSON or XML feeds. A complete archive of S&P 1500 10-K filings over the past 25 years is approximately 120 GB in compressed text. When parsed into tabular form using inline XBRL, the data volume roughly triples.
Latency:
SEC rules require large accelerated filers to file the 10-K within 60 days of fiscal year-end, accelerated filers within 75 days, and non-accelerated filers within 90 days. Most companies file earlier; the median lag is about 47 days. When firms file late, NT 10-K notices show a median delay of around 15 days. Amended filings (10-K/A) may arrive later, but they appear sporadically and without systematic timing.
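The deadline and lag arithmetic can be sketched as below. This is a minimal illustration, not a compliance tool: the category keys and function names are assumptions, the limits are the 60/75/90 calendar-day deadlines for large accelerated, accelerated and non-accelerated filers, and the sketch ignores the roll-forward that applies when a deadline lands on a weekend or holiday.

```python
from datetime import date, timedelta

# Calendar-day limits after fiscal year-end, by SEC filer category.
# The dictionary keys are illustrative names, not official identifiers.
DEADLINE_DAYS = {
    "large_accelerated": 60,
    "accelerated": 75,
    "non_accelerated": 90,
}


def filing_deadline(fye: date, category: str) -> date:
    """Nominal 10-K due date (no weekend/holiday adjustment)."""
    return fye + timedelta(days=DEADLINE_DAYS[category])


def filing_lag_days(fye: date, filed: date) -> int:
    """Days between fiscal year-end and the filing date."""
    return (filed - fye).days


def is_late(fye: date, filed: date, category: str) -> bool:
    """True if the filing arrived after the nominal deadline."""
    return filed > filing_deadline(fye, category)


# Dates taken from the sample record in this note, used purely as an example.
fye = date(2023, 12, 31)
filed = date(2024, 2, 28)
print(filing_lag_days(fye, filed), is_late(fye, filed, "large_accelerated"))
```

The lag itself can be stored as a feature, so late or unusually early filers stand out without any extra lookup.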
Data Processing Pipeline
This is an overview of what the pipeline could look like as part of a first-draft requirements sheet. Teams should refine based on tech stack and custom needs.
Ingest
The pipeline starts by gathering data from multiple sources:
Regularly polls the EDGAR RSS feed for new filings.
Listens to vendor pushes via Kafka.
Uses checksums to remove duplicate documents across sources.
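The checksum-based de-duplication step might look like the sketch below. The function names and the `(source, raw_bytes)` tuple layout are assumptions for illustration. Only line endings and surrounding whitespace are normalised, so a vendor copy that re-renders the HTML would need a stronger canonical form before hashing.

```python
import hashlib


def content_checksum(raw: bytes) -> str:
    """SHA-256 of the document after trivial normalisation."""
    normalised = raw.replace(b"\r\n", b"\n").strip()
    return hashlib.sha256(normalised).hexdigest()


def deduplicate(documents):
    """Keep the first document seen for each checksum.

    `documents` is an iterable of (source, raw_bytes) pairs.
    """
    seen = {}
    for source, raw in documents:
        digest = content_checksum(raw)
        if digest not in seen:
            seen[digest] = (source, raw)
    return list(seen.values())


docs = [
    ("edgar", b"<html>10-K body</html>\r\n"),
    ("vendor", b"<html>10-K body</html>\n"),  # same content, different line ending
    ("edgar", b"<html>another filing</html>"),
]
unique = deduplicate(docs)
print(len(unique))
```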
Parsing
Next, the raw filings are parsed into structured formats:
HTML filings are processed using BeautifulSoup.
Inline XBRL sections are parsed with Arelle and converted into Parquet format.
Sections are identified and split using regular expressions (e.g., detecting headers like “Item 1A.”).
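A minimal sketch of regex-based section splitting, assuming the filing has already been reduced to plain text with one header per line. The pattern and the `item_1a`-style keys are illustrative choices. Real filings repeat item headers in the table of contents, which is why the sketch keeps the longest body per item.

```python
import re

# Matches headers such as "Item 1.", "ITEM 1A." or "Item 7:" at the start of a line.
ITEM_HEADER = re.compile(
    r"^\s*item\s+(\d{1,2}[abc]?)\s*[.:\-]",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> dict:
    """Map 'item_<n>' to the text between that header and the next one."""
    matches = list(ITEM_HEADER.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = "item_" + match.group(1).lower()
        body = text[match.end():end].strip()  # body still begins with the item title
        # Table-of-contents entries yield short bodies; keep the longest one.
        if len(body) > len(sections.get(key, "")):
            sections[key] = body
    return sections


sample = (
    "Item 1. Business\nWe design vehicles.\n"
    "Item 1A. Risk Factors\nDemand may fall.\n"
    "Item 7. MD&A\nRevenue grew."
)
sections = split_sections(sample)
print(sorted(sections))
```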
Cleaning
The narrative content is cleaned and normalized:
Strips tables from narrative sections.
Removes remaining HTML tags.
Standardizes all Unicode characters for consistency.
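The three cleaning steps can be sketched with the standard library alone. The pipeline above names BeautifulSoup for parsing; `html.parser` is swapped in here only to keep the example dependency-free. The class name, the block-tag list and the choice of NFKC normalisation are assumptions.

```python
import re
import unicodedata
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}


class NarrativeExtractor(HTMLParser):
    """Collects text outside <table> elements and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.table_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table":
            if self.table_depth > 0:
                self.table_depth -= 1
        elif tag in BLOCK_TAGS and self.table_depth == 0:
            self.chunks.append(" ")  # keep a word break between blocks

    def handle_data(self, data):
        if self.table_depth == 0:
            self.chunks.append(data)


def clean_narrative(html: str) -> str:
    parser = NarrativeExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFKC", "".join(parser.chunks))
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "<p>Revenue&nbsp;grew <b>14%</b>.</p>"
    "<table><tr><td>123</td></tr></table>"
    "<p>The \ufb01rm expanded.</p>"
)
print(clean_narrative(raw))
```

NFKC folds compatibility characters such as non-breaking spaces and ligatures into their plain equivalents, which keeps downstream token counts stable across filings.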
Manual QA (Quality Assurance)
Human reviewers inspect a small sample weekly:
0.25% of filings are spot-checked to ensure accurate section splitting.
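One way to draw the 0.25% sample reproducibly is to hash a stable identifier and compare it with the sampling rate. Hashing the accession number is an assumption made here, and any stable key would work. The identifiers in the usage lines are synthetic.

```python
import hashlib

SAMPLE_RATE = 0.0025  # 0.25% of filings


def selected_for_review(filing_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically map a filing id to [0, 1) and compare with the rate."""
    digest = hashlib.sha256(filing_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Synthetic ids, only to show the share that gets picked.
picked = sum(selected_for_review(f"filing-{i}") for i in range(20000))
print(picked)
```

Because the decision depends only on the identifier, re-running the pipeline selects the same filings, so reviewers are not handed a different sample after a reprocess.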
NLP/LLM Annotation
Several layers of text analysis are applied:
Sentence-level sentiment analysis using FinBERT.
Risk factor detection using a SciBERT model fine-tuned on “Item 1A” disclosures.
Year-over-year text comparison using transformer-based edit distance.
Topic modeling and embeddings with BERTopic.
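The year-over-year comparison can be prototyped without a model. The sketch below swaps the transformer-based edit distance for `difflib.SequenceMatcher` over sentences, which only catches exact sentence matches, so it is a baseline rather than the method listed above. The sentence splitter is a naive regex and the function names are illustrative.

```python
import difflib
import re


def split_sentences(text: str) -> list:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def yoy_sentence_changes(prior: str, current: str) -> dict:
    """List sentences added or removed between two years of the same section."""
    old, new = split_sentences(prior), split_sentences(current)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return {"added": added, "removed": removed, "similarity": matcher.ratio()}


prior_text = "Demand may fall. We rely on suppliers."
current_text = (
    "Demand may fall. We rely on suppliers. "
    "Cyber attacks may disrupt operations."
)
changes = yoy_sentence_changes(prior_text, current_text)
print(changes["added"])
```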
Numeric Extraction
Financial footnotes and values are programmatically extracted:
Footnote content is scanned using regex.
Contextual classifiers determine the category (e.g., pension obligations, lease liabilities).
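A minimal sketch of the regex scan plus categorisation. The keyword rules stand in for the contextual classifier and are purely illustrative, as are the pattern, the category names and the sample sentences.

```python
import re
from decimal import Decimal

# "$2.4 billion", "$787 million", "$1,250"
AMOUNT = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?",
    re.IGNORECASE,
)
SCALE = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# Placeholder for a trained classifier: first matching keyword wins.
CATEGORY_KEYWORDS = {
    "pension_obligations": ("pension", "retirement benefit"),
    "lease_liabilities": ("lease",),
    "contingent_liabilities": ("contingent", "litigation"),
}


def extract_amounts(sentence: str) -> list:
    """Return dollar amounts found in the sentence as whole dollars."""
    amounts = []
    for number, unit in AMOUNT.findall(sentence):
        scale = SCALE.get(unit.lower(), 1)
        amounts.append(int(Decimal(number.replace(",", "")) * scale))
    return amounts


def classify(sentence: str) -> str:
    lowered = sentence.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


note = "Lease obligations totaled $787 million and $1,250 in fees."
print(classify(note), extract_amounts(note))
```

`Decimal` is used for the scaling so that values such as 2.4 billion convert to exact integers rather than floating-point approximations.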
Feature Store Update
Finally, all structured outputs are saved:
A point-in-time snapshot is stored, keyed by (ticker, fye_date), for use in downstream models.
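The point-in-time idea can be illustrated with an in-memory store. The class, its method names and the `available_at` timestamp are hypothetical, and a production feature store would persist versions rather than hold them in a dictionary. The key property is that a lookup only returns what was known at the as-of time.

```python
from bisect import bisect_right
from datetime import datetime


class PointInTimeStore:
    """Keeps every version of a feature record, keyed by (ticker, fye_date)."""

    def __init__(self):
        self._versions = {}  # key -> list of (available_at, features), sorted

    def put(self, ticker, fye_date, available_at, features):
        versions = self._versions.setdefault((ticker, fye_date), [])
        versions.append((available_at, dict(features)))
        versions.sort(key=lambda version: version[0])

    def get_as_of(self, ticker, fye_date, as_of):
        """Latest version available at or before `as_of`, else None."""
        versions = self._versions.get((ticker, fye_date), [])
        times = [available_at for available_at, _ in versions]
        idx = bisect_right(times, as_of)
        return versions[idx - 1][1] if idx else None


store = PointInTimeStore()
# Illustrative versions: an original filing and a later amendment.
store.put("TSLA", "2023-12-31", datetime(2024, 2, 28, 19, 56), {"revision_count": 0})
store.put("TSLA", "2023-12-31", datetime(2024, 4, 15, 8, 0), {"revision_count": 1})
print(store.get_as_of("TSLA", "2023-12-31", datetime(2024, 3, 1)))
```

A backtest that queries with its simulation date never sees the amended values before they existed, which avoids look-ahead bias.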
Features for Predictive Modeling
{
"ticker": "TSLA",
"fye_date": "2023-12-31",
"meta": {
"filing_datetime": "2024-02-28T19:56:14-05:00",
"after_hours_flag": true,
"edgar_accession_number": "0001318605-25-000015",
"revision_count": 0,
"filing_length_words": 67580,
"word_count_by_section": {
"risk_factors": 12754,
"mdna": 9460,
"business": 6138,
"financial_statements": 22488,
"legal_proceedings": 1862,
"esg_section": 225
}
},
"risk_disclosure": {
"risk_factor_novelty_jaccard": 0.34,
"risk_factor_update_flag": 1,
"risk_factor_sentiment": -0.14,
"risk_factor_length_words": 12754,
"emerging_risk_keywords": 6,
"cybersecurity_mentions": 12,
"climate_risk_mentions": 3,
"litigation_word_pct": 0.021
},
"mdna_section": {
"mdna_positive_tone": -0.08,
"mdna_forward_looking_pct": 0.17,
"mdna_sentiment_zscore_vs_sector": -0.9,
"mdna_readability_flesch": 26.1,
"mdna_novelty_cosine": 0.22,
"mdna_length_words": 9460
},
"financial_statement_metrics": {
"revenue": 123980000000,
"revenue_growth_yoy_pct": 14.3,
"cost_of_revenue": 79860000000,
"gross_margin_pct": 35.6,
"operating_income": 15780000000,
"net_income": 9982000000,
"ebitda": 17450000000,
"eps_basic": 3.82,
"eps_diluted": 3.64,
"cash_and_equivalents": 22000000000,
"total_assets": 91600000000,
"total_liabilities": 62000000000,
"total_equity": 29600000000,
"operating_cash_flow": 12240000000,
"free_cash_flow": 8640000000,
"debt_to_equity": 1.47
},
"footnotes_and_contingencies": {
"footnote_contingent_liab": 2400000000,
"footnote_lease_obligations": 787000000,
"footnote_pending_litigation_count": 5,
"footnote_tax_risk_flag": 1,
"off_balance_sheet_flag": 0
},
"esg_and_other_qualitative": {
"esg_section_flag": 1,
"esg_keyword_density": 0.015,
"climate_disclosure_flag": 1,
"esg_novelty_jaccard": 0.18,
"supply_chain_word_pct": 0.004
},
"semantic_nlp_features": {
"novel_words_ratio": 0.036,
"named_entity_mentions": {
"ceo_names": ["Elon Musk"],
"subsidiaries": ["Tesla Energy", "Tesla Finance"],
"major_partners": ["Panasonic", "CATL"],
"geographic_expansions": ["Giga Berlin", "Giga Mexico"]
},
"section_topic_weights": {
"ai_autonomy": 0.22,
"battery_tech": 0.11,
"manufacturing_expansion": 0.16,
"regulatory": 0.09,
"macroeconomics": 0.07,
"cybersecurity_privacy": 0.02
}
},
"visual_and_presentation_features": {
"table_count": 43,
"chart_count": 17,
"infographic_flag": 1
}
}
Field Definitions and Explanations
1. Meta & Filing Structure (meta)
filing_datetime: SEC filing time for market reaction modeling.
after_hours_flag: True if the filing was accepted outside regular market hours (before the open or after the close).
edgar_accession_number: Unique SEC filing ID.
revision_count: Number of amendments on EDGAR.
filing_length_words: Total word count.
word_count_by_section: Word counts for key items (Risk Factors, MD&A, Business, Financial Statements, Legal Proceedings, ESG).
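As a sketch, `after_hours_flag` could be derived from `filing_datetime` as follows. It assumes the timestamp already carries the exchange-local UTC offset, as in the sample record, and treats 09:30 to 16:00 on weekdays as regular hours. Exchange holidays are ignored.

```python
from datetime import datetime, time

MARKET_OPEN = time(9, 30)
MARKET_CLOSE = time(16, 0)


def after_hours_flag(filing_datetime: str) -> bool:
    """True if the filing landed outside regular weekday trading hours."""
    ts = datetime.fromisoformat(filing_datetime)
    if ts.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (MARKET_OPEN <= ts.time() < MARKET_CLOSE)


print(after_hours_flag("2024-02-28T19:56:14-05:00"))
```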
2. Risk Disclosures (risk_disclosure)
risk_factor_novelty_jaccard: Jaccard-based novelty of this year’s risk factors versus the prior year’s (commonly 1 − Jaccard similarity, so higher means more change).
risk_factor_update_flag: Indicates new/modified risks disclosed.
risk_factor_sentiment: Sentiment polarity of risk factors (NLP).
risk_factor_length_words: Sectional length.
emerging_risk_keywords: Mentions of new/emerging risk types (e.g., "AI", "geopolitics").
cybersecurity_mentions: Count of cybersecurity-related keywords.
climate_risk_mentions: Specific mentions of climate risk/regulation.
litigation_word_pct: % of legal wording in relevant sections.
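A minimal sketch of a Jaccard-based novelty score over word sets. Whether `risk_factor_novelty_jaccard` is stored as similarity or as 1 − similarity is a convention to fix up front, and the sketch assumes the latter. The tokeniser is a simple regex and the function names are illustrative.

```python
import re


def token_set(text: str) -> set:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(prior: str, current: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two vocabularies."""
    a, b = token_set(prior), token_set(current)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_novelty(prior: str, current: str) -> float:
    """Higher means the current text shares less vocabulary with the prior year."""
    return 1.0 - jaccard_similarity(prior, current)


print(jaccard_novelty("supply chain risk", "supply chain cyber risk"))
```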
3. MD&A Analytics (mdna_section)
mdna_positive_tone: Sentiment/NLP score (overall tone, e.g., FinBERT).
mdna_forward_looking_pct: % of text flagged as forward-looking.
mdna_sentiment_zscore_vs_sector: Tone anomaly vs. sector norm.
mdna_readability_flesch: Flesch readability (interpretive difficulty).
mdna_novelty_cosine: Surprise/novelty vs. prior MD&A.
mdna_length_words: Section length.
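For `mdna_readability_flesch`, a rough sketch of the Flesch Reading Ease formula is shown below. The syllable counter is a crude vowel-run heuristic and the sentence splitter breaks on decimals such as "3.5", so scores are approximate and best used for relative comparisons.

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic: each run of vowels counts as one syllable, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


plain = "The cat sat."
dense = "Notwithstanding aforementioned contingencies, consolidated amortization deteriorated."
print(flesch_reading_ease(plain) > flesch_reading_ease(dense))
```

Lower scores indicate harder text, so dense legal prose lands well below plain sentences.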
4. Financial Statement Metrics (financial_statement_metrics)
revenue, revenue_growth_yoy_pct: Total revenue, YoY change.
cost_of_revenue, gross_margin_pct: COGS and profitability profile.
operating_income, net_income, ebitda: Core income measures.
eps_basic, eps_diluted: Earnings per share.
cash_and_equivalents, total_assets, total_liabilities, total_equity: Balance sheet snapshot.
operating_cash_flow, free_cash_flow: Liquidity/health.
debt_to_equity: Leverage.
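Derived ratios and simple accounting identities make cheap sanity checks on extracted values. The sketch below recomputes the gross margin and checks that assets equal liabilities plus equity, using the figures from the sample record above. The function names are illustrative.

```python
def gross_margin_pct(revenue: int, cost_of_revenue: int) -> float:
    """Gross margin as a percentage of revenue, rounded to one decimal."""
    return round(100 * (revenue - cost_of_revenue) / revenue, 1)


def balance_sheet_gap(total_assets: int, total_liabilities: int, total_equity: int) -> int:
    """Zero when assets = liabilities + equity holds exactly."""
    return total_assets - (total_liabilities + total_equity)


# Values copied from the sample record in this note.
metrics = {
    "revenue": 123980000000,
    "cost_of_revenue": 79860000000,
    "total_assets": 91600000000,
    "total_liabilities": 62000000000,
    "total_equity": 29600000000,
}

margin = gross_margin_pct(metrics["revenue"], metrics["cost_of_revenue"])
gap = balance_sheet_gap(
    metrics["total_assets"], metrics["total_liabilities"], metrics["total_equity"]
)
print(margin, gap)
```

A non-zero gap usually points to a mis-tagged XBRL element or a scale error rather than a real accounting anomaly, so it is worth flagging before features reach the store.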
5. Footnotes & Contingencies (footnotes_and_contingencies)
footnote_contingent_liab: $ value of contingent liabilities.
footnote_lease_obligations: Lease note exposure.
footnote_pending_litigation_count: Pending legal cases disclosed.
footnote_tax_risk_flag: Major tax risk flagged.
off_balance_sheet_flag: Indicates any off-BS arrangements.
6. ESG & Qualitative Disclosures (esg_and_other_qualitative)
esg_section_flag: Explicit presence of ESG section.
esg_keyword_density: Density of ESG-related keywords.
climate_disclosure_flag: Binary flag for material climate disclosures.
esg_novelty_jaccard: Change/novelty in ESG content.
supply_chain_word_pct: Proportion of supply chain discussion.
7. Semantic NLP Features (semantic_nlp_features)
novel_words_ratio: Proportion of novel vocabulary/phrases.
named_entity_mentions: Key management, subsidiaries, partners, new markets.
section_topic_weights: Topic modeling weights (e.g., "ai_autonomy", "manufacturing_expansion").
8. Visual & Presentation (visual_and_presentation_features)
table_count: Number of tables in filing.
chart_count: Number of figures/charts.
infographic_flag: 1 if infographics are present; their presence is associated with increased volatility and attention.
Alpha Hypotheses
These are theories about how specific patterns in 10-Ks can help predict future stock returns:
H1: Risk-Factor Novelty Premium
When a company makes very few changes to the "Risk Factors" section (Item 1A) from year to year, it tends to perform better. If the language stays the same, it's a sign of stability. Large edits, on the other hand, may signal new risks.
▸ Morgan Stanley’s study showed that a portfolio that goes long on "low-change" firms and short on "high-change" ones had a Sharpe ratio of 0.59.
H2: Tone Drift in MD&A
If the tone in the "Management Discussion & Analysis" section (Item 7) becomes more negative compared to the prior year, the stock often underperforms.
▸ Backtests using the Loughran-McDonald sentiment dictionary found a −1.2% abnormal return in the following month when sentiment declined.
H3: Litigation Mentions and Risk
A higher number of legal terms like “class action,” “FTC,” or “EPA” in Item 3 (“Legal Proceedings”) suggests the company may be facing serious regulatory or legal threats.
▸ Research from Berkeley shows the market tends to underreact to this, but implied volatility increases and stocks often fall the following quarter.
H4: Complex Language Signals Risk
Companies that write in overly complex or dense language (as measured by readability scores) tend to have more future accounting misstatements. Investors often miss these red flags.
▸ Studies from 2010 to 2024, including recent work using LLM sentiment analysis, show that harder-to-read filings often lead to mispriced stocks.
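To make H2 concrete, the sketch below scores dictionary-based tone for two years of MD&A text and takes the difference. The word sets are tiny placeholders invented for illustration, not the Loughran-McDonald lists, which would be loaded from their published files in practice.

```python
import re

# Placeholder word lists for illustration only.
POSITIVE = {"growth", "improve", "strong", "gain"}
NEGATIVE = {"loss", "decline", "impairment", "adverse"}


def net_tone(text: str) -> float:
    """(positive hits - negative hits) / total words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


def tone_drift(prior: str, current: str) -> float:
    """Negative values mean the tone turned more negative year over year."""
    return net_tone(current) - net_tone(prior)


prior_mdna = "Strong growth in revenue."
current_mdna = "Revenue decline and impairment loss."
print(tone_drift(prior_mdna, current_mdna))
```

The same pattern extends to H3 by replacing the word sets with legal terms and restricting the input to Item 3.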
Risks and Mitigation
Here are key risks when using 10-Ks for alpha generation and how to deal with them:
Boilerplate Language and ESG Marketing Bloat
Overuse of generic language and marketing terms (especially around ESG) can dilute the value of sentiment signals.
▸ Mitigation: Focus analysis on specific sections (e.g., Item 1A only), and use TF-IDF weighting to reduce the influence of overused words in each sector.
Inconsistent Reporting Across Companies
Foreign private issuers file different forms (such as 20-F instead of 10-K) with different item numbers, and smaller reporting companies may provide scaled-down disclosures.
▸ Mitigation: Map all filings to a standard schema, and use models that adapt to language variations.
Delayed Filings or NT 10-K Notices
If a company files late or submits an NT 10-K notice of late filing, it can be a warning sign, but it also causes gaps in time-series data.
▸ Mitigation: Treat delay as a feature, and freeze each document version at the time it was available.
Changing XBRL Tags
The US-GAAP XBRL taxonomy is updated annually, introducing new tags and changing how data is reported.
▸ Mitigation: Use a dynamic mapping table and tie numeric features to higher-level US-GAAP elements to keep continuity.
Regulation Fair Disclosure (Reg FD) Compliance
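Full-text filings obtained from vendors often cannot be redistributed directly because of licensing restrictions; Reg FD itself governs selective disclosure by issuers rather than redistribution of public EDGAR filings.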
▸ Mitigation: Hash and store the original text securely, and only share derived metrics with end users.
Alpha Crowding from Other Quants
Some of these signals, like tracking risk-factor changes, are already used by sell-side quant desks.
▸ Mitigation: Combine with less common features like text complexity, contingent liability detection, or spikes in cybersecurity-related keywords.
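The TF-IDF mitigation listed under boilerplate language can be sketched as follows: term frequencies in one filing are down-weighted by how many sector peers use the same term. The smoothed IDF form, the tokeniser and the function names are assumptions, and the three peer texts are invented examples.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())


def sector_tfidf(doc: str, sector_docs: list) -> dict:
    """TF-IDF weights for `doc`, with document frequency counted within the sector."""
    n = len(sector_docs)
    doc_freq = Counter()
    for peer in sector_docs:
        doc_freq.update(set(tokenize(peer)))
    term_freq = Counter(tokenize(doc))
    total = sum(term_freq.values())
    # Smoothed IDF: a term used by every peer gets log(1) = 0.
    return {
        term: (count / total) * math.log((1 + n) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }


sector = [
    "we face competition risk",
    "risk of regulation",
    "ransomware risk increased",
]
weights = sector_tfidf(sector[2], sector)
print(sorted(weights, key=weights.get, reverse=True))
```

Terms that every peer uses drop to zero weight, so sentiment or keyword features built on these weights respond mainly to language that is unusual for the sector.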