The single-source problem
You pick one free data API for financial information. It works well most of the time. But:
- Some companies report through non-standard channels and the API misses them
- Cash flow data for certain sectors is systematically wrong
- The API rate-limits you and returns empty data without saying so
- A company restructuring causes a gap in the data for 2–3 weeks
Your analysis breaks silently for affected companies. You don't know which ones until you manually check.
The layering pattern
Instead of choosing one source, define a priority order:
```python
def get_cash_flow(ticker: str) -> float | None:
    # 1. Try the primary source
    primary = fetch_from_primary_api(ticker, "freeCashFlow")
    if primary is not None:
        return primary

    # 2. Fall back to secondary source (e.g. official regulatory filings)
    secondary = fetch_from_sec_filings(ticker, "FreeCashFlow")
    if secondary is not None:
        return secondary

    # 3. No data available
    return None
```
The primary source handles the majority of cases. The secondary source catches the gaps. Neither source needs to be perfect — together they cover more of the space.
Why "fallback" is better than "merge"
A tempting alternative is to merge data from both sources — average them, or take the max, or reconcile differences. This is more complex and introduces new failure modes: what if the two sources disagree significantly? Which one is right?
The fallback pattern is simpler: primary is trusted if available; secondary is used only when primary is absent. You never have to reconcile disagreement because you never look at the secondary if the primary gave you something.
Rate limit isolation
Two sources also means two rate limit buckets. If the primary API rate-limits you, the secondary is unaffected. You can fetch from the secondary while the primary recovers.
```python
for ticker in tickers:
    data = get_data_with_fallback(ticker)
    process(data)
    time.sleep(2)  # still rate-limit between requests
```
The sleep still applies — you're still making requests to external APIs. But the sleep now protects two APIs simultaneously, and a rate limit on one doesn't stop the pipeline entirely.
Logging which source was used
For debugging and data quality monitoring, log which source provided each data point:
```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    value: float | None
    source: str  # "primary", "secondary", "none"
    ticker: str
    metric: str
```
This lets you answer: "what percentage of our data is coming from the fallback?" A high fallback rate for a specific metric signals that the primary source has a systematic gap there.
When to add a third source
Two sources cover most gaps. Add a third only when:
- You have a specific metric that both primary and secondary miss for a meaningful portion of your universe
- The third source is genuinely independent of the first two — different provider, different authentication, its own rate-limit bucket — so it fails independently
- You've measured the gap and it materially affects your analysis
Don't add sources speculatively. Each additional source adds maintenance overhead and the possibility of new failure modes. Add them in response to measured gaps, not anticipated ones.
The principle
Resilience in data pipelines comes from redundancy, not from finding the perfect single source. Accept that any single free API will have gaps. Layer sources to fill the gaps, log which source filled each gap, and monitor the distribution over time.