The single-source problem
You pick one free data API for financial information. It works well most of the time. But:
- Some companies report through non-standard channels and the API misses them
- Cash flow data for certain sectors is systematically wrong
- The API rate-limits you and returns empty data without saying so
- A company restructuring causes a gap in the data for 2–3 weeks
Your analysis breaks silently for affected companies. You don't know which ones until you manually check.
The layering pattern
Instead of choosing one source, define a priority order:
```python
def get_cash_flow(ticker: str) -> float | None:
    # 1. Try the primary source
    primary = fetch_from_primary_api(ticker, "freeCashFlow")
    if primary is not None:
        return primary

    # 2. Fall back to secondary source (e.g. official regulatory filings)
    secondary = fetch_from_sec_filings(ticker, "FreeCashFlow")
    if secondary is not None:
        return secondary

    # 3. No data available
    return None
```
The primary source handles the majority of cases. The secondary source catches the gaps. Neither source needs to be perfect — together they cover more of the space.
Why "fallback" is better than "merge"
A tempting alternative is to merge data from both sources — average them, or take the max, or reconcile differences. This is more complex and introduces new failure modes: what if the two sources disagree significantly? Which one is right?
The fallback pattern is simpler: primary is trusted if available; secondary is used only when primary is absent. You never have to reconcile disagreement because you never look at the secondary if the primary gave you something.
Rate limit isolation
Two sources also means two rate limit buckets. If the primary API rate-limits you, the secondary is unaffected. You can fetch from the secondary while the primary recovers.
```python
for ticker in tickers:
    data = get_data_with_fallback(ticker)
    process(data)
    time.sleep(2)  # still rate-limit between requests
```
The sleep still applies — you're still making requests to external APIs. But the sleep now protects two APIs simultaneously, and a rate limit on one doesn't stop the pipeline entirely.
Logging which source was used
For debugging and data quality monitoring, log which source provided each data point:
```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    value: float | None
    source: str  # "primary", "secondary", "none"
    ticker: str
    metric: str
```
This lets you answer: "what percentage of our data is coming from the fallback?" A high fallback rate for a specific metric signals that the primary source has a systematic gap there.
When to add a third source
Two sources cover most gaps. Add a third only when:
- You have a specific metric that both primary and secondary miss for a meaningful portion of your universe
- The third source is genuinely independent of the first two — different provider, different authentication, its own rate-limit bucket — so it fails independently
- You've measured the gap and it materially affects your analysis
Don't add sources speculatively. Each additional source adds maintenance overhead and the possibility of new failure modes. Add them in response to measured gaps, not anticipated ones.
The principle
Resilience in data pipelines comes from redundancy, not from finding the perfect single source. Accept that any single free API will have gaps. Layer sources to fill the gaps, log which source filled each gap, and monitor the distribution over time.