DEV Community

ZoktrFall

Building a Website Contact Scraper API in .NET 10: Architecture, Crawling, and Fighting Cloudflare


I built an API that takes a domain and returns emails, phones, social profiles, and company info. One call:

```
GET /api/v1/website/contacts?domain=stripe.com
```

Returns verified emails with confidence scores, phones, LinkedIn/Twitter/GitHub links, and crawl metadata. Here's how the interesting parts work.


Architecture

Clean layered architecture — Api → Application → Domain, with Infrastructure implementing the Application interfaces. The controller is 12 lines of plumbing; all the real work happens in the crawler and the extractors.
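For flavor, a controller at that layer looks roughly like this. The interface and DTO names (`IContactDiscoveryService`, `ContactReport`) are illustrative, not the actual production types:

```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/v1/website")]
public sealed class WebsiteController : ControllerBase
{
    private readonly IContactDiscoveryService _contacts; // Application-layer interface

    public WebsiteController(IContactDiscoveryService contacts) => _contacts = contacts;

    [HttpGet("contacts")]
    public async Task<ActionResult<ContactReport>> Get(
        [FromQuery] string domain, CancellationToken ct)
    {
        if (string.IsNullOrWhiteSpace(domain)) return BadRequest("domain is required");
        return Ok(await _contacts.DiscoverAsync(domain, ct)); // everything real lives behind this call
    }
}
```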


The Two-Phase Crawler

The crawler uses a priority queue and runs in two phases.

Fast path — first 18 pages, only high-value routes: /contact, /about, /privacy, /legal. Gets real contacts in under 2 seconds for most sites.

Stage two — deferred URLs get promoted once the fast path finishes. Handles sites where contacts are buried under /company/offices/regional/emea/contact.
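The two phases can be sketched as a single drain over a priority queue. `CrawlPageAsync` (fetch + extract, returning discovered links) and `ScoreUrl` are placeholders for the real pipeline, and the score threshold is an assumption; note that .NET's `PriorityQueue` is a min-heap, so priorities are negated scores:

```csharp
// Sketch of the two-phase drain; helper names and the >= 90 cutoff are illustrative.
async Task CrawlAsync(Uri seed)
{
    const int FastPathBudget = 18;
    var queue = new PriorityQueue<Uri, int>(); // min-heap: enqueue with -score
    var deferred = new Queue<Uri>();
    queue.Enqueue(seed, -ScoreUrl(seed));

    // Phase 1: only high-value routes are fetched; everything else is parked.
    for (var fetched = 0;
         fetched < FastPathBudget && queue.TryDequeue(out var url, out _);
         fetched++)
    {
        foreach (var link in await CrawlPageAsync(url))
        {
            if (ScoreUrl(link) >= 90) queue.Enqueue(link, -ScoreUrl(link));
            else deferred.Enqueue(link);
        }
    }

    // Phase 2: promote the parked URLs and keep draining by priority.
    while (deferred.TryDequeue(out var parked))
        queue.Enqueue(parked, -ScoreUrl(parked));
    while (queue.TryDequeue(out var next, out _))
        await CrawlPageAsync(next);
}
```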

Every URL gets a priority score before entering the queue:

```csharp
private static readonly (string Segment, int Score)[] PriorityPathSegments =
[
    ("/contact",    120),
    ("/contact-us", 118),
    ("/support",    115),
    ("/privacy",    110),
    ("/about",       95),
    // ...
];
```
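The scorer itself can be a first-match scan over those segments. A minimal sketch — the fallback base score and depth penalty here are assumptions, not the production formula:

```csharp
// Illustrative scorer: first matching priority segment wins; unmatched paths
// get a low base score with a small penalty per path level, so deeper pages
// naturally crawl later.
private static int ScoreUrl(Uri url)
{
    var path = url.AbsolutePath.ToLowerInvariant();
    foreach (var (segment, score) in PriorityPathSegments)
        if (path.Contains(segment)) return score;

    var depth = path.Count(c => c == '/');
    return Math.Max(0, 40 - 5 * depth);
}
```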

Route family deduplication strips locale prefixes so /en/contact, /fr/contact, /de/contact are treated as one family and fetched once. This was the highest-leverage optimization — cut unnecessary fetches dramatically on international sites.
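The family key can be as simple as the path with a leading locale segment stripped. A hypothetical version (the regex handles `/en/` and `/en-us/` style prefixes; the real implementation may cover more cases):

```csharp
using System.Text.RegularExpressions;

// Strip a leading two-letter locale segment (optionally with region) so
// /en/contact, /fr/contact, and /en-us/contact all map to the family "/contact".
private static readonly Regex LocalePrefix =
    new(@"^/[a-z]{2}(-[a-z]{2})?(?=/|$)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static string RouteFamily(Uri url)
{
    var path = LocalePrefix.Replace(url.AbsolutePath, "");
    return path.Length == 0 ? "/" : path.ToLowerInvariant();
}
```

Deduplication is then a `HashSet<string>` of family keys checked before enqueueing.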


Email Extraction

Five passes over each page's DOM: text nodes, mailto: anchors, data-cfemail attributes, element attributes, and JSON-LD blocks.
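As one example of a pass, the `mailto:` sweep is a few lines with AngleSharp (method name illustrative):

```csharp
using AngleSharp.Dom;

// Harvest mailto: anchors: strip the scheme, drop any ?subject=... query,
// and dedupe case-insensitively.
private static IEnumerable<string> ExtractMailtoEmails(IDocument document) =>
    document.QuerySelectorAll("a[href^='mailto:']")
        .Select(a => a.GetAttribute("href")!["mailto:".Length..])
        .Select(href => href.Split('?')[0].Trim())
        .Where(e => e.Contains('@'))
        .Distinct(StringComparer.OrdinalIgnoreCase);
```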

The Cloudflare email decoder was satisfying to build — CF XORs each byte with the first byte of the encoded string:

```csharp
private static string? DecodeCloudflareProtectedEmail(IElement element)
{
    var encoded = element.GetAttribute("data-cfemail");
    // Need at least the 2-char key plus one encoded byte, in whole hex pairs.
    if (string.IsNullOrWhiteSpace(encoded) || encoded.Length < 4 || encoded.Length % 2 != 0)
        return null;

    var key = Convert.ToByte(encoded[..2], 16); // first encoded byte is the XOR key
    var characters = new char[(encoded.Length / 2) - 1];
    for (var i = 2; i < encoded.Length; i += 2)
        characters[(i / 2) - 1] = (char)(Convert.ToByte(encoded.Substring(i, 2), 16) ^ key);

    return new string(characters);
}
```

Each email gets a confidence score built from multiple signals: domain match, role-based address, mailto: source, page context, footer placement, surrounding phrase ("email us at", "send resumes to"). Scoring beats hard accept/reject rules — real-world emails are messy.
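Additive scoring can look like the sketch below. `EmailCandidate` and every weight are made up for illustration; the production signals and numbers differ:

```csharp
// Hypothetical candidate shape: one flag per signal named in the text.
internal sealed record EmailCandidate(
    bool DomainMatchesSite,
    bool FromMailtoAnchor,
    bool OnContactPage,
    bool InFooter,
    bool NearContactPhrase,
    bool IsRoleAddress);

internal static class EmailScoring
{
    public static double Score(EmailCandidate c)
    {
        var score = 0.3;                           // base: syntactically valid hit
        if (c.DomainMatchesSite) score += 0.25;    // jane@stripe.com found on stripe.com
        if (c.FromMailtoAnchor)  score += 0.15;    // explicit link, rarely junk
        if (c.OnContactPage)     score += 0.15;
        if (c.NearContactPhrase) score += 0.10;    // "email us at ..."
        if (c.InFooter)          score += 0.05;
        if (c.IsRoleAddress)     score -= 0.05;    // info@ / noreply@ are weaker leads
        return Math.Clamp(score, 0.0, 1.0);
    }
}
```

The payoff of this shape: a borderline signal lowers confidence instead of discarding the email outright.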


Social Extraction

JSON-LD sameAs fields are the most reliable source. Sites that care about SEO publish their structured data carefully. Footer anchor tags are noisier — share buttons, partner links, and embedded widgets all look like profiles. Weighting sameAs much higher than anchors halved the false-positive rate.
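Pulling `sameAs` out of JSON-LD is straightforward with `System.Text.Json` on top of AngleSharp. A sketch, assuming a top-level `sameAs` array (real-world JSON-LD can also nest it inside `@graph`):

```csharp
using System.Text.Json;
using AngleSharp.Dom;

private static IEnumerable<string> ExtractSameAsLinks(IDocument document)
{
    foreach (var script in document.QuerySelectorAll("script[type='application/ld+json']"))
    {
        JsonDocument json;
        try { json = JsonDocument.Parse(script.TextContent); }
        catch (JsonException) { continue; } // malformed JSON-LD is common; skip it

        if (json.RootElement.ValueKind == JsonValueKind.Object &&
            json.RootElement.TryGetProperty("sameAs", out var sameAs) &&
            sameAs.ValueKind == JsonValueKind.Array)
        {
            foreach (var link in sameAs.EnumerateArray())
                if (link.ValueKind == JsonValueKind.String)
                    yield return link.GetString()!;
        }
    }
}
```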


The Cloudflare Problem I Haven't Fully Solved

This is where I'm stuck and genuinely want input from anyone who's dealt with this.

Locally, the crawler handles Cloudflare-protected sites reasonably well — persistent cookie jar, correct Sec-Fetch-* headers, headless Chrome fallback with a spoofed user agent. Works fine on my machine.

In production on Railway (datacenter IP), the same code gets blocked on a significant percentage of Cloudflare-protected sites. Challenge pages, 403s, silent blocks. The headless fallback helps but doesn't fully solve it.

My current setup:

```csharp
// Persistent cookie jar across requests
handler.UseCookies = true;
handler.CookieContainer = new CookieContainer();

// Full Chrome header fingerprint
client.DefaultRequestHeaders.TryAddWithoutValidation("Sec-Fetch-Dest", "document");
client.DefaultRequestHeaders.TryAddWithoutValidation("Sec-Fetch-Mode", "navigate");
client.DefaultRequestHeaders.TryAddWithoutValidation("sec-ch-ua",
    "\"Google Chrome\";v=\"135\", \"Not-A.Brand\";v=\"8\"");
```

I understand the core issue — datacenter IPs are pre-scored as high-risk by Cloudflare regardless of headers. Residential proxies are the obvious answer but add cost and complexity I haven't wired up yet.
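For what it's worth, the plumbing side of a proxy is trivial once a provider is picked; it's one change to the handler. The host and credentials below are placeholders, not a provider recommendation:

```csharp
using System.Net;

// Route all crawler traffic through an upstream (e.g. residential) proxy.
var handler = new HttpClientHandler
{
    UseCookies = true,
    CookieContainer = new CookieContainer(),
    UseProxy = true,
    Proxy = new WebProxy("http://proxy.example.com:8080")
    {
        Credentials = new NetworkCredential("user", "pass")
    }
};
var client = new HttpClient(handler);
```

The hard part is cost and rotation strategy, not the code.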

What I'm wondering:

  • Has anyone solved this cleanly in .NET without proxies?
  • Is there a proxy provider that works well for this use case without breaking the bank?
  • Any other signals I'm missing that would help on datacenter IPs?

You can test the API yourself and see where it succeeds and fails — free tier, no credit card:
👉 https://rapidapi.com/zoktrapi-zoktrapi-default/api/website-contacts-finder

If you find a domain where results are wrong or missing, drop it in the comments. Genuinely useful for debugging.


Stack

.NET 10 · ASP.NET Core · HtmlAgilityPack · AngleSharp · Redis · headless Chrome · Railway

Happy to answer questions — and really hoping someone has cracked the datacenter IP problem.
