DEV Community

ZoktrFall

Building a Website Contact Scraper API in .NET 10: Architecture, Crawling, and Fighting Cloudflare


I built an API that takes a domain and returns emails, phones, social profiles, and company info. One call:

```
GET /api/v1/website/contacts?domain=stripe.com
```

Returns verified emails with confidence scores, phones, LinkedIn/Twitter/GitHub links, and crawl metadata. Here's how the interesting parts work.


Architecture

Clean layered architecture — Api → Application → Domain, with Infrastructure implementing the Application interfaces. The controller is 12 lines of plumbing; all the real work happens in the crawler and the extractors.
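For flavor, a controller at that layer looks roughly like this. The interface and DTO names (`IContactDiscoveryService`, `ContactReport`) are illustrative, not the actual production types:

```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/v1/website")]
public sealed class WebsiteController : ControllerBase
{
    private readonly IContactDiscoveryService _contacts; // Application-layer interface

    public WebsiteController(IContactDiscoveryService contacts) => _contacts = contacts;

    [HttpGet("contacts")]
    public async Task<ActionResult<ContactReport>> Get(
        [FromQuery] string domain, CancellationToken ct)
    {
        if (string.IsNullOrWhiteSpace(domain)) return BadRequest("domain is required");
        return Ok(await _contacts.DiscoverAsync(domain, ct)); // everything real lives behind this call
    }
}
```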


The Two-Phase Crawler

The crawler uses a priority queue and runs in two phases.

Fast path — first 18 pages, only high-value routes: /contact, /about, /privacy, /legal. Gets real contacts in under 2 seconds for most sites.

Stage two — deferred URLs get promoted once the fast path finishes. Handles sites where contacts are buried under /company/offices/regional/emea/contact.
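The two phases can be sketched as a single drain over a priority queue. `CrawlPageAsync` (fetch + extract, returning discovered links) and `ScoreUrl` are placeholders for the real pipeline, and the score threshold is an assumption; note that .NET's `PriorityQueue` is a min-heap, so priorities are negated scores:

```csharp
// Sketch of the two-phase drain; helper names and the >= 90 cutoff are illustrative.
async Task CrawlAsync(Uri seed)
{
    const int FastPathBudget = 18;
    var queue = new PriorityQueue<Uri, int>(); // min-heap: enqueue with -score
    var deferred = new Queue<Uri>();
    queue.Enqueue(seed, -ScoreUrl(seed));

    // Phase 1: only high-value routes are fetched; everything else is parked.
    for (var fetched = 0;
         fetched < FastPathBudget && queue.TryDequeue(out var url, out _);
         fetched++)
    {
        foreach (var link in await CrawlPageAsync(url))
        {
            if (ScoreUrl(link) >= 90) queue.Enqueue(link, -ScoreUrl(link));
            else deferred.Enqueue(link);
        }
    }

    // Phase 2: promote the parked URLs and keep draining by priority.
    while (deferred.TryDequeue(out var parked))
        queue.Enqueue(parked, -ScoreUrl(parked));
    while (queue.TryDequeue(out var next, out _))
        await CrawlPageAsync(next);
}
```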

Every URL gets a priority score before entering the queue:

```csharp
private static readonly (string Segment, int Score)[] PriorityPathSegments =
[
    ("/contact",    120),
    ("/contact-us", 118),
    ("/support",    115),
    ("/privacy",    110),
    ("/about",       95),
    // ...
];
```
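The scorer itself can be a first-match scan over those segments. A minimal sketch — the fallback base score and depth penalty here are assumptions, not the production formula:

```csharp
// Illustrative scorer: first matching priority segment wins; unmatched paths
// get a low base score with a small penalty per path level, so deeper pages
// naturally crawl later.
private static int ScoreUrl(Uri url)
{
    var path = url.AbsolutePath.ToLowerInvariant();
    foreach (var (segment, score) in PriorityPathSegments)
        if (path.Contains(segment)) return score;

    var depth = path.Count(c => c == '/');
    return Math.Max(0, 40 - 5 * depth);
}
```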

Route family deduplication strips locale prefixes so /en/contact, /fr/contact, /de/contact are treated as one family and fetched once. This was the highest-leverage optimization — cut unnecessary fetches dramatically on international sites.
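The family key can be as simple as the path with a leading locale segment stripped. A hypothetical version (the regex handles `/en/` and `/en-us/` style prefixes; the real implementation may cover more cases):

```csharp
using System.Text.RegularExpressions;

// Strip a leading two-letter locale segment (optionally with region) so
// /en/contact, /fr/contact, and /en-us/contact all map to the family "/contact".
private static readonly Regex LocalePrefix =
    new(@"^/[a-z]{2}(-[a-z]{2})?(?=/|$)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

private static string RouteFamily(Uri url)
{
    var path = LocalePrefix.Replace(url.AbsolutePath, "");
    return path.Length == 0 ? "/" : path.ToLowerInvariant();
}
```

Deduplication is then a `HashSet<string>` of family keys checked before enqueueing.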


Email Extraction

Five passes over each page's DOM: text nodes, mailto: anchors, data-cfemail attributes, element attributes, and JSON-LD blocks.
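As one example of a pass, the `mailto:` sweep is a few lines with AngleSharp (method name illustrative):

```csharp
using AngleSharp.Dom;

// Harvest mailto: anchors: strip the scheme, drop any ?subject=... query,
// and dedupe case-insensitively.
private static IEnumerable<string> ExtractMailtoEmails(IDocument document) =>
    document.QuerySelectorAll("a[href^='mailto:']")
        .Select(a => a.GetAttribute("href")!["mailto:".Length..])
        .Select(href => href.Split('?')[0].Trim())
        .Where(e => e.Contains('@'))
        .Distinct(StringComparer.OrdinalIgnoreCase);
```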

The Cloudflare email decoder was satisfying to build — CF XORs each byte with the first byte of the encoded string:

```csharp
private static string? DecodeCloudflareProtectedEmail(IElement element)
{
    var encoded = element.GetAttribute("data-cfemail");
    // Need at least the 2-char key plus one encoded byte, in whole hex pairs.
    if (string.IsNullOrWhiteSpace(encoded) || encoded.Length < 4 || encoded.Length % 2 != 0)
        return null;

    var key = Convert.ToByte(encoded[..2], 16); // first encoded byte is the XOR key
    var characters = new char[(encoded.Length / 2) - 1];
    for (var i = 2; i < encoded.Length; i += 2)
        characters[(i / 2) - 1] = (char)(Convert.ToByte(encoded.Substring(i, 2), 16) ^ key);

    return new string(characters);
}
```

Each email gets a confidence score built from multiple signals: domain match, role-based address, mailto: source, page context, footer placement, surrounding phrase ("email us at", "send resumes to"). Scoring beats hard accept/reject rules — real-world emails are messy.
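Additive scoring can look like the sketch below. `EmailCandidate` and every weight are made up for illustration; the production signals and numbers differ:

```csharp
// Hypothetical candidate shape: one flag per signal named in the text.
internal sealed record EmailCandidate(
    bool DomainMatchesSite,
    bool FromMailtoAnchor,
    bool OnContactPage,
    bool InFooter,
    bool NearContactPhrase,
    bool IsRoleAddress);

internal static class EmailScoring
{
    public static double Score(EmailCandidate c)
    {
        var score = 0.3;                           // base: syntactically valid hit
        if (c.DomainMatchesSite) score += 0.25;    // jane@stripe.com found on stripe.com
        if (c.FromMailtoAnchor)  score += 0.15;    // explicit link, rarely junk
        if (c.OnContactPage)     score += 0.15;
        if (c.NearContactPhrase) score += 0.10;    // "email us at ..."
        if (c.InFooter)          score += 0.05;
        if (c.IsRoleAddress)     score -= 0.05;    // info@ / noreply@ are weaker leads
        return Math.Clamp(score, 0.0, 1.0);
    }
}
```

The payoff of this shape: a borderline signal lowers confidence instead of discarding the email outright.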


Social Extraction

JSON-LD sameAs fields are the most reliable source. Sites that care about SEO publish their structured data carefully. Footer anchor tags are noisier — share buttons, partner links, and embedded widgets all look like profiles. Weighting sameAs much higher than anchors halved the false-positive rate.
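Pulling `sameAs` out of JSON-LD is straightforward with `System.Text.Json` on top of AngleSharp. A sketch, assuming a top-level `sameAs` array (real-world JSON-LD can also nest it inside `@graph`):

```csharp
using System.Text.Json;
using AngleSharp.Dom;

private static IEnumerable<string> ExtractSameAsLinks(IDocument document)
{
    foreach (var script in document.QuerySelectorAll("script[type='application/ld+json']"))
    {
        JsonDocument json;
        try { json = JsonDocument.Parse(script.TextContent); }
        catch (JsonException) { continue; } // malformed JSON-LD is common; skip it

        if (json.RootElement.ValueKind == JsonValueKind.Object &&
            json.RootElement.TryGetProperty("sameAs", out var sameAs) &&
            sameAs.ValueKind == JsonValueKind.Array)
        {
            foreach (var link in sameAs.EnumerateArray())
                if (link.ValueKind == JsonValueKind.String)
                    yield return link.GetString()!;
        }
    }
}
```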


The Cloudflare Problem I Haven't Fully Solved

This is where I'm stuck and genuinely want input from anyone who's dealt with this.

Locally, the crawler handles Cloudflare-protected sites reasonably well — persistent cookie jar, correct Sec-Fetch-* headers, headless Chrome fallback with a spoofed user agent. Works fine on my machine.

In production on Railway (datacenter IP), the same code gets blocked on a significant percentage of Cloudflare-protected sites. Challenge pages, 403s, silent blocks. The headless fallback helps but doesn't fully solve it.

My current setup:

```csharp
// Persistent cookie jar across requests
handler.UseCookies = true;
handler.CookieContainer = new CookieContainer();

// Full Chrome header fingerprint
client.DefaultRequestHeaders.TryAddWithoutValidation("Sec-Fetch-Dest", "document");
client.DefaultRequestHeaders.TryAddWithoutValidation("Sec-Fetch-Mode", "navigate");
client.DefaultRequestHeaders.TryAddWithoutValidation("sec-ch-ua",
    "\"Google Chrome\";v=\"135\", \"Not-A.Brand\";v=\"8\"");
```

I understand the core issue — datacenter IPs are pre-scored as high-risk by Cloudflare regardless of headers. Residential proxies are the obvious answer but add cost and complexity I haven't wired up yet.
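For what it's worth, the plumbing side of a proxy is trivial once a provider is picked; it's one change to the handler. The host and credentials below are placeholders, not a provider recommendation:

```csharp
using System.Net;

// Route all crawler traffic through an upstream (e.g. residential) proxy.
var handler = new HttpClientHandler
{
    UseCookies = true,
    CookieContainer = new CookieContainer(),
    UseProxy = true,
    Proxy = new WebProxy("http://proxy.example.com:8080")
    {
        Credentials = new NetworkCredential("user", "pass")
    }
};
var client = new HttpClient(handler);
```

The hard part is cost and rotation strategy, not the code.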

What I'm wondering:

  • Has anyone solved this cleanly in .NET without proxies?
  • Is there a proxy provider that works well for this use case without breaking the bank?
  • Any other signals I'm missing that would help on datacenter IPs?

You can test the API yourself and see where it succeeds and fails — free tier, no credit card:
👉 https://rapidapi.com/zoktrapi-zoktrapi-default/api/website-contacts-finder

If you find a domain where results are wrong or missing, drop it in the comments. Genuinely useful for debugging.


Stack

.NET 10 · ASP.NET Core · HtmlAgilityPack · AngleSharp · Redis · headless Chrome · Railway

Happy to answer questions — and really hoping someone has cracked the datacenter IP problem.
