I'm building ShenBi AI — an AI tool that turns Chinese short-video links (Douyin, Xiaohongshu, Kuaishou) into structured transcripts and rewrite-ready scripts. As a solo founder doing SEO myself, I needed a competitive analysis: who's ranking for `douyin transcript` and similar long-tail queries?
I asked Claude. It gave me a clean "top 10" list — ScreenApp, BibiGPT, Apify, and so on. Easy. I trusted it.
Then a small voice asked: is this actually what Google shows real users? Or is the AI returning some aggregator data (Bing? Brave? Custom Search API?) and calling it Google?
I decided to find out the hard way: scrape 32 real Google SERPs across 4 region/device combinations. Here's the build, and what I found.
TL;DR
- AI search tools (Claude/GPT WebSearch) returned a "top 10" that diverged significantly from what real Google actually shows.
- I missed 3 major competitors entirely (`turboscribe.ai`, `aitodo.co`, `stt.ai`) that dominate real SERPs but didn't appear in the AI's results.
- Real SERP top 1 changes by country for the same keyword: India top 1 = Kapwing, US top 1 = Aitodo.
- AI Overview triggers in 11 of 32 SERPs — asymmetric impact across regions and queries.
- The whole pipeline runs in ~7 minutes per quarterly snapshot. Code below.
Why "AI WebSearch ≠ real Google SERP"
When you ask Claude or ChatGPT to "search Google for X", you don't get real Google SERP. You get:
- A processed/aggregated response from whatever search API the AI is wired to
- Snippets and titles that may be cached or rewritten
- No SERP features: no AI Overview, no People Also Ask, no Discussions panels, no video carousels
- No regional variance: usually a US-centric default
For broad questions ("what's the capital of France"), this is fine. For SEO competitive analysis where you need to know what your competitors look like in real SERPs, it's not enough.
I had to scrape real Google.
The Anti-Bot Problem (and Why Plain Curl Fails)
You can't just `curl https://google.com/search?q=...` from a script. Google flags the request in milliseconds based on:
- TLS / HTTP/2 fingerprint — `curl` and `requests` look nothing like real browsers
- JavaScript challenge — most SERP content is JS-rendered; static HTML is mostly an empty shell
- IP reputation — datacenter IPs (AWS / GCP / Oracle Cloud / etc.) are flagged on sight
After two /sorry/ redirects in 30 seconds, I knew: I needed real browser automation.
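You can reproduce the failure in a few lines. A minimal sketch (plain `requests` standing in for `curl`; run it from any cloud box):

```python
import requests

# Naive attempt: no real browser, datacenter IP. In my runs Google answered
# with a redirect to /sorry/ (its captcha interstitial) almost immediately.
resp = requests.get(
    "https://www.google.com/search?q=douyin+transcript",
    headers={"User-Agent": "Mozilla/5.0"},
    allow_redirects=False,
    timeout=10,
)
print(resp.status_code, resp.headers.get("Location", ""))
# Typical output from a cloud box: 302 https://www.google.com/sorry/index?...
```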
The Stack
I ended up with this combination, all free except for a $5/month Clash subscription (which I had anyway for general work):
| Tool | Why |
|---|---|
| CloakBrowser | Patched Chromium at the C++ source level (Canvas/WebGL/font/timezone fingerprinting). Drop-in Playwright API. Free. |
| Clash Verge (with API enabled) | Switch exit nodes by country (India, US, Hong Kong) for region-specific SERPs |
| Python + Playwright sync API | The collector |
| Persistent BrowserContext + manual captcha | First captcha per IP solved by hand → Google issues GOOGLE_ABUSE_EXEMPTION cookie → subsequent queries skip the challenge |
Critical detail: I run desktop UA on datacenter IPs, not mobile. Why? Mobile UA + datacenter IP is a contradiction Google flags instantly (mobile users don't browse from Oracle Cloud). Desktop UA passes the smell test.
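The collector below leans on a tiny `clash` helper for node switching. A minimal sketch of it, assuming Clash's external controller on `127.0.0.1:9097`, a selector group named `GLOBAL`, a local mixed port of `7897`, and country-tagged node names (all four are my local config, adjust to yours):

```python
import time
import requests

CLASH_API = "http://127.0.0.1:9097"  # external controller address (my config)
GROUP = "GLOBAL"                     # proxy selector group name (my config)
PROXY = {"http": "http://127.0.0.1:7897", "https": "http://127.0.0.1:7897"}

# Hypothetical node names; use whatever your subscription calls them
COUNTRY_NODES = {"in": "IN-Mumbai-01", "us": "US-LosAngeles-02", "hk": "HK-01"}

class Clash:
    def switch_to_country(self, country: str) -> None:
        # Clash's REST API selects a node via PUT /proxies/<group> {"name": <node>}
        requests.put(
            f"{CLASH_API}/proxies/{GROUP}",
            json={"name": COUNTRY_NODES[country]},
            timeout=5,
        ).raise_for_status()

    def verify_ip_country(self, country: str, max_wait: int = 30) -> None:
        # Poll a geo-IP service through the proxy until the exit country matches
        deadline = time.time() + max_wait
        while time.time() < deadline:
            try:
                geo = requests.get("http://ip-api.com/json", proxies=PROXY, timeout=5).json()
                if geo.get("countryCode", "").lower() == country:
                    return
            except requests.RequestException:
                pass
            time.sleep(2)
        raise RuntimeError(f"exit IP never resolved to {country}")

clash = Clash()
```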
The Collector (simplified)
The full collector is ~600 lines (region grouping, state recovery, scroll simulation, captcha pause), but the core flow is:
```python
from cloakbrowser import launch_persistent_context

# `clash` is the small Clash-API helper sketched above; `country`, `device`,
# `query`, and DESKTOP_UA come from the run loop around this snippet.

# 1. Switch Clash exit node to target country
clash.switch_to_country("in")               # India
clash.verify_ip_country("in", max_wait=30)  # confirm exit IP is actually India

# 2. Launch browser with persistent profile (cookie cache survives captcha solve)
ctx = launch_persistent_context(
    user_data_dir=f"./profile_{country}_{device}",
    user_agent=DESKTOP_UA,
    viewport={"width": 1920, "height": 1080},
    humanize=True,  # CloakBrowser injects realistic mouse/click timing
    locale="en-US",
    timezone="Asia/Kolkata",
)

# 3. Navigate (gl = result country, hl = UI language, pws=0 kills personalization)
url = f"https://www.google.com/search?q={query}&gl=in&hl=en&num=20&pws=0"
page = ctx.new_page()
page.goto(url)
page.wait_for_timeout(5000)  # give the JS-rendered SERP time to settle

# 4. Detect captcha
if "/sorry/" in page.url:
    input("⚠️ Captcha triggered. Solve in browser, press Enter to retry...")
    page.goto(url)  # retry; the GOOGLE_ABUSE_EXEMPTION cookie is now set

# 5. Trigger lazy-load (scroll down in steps, then jump back to top)
for _ in range(5):
    page.mouse.wheel(0, 800)
    page.wait_for_timeout(400)
page.evaluate("() => window.scrollTo(0, 0)")

# 6. Snapshot: HTML + full-page screenshot + metadata
html = page.content()
png = page.screenshot(full_page=True)
# ... write to disk with sha256 hash for audit trail
```
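That final write-to-disk step is nothing fancy. One way to do it (the file layout and `save_snapshot` name are my own convention):

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def save_snapshot(outdir: str, query: str, html: str, png: bytes) -> None:
    # A content hash ties the HTML, screenshot, and metadata together for auditing
    digest = hashlib.sha256(html.encode()).hexdigest()[:16]
    base = pathlib.Path(outdir) / f"{query.replace(' ', '_')}_{digest}"
    base.with_suffix(".html").write_text(html, encoding="utf-8")
    base.with_suffix(".png").write_bytes(png)
    base.with_suffix(".json").write_text(json.dumps({
        "query": query,
        "sha256": digest,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }))
```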
Then I parse the HTML with BeautifulSoup. Tip: mobile and desktop SERPs use different selectors for organic results (`<h3>` on desktop, `<div role="heading">` on mobile). My first parser missed all mobile results until I caught this.
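A stripped-down version of that parsing step. Google's SERP markup shifts constantly, so treat `parse_organic` and these selector assumptions as a starting point, not gospel:

```python
from bs4 import BeautifulSoup

def parse_organic(html: str, device: str) -> list[dict]:
    """Extract organic results; desktop and mobile SERPs use different heading markup."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "h3" if device == "desktop" else "div[role=heading]"
    results = []
    for heading in soup.select(selector):
        link = heading.find_parent("a")  # organic headings sit inside the result's <a>
        href = link.get("href", "") if link else ""
        if href.startswith("http"):
            results.append({"title": heading.get_text(strip=True), "url": href})
    return results
```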
The Findings
Finding 1: AI WebSearch ≠ real SERP
For `douyin transcript` — top 5 according to Claude's WebSearch:
1. ScreenApp
2. ScreenApp (Chinese variant)
3. Apify
4. yeschat.ai
5. TokScript
Real Google (India, desktop, my datacenter IP):
1. screenapp.io
2. apify.com
3. stt.ai ← Claude missed this entirely
4. dupdub.com ← Claude missed
5. turboscribe.ai ← Claude missed
Three competitors I would have ignored in my SEO strategy if I'd trusted the AI alone.
Finding 2: Top 1 changes by country
For `xiaohongshu transcript generator`:
| Region/Device | Top 1 |
|---|---|
| India desktop | kapwing.com |
| India mobile | kapwing.com |
| US desktop | aitodo.co |
| US mobile | aitodo.co |
The AI WebSearch had told me ScreenApp was top 1 for everything Xiaohongshu-related. It's not even in the top 5 for this exact keyword. Different countries = different competitors.
Finding 3: AI Overview asymmetry
11 of 32 SERPs triggered AI Overview. Distribution surprised me:
- All 8 Hong Kong (Chinese) SERPs triggered it
- 3 of 16 English SERPs triggered (mostly `xiaohongshu transcript` variants)
- 0 of 8 `douyin transcript` SERPs triggered
If you're doing SEO in a language Google has heavily AI-Overview-ified, expect significant click-through compression. If you're doing English transcript queries, you're (currently) safe.
Finding 4: My GSC ranking ≠ real SERP
Google Search Console reported I was ranking position 2.43 for my best keyword. The real SERP I scraped: I'm not in the top 14 anywhere. Two possibilities:
- GSC is averaging over 3 months of impressions, with rare top-3 hits pulling the mean down
- Datacenter IP sees a different SERP than real residential users
Either way: don't trust GSC ranks alone for strategy. Cross-validate.
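The averaging hypothesis is easy to sanity-check: GSC's average position is impression-weighted, and days where you don't surface at all contribute no data points, so they never drag the average toward reality. A toy illustration with invented numbers:

```python
# Hypothetical 3-month impression log: a good stretch at position 2-3 early on,
# a couple of stragglers later, and long ranking-free gaps that log nothing.
impressions = [(2, 50), (3, 8), (12, 2)]  # (position, impression_count)

avg = sum(pos * n for pos, n in impressions) / sum(n for _, n in impressions)
print(round(avg, 2))  # 2.47: looks like a solid #2, yet live checks find nothing
```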
What I'd Tell Past-Me
AI search tools are great for general questions, dangerous for SEO competitive research. Always cross-validate with at least one real SERP scrape.
Datacenter IP scraping is a first-pass tool, not a final answer. Google may show different results to residential vs datacenter exit nodes. For final strategy, validate with a residential proxy ($7-10 from IPRoyal for a one-shot run).
Mobile and desktop SERPs are completely different worlds. If your audience is 90% mobile (mine is — India + Pakistan + Thailand), don't analyze desktop only.
Quarterly cadence matters more than perfect data. Build the pipeline once, run it every 3 months, diff the results. Trends > single snapshots.
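The diff itself can be tiny. A minimal sketch, assuming each run dumps a `{keyword: [ranked domains]}` JSON (my format, not anything standard):

```python
import json

def diff_snapshots(old_path: str, new_path: str) -> None:
    # Print which domains entered or dropped out of each keyword's top results
    old = json.load(open(old_path))
    new = json.load(open(new_path))
    for kw in sorted(set(old) & set(new)):
        entered = [d for d in new[kw] if d not in old[kw]]
        dropped = [d for d in old[kw] if d not in new[kw]]
        if entered or dropped:
            print(f"{kw}: +{entered} -{dropped}")

diff_snapshots("serp_2025Q1.json", "serp_2025Q2.json")  # hypothetical file names
```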
Build in public. I shared this approach in the project's repo so other indie founders doing SEO can fork it. (If anyone has tips on residential proxies that don't break the bank, please share.)
What's Next
I'm planning a residential IP run next month to compare against this datacenter snapshot, and probably a Bing/DuckDuckGo cross-comparison for users in countries where those have non-trivial market share.
If you're an indie founder doing your own SEO, I'd love to hear how you handle competitive research. Drop a comment below, or check out ShenBi AI — the project that started all this.