I'm building ShenBi AI — an AI tool that turns Chinese short-video links (Douyin, Xiaohongshu, Kuaishou) into structured transcripts and rewrite-ready scripts. As a solo founder doing SEO myself, I needed a competitive analysis: who's ranking for `douyin transcript` and similar long-tail queries?
I asked Claude. It gave me a clean "top 10" list — ScreenApp, BibiGPT, Apify, and so on. Easy. I trusted it.
Then a small voice asked: is this actually what Google shows real users? Or is the AI returning some aggregator data (Bing? Brave? Custom Search API?) and calling it Google?
I decided to find out the hard way: scrape 32 real Google SERPs across 4 region/device combinations. Here's the build, and what I found.
TL;DR
- AI search tools (Claude/GPT WebSearch) returned a "top 10" that diverged significantly from what real Google actually shows.
- I missed 3 major competitors entirely (`turboscribe.ai`, `aitodo.co`, `stt.ai`) that dominate real SERPs but didn't appear in the AI's results.
- Real SERP top 1 changes by country for the same keyword: India top 1 = Kapwing, US top 1 = Aitodo.
- AI Overview triggers in 11 of 32 SERPs — asymmetric impact across regions and queries.
- The whole pipeline runs in ~7 minutes per quarterly snapshot. Code below.
Why "AI WebSearch ≠ real Google SERP"
When you ask Claude or ChatGPT to "search Google for X", you don't get real Google SERP. You get:
- A processed/aggregated response from whatever search API the AI is wired to
- Snippets and titles that may be cached or rewritten
- No SERP features: no AI Overview, no People Also Ask, no Discussions panels, no video carousels
- No regional variance: usually a US-centric default
For broad questions ("what's the capital of France"), this is fine. For SEO competitive analysis where you need to know what your competitors look like in real SERPs, it's not enough.
I had to scrape real Google.
The Anti-Bot Problem (and Why Plain Curl Fails)
You can't just `curl https://google.com/search?q=...` from a script. Google flags the request in milliseconds based on:
- TLS / HTTP/2 fingerprint — `curl` and `requests` look nothing like real browsers
- JavaScript challenge — most SERP content is JS-rendered; static HTML is mostly an empty shell
- IP reputation — datacenter IPs (AWS / GCP / Oracle Cloud / etc.) are flagged on sight
After two /sorry/ redirects in 30 seconds, I knew: I needed real browser automation.
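You can reproduce the failure in a few lines. A minimal sketch (plain `requests` standing in for `curl`; run it from any cloud box):

```python
import requests

# Naive attempt: no real browser, datacenter IP. In my runs Google answered
# with a redirect to /sorry/ (its captcha interstitial) almost immediately.
resp = requests.get(
    "https://www.google.com/search?q=douyin+transcript",
    headers={"User-Agent": "Mozilla/5.0"},
    allow_redirects=False,
    timeout=10,
)
print(resp.status_code, resp.headers.get("Location", ""))
# Typical output from a cloud box: 302 https://www.google.com/sorry/index?...
```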
The Stack
I ended up with this combination, all free except for a $5/month Clash subscription (which I had anyway for general work):
| Tool | Why |
|---|---|
| CloakBrowser | Patched Chromium at the C++ source level (Canvas/WebGL/font/timezone fingerprinting). Drop-in Playwright API. Free. |
| Clash Verge (with API enabled) | Switch exit nodes by country (India, US, Hong Kong) for region-specific SERPs |
| Python + Playwright sync API | The collector |
| Persistent BrowserContext + manual captcha | First captcha per IP solved by hand → Google issues GOOGLE_ABUSE_EXEMPTION cookie → subsequent queries skip the challenge |
Critical detail: I run desktop UA on datacenter IPs, not mobile. Why? Mobile UA + datacenter IP is a contradiction Google flags instantly (mobile users don't browse from Oracle Cloud). Desktop UA passes the smell test.
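The collector below leans on a tiny `clash` helper for node switching. A minimal sketch of it, assuming Clash's external controller on `127.0.0.1:9097`, a selector group named `GLOBAL`, a local mixed port of `7897`, and country-tagged node names (all four are my local config, adjust to yours):

```python
import time
import requests

CLASH_API = "http://127.0.0.1:9097"  # external controller address (my config)
GROUP = "GLOBAL"                     # proxy selector group name (my config)
PROXY = {"http": "http://127.0.0.1:7897", "https": "http://127.0.0.1:7897"}

# Hypothetical node names; use whatever your subscription calls them
COUNTRY_NODES = {"in": "IN-Mumbai-01", "us": "US-LosAngeles-02", "hk": "HK-01"}

class Clash:
    def switch_to_country(self, country: str) -> None:
        # Clash's REST API selects a node via PUT /proxies/<group> {"name": <node>}
        requests.put(
            f"{CLASH_API}/proxies/{GROUP}",
            json={"name": COUNTRY_NODES[country]},
            timeout=5,
        ).raise_for_status()

    def verify_ip_country(self, country: str, max_wait: int = 30) -> None:
        # Poll a geo-IP service through the proxy until the exit country matches
        deadline = time.time() + max_wait
        while time.time() < deadline:
            try:
                geo = requests.get("http://ip-api.com/json", proxies=PROXY, timeout=5).json()
                if geo.get("countryCode", "").lower() == country:
                    return
            except requests.RequestException:
                pass
            time.sleep(2)
        raise RuntimeError(f"exit IP never resolved to {country}")

clash = Clash()
```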
The Collector (simplified)
The full collector is ~600 lines (region grouping, state recovery, scroll simulation, captcha pause), but the core flow is:
```python
from cloakbrowser import launch_persistent_context

# `clash` is the small Clash-API helper sketched above; `country`, `device`,
# `query`, and DESKTOP_UA come from the run loop around this snippet.

# 1. Switch Clash exit node to target country
clash.switch_to_country("in")               # India
clash.verify_ip_country("in", max_wait=30)  # confirm exit IP is actually India

# 2. Launch browser with persistent profile (cookie cache survives captcha solve)
ctx = launch_persistent_context(
    user_data_dir=f"./profile_{country}_{device}",
    user_agent=DESKTOP_UA,
    viewport={"width": 1920, "height": 1080},
    humanize=True,  # CloakBrowser injects realistic mouse/click timing
    locale="en-US",
    timezone="Asia/Kolkata",
)

# 3. Navigate (gl = result country, hl = UI language, pws=0 kills personalization)
url = f"https://www.google.com/search?q={query}&gl=in&hl=en&num=20&pws=0"
page = ctx.new_page()
page.goto(url)
page.wait_for_timeout(5000)  # give the JS-rendered SERP time to settle

# 4. Detect captcha
if "/sorry/" in page.url:
    input("⚠️ Captcha triggered. Solve in browser, press Enter to retry...")
    page.goto(url)  # retry; the GOOGLE_ABUSE_EXEMPTION cookie is now set

# 5. Trigger lazy-load (scroll down in steps, then jump back to top)
for _ in range(5):
    page.mouse.wheel(0, 800)
    page.wait_for_timeout(400)
page.evaluate("() => window.scrollTo(0, 0)")

# 6. Snapshot: HTML + full-page screenshot + metadata
html = page.content()
png = page.screenshot(full_page=True)
# ... write to disk with sha256 hash for audit trail
```
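That final write-to-disk step is nothing fancy. One way to do it (the file layout and `save_snapshot` name are my own convention):

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def save_snapshot(outdir: str, query: str, html: str, png: bytes) -> None:
    # A content hash ties the HTML, screenshot, and metadata together for auditing
    digest = hashlib.sha256(html.encode()).hexdigest()[:16]
    base = pathlib.Path(outdir) / f"{query.replace(' ', '_')}_{digest}"
    base.with_suffix(".html").write_text(html, encoding="utf-8")
    base.with_suffix(".png").write_bytes(png)
    base.with_suffix(".json").write_text(json.dumps({
        "query": query,
        "sha256": digest,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }))
```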
Then I parse the HTML with BeautifulSoup. Tip: mobile and desktop SERPs use different selectors for organic results (`<h3>` on desktop, `<div role="heading">` on mobile). My first parser missed all mobile results until I caught this.
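A stripped-down version of that parsing step. Google's SERP markup shifts constantly, so treat `parse_organic` and these selector assumptions as a starting point, not gospel:

```python
from bs4 import BeautifulSoup

def parse_organic(html: str, device: str) -> list[dict]:
    """Extract organic results; desktop and mobile SERPs use different heading markup."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "h3" if device == "desktop" else "div[role=heading]"
    results = []
    for heading in soup.select(selector):
        link = heading.find_parent("a")  # organic headings sit inside the result's <a>
        href = link.get("href", "") if link else ""
        if href.startswith("http"):
            results.append({"title": heading.get_text(strip=True), "url": href})
    return results
```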
The Findings
Finding 1: AI WebSearch ≠ real SERP
For `douyin transcript` — top 5 according to Claude's WebSearch:
1. ScreenApp
2. ScreenApp (Chinese variant)
3. Apify
4. yeschat.ai
5. TokScript
Real Google (India, desktop, my datacenter IP):
1. screenapp.io
2. apify.com
3. stt.ai ← Claude missed this entirely
4. dupdub.com ← Claude missed
5. turboscribe.ai ← Claude missed
Three competitors I would have ignored in my SEO strategy if I'd trusted the AI alone.
Finding 2: Top 1 changes by country
For `xiaohongshu transcript generator`:
| Region/Device | Top 1 |
|---|---|
| India desktop | kapwing.com |
| India mobile | kapwing.com |
| US desktop | aitodo.co |
| US mobile | aitodo.co |
The AI WebSearch had told me ScreenApp was top 1 for everything Xiaohongshu-related. It's not even in the top 5 for this exact keyword. Different countries = different competitors.
Finding 3: AI Overview asymmetry
11 of 32 SERPs triggered AI Overview. Distribution surprised me:
- All 8 Hong Kong (Chinese) SERPs triggered it
- 3 of 16 English SERPs triggered (mostly `xiaohongshu transcript` variants)
- 0 of 8 `douyin transcript` SERPs triggered
If you're doing SEO in a language Google has heavily AI-Overview-ified, expect significant click-through compression. If you're doing English transcript queries, you're (currently) safe.
Finding 4: My GSC ranking ≠ real SERP
Google Search Console reported I was ranking position 2.43 for my best keyword. The real SERP I scraped: I'm not in the top 14 anywhere. Two possibilities:
- GSC is averaging over 3 months of impressions, with rare top-3 hits pulling the mean down
- Datacenter IP sees a different SERP than real residential users
Either way: don't trust GSC ranks alone for strategy. Cross-validate.
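The averaging hypothesis is easy to sanity-check: GSC's average position is impression-weighted, and days where you don't surface at all contribute no data points, so they never drag the average toward reality. A toy illustration with invented numbers:

```python
# Hypothetical 3-month impression log: a good stretch at position 2-3 early on,
# a couple of stragglers later, and long ranking-free gaps that log nothing.
impressions = [(2, 50), (3, 8), (12, 2)]  # (position, impression_count)

avg = sum(pos * n for pos, n in impressions) / sum(n for _, n in impressions)
print(round(avg, 2))  # 2.47: looks like a solid #2, yet live checks find nothing
```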
What I'd Tell Past-Me
AI search tools are great for general questions, dangerous for SEO competitive research. Always cross-validate with at least one real SERP scrape.
Datacenter IP scraping is a first-pass tool, not a final answer. Google may show different results to residential vs datacenter exit nodes. For final strategy, validate with a residential proxy ($7-10 from IPRoyal for a one-shot run).
Mobile and desktop SERPs are completely different worlds. If your audience is 90% mobile (mine is — India + Pakistan + Thailand), don't analyze desktop only.
Quarterly cadence matters more than perfect data. Build the pipeline once, run it every 3 months, diff the results. Trends > single snapshots.
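The diff itself can be tiny. A minimal sketch, assuming each run dumps a `{keyword: [ranked domains]}` JSON (my format, not anything standard):

```python
import json

def diff_snapshots(old_path: str, new_path: str) -> None:
    # Print which domains entered or dropped out of each keyword's top results
    old = json.load(open(old_path))
    new = json.load(open(new_path))
    for kw in sorted(set(old) & set(new)):
        entered = [d for d in new[kw] if d not in old[kw]]
        dropped = [d for d in old[kw] if d not in new[kw]]
        if entered or dropped:
            print(f"{kw}: +{entered} -{dropped}")

diff_snapshots("serp_2025Q1.json", "serp_2025Q2.json")  # hypothetical file names
```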
Build in public. I shared this approach in the project's repo so other indie founders doing SEO can fork it. (If anyone has tips on residential proxies that don't break the bank, please share.)
What's Next
I'm planning a residential IP run next month to compare against this datacenter snapshot, and probably a Bing/DuckDuckGo cross-comparison for users in countries where those have non-trivial market share.
If you're an indie founder doing your own SEO, I'd love to hear how you handle competitive research. Drop a comment below, or check out ShenBi AI — the project that started all this.