Sami

Posted on

Scraping Chinese Social Platforms for LLM Training Data: A Practical Multi-Source Pipeline (Python, 2026)

If you're training Chinese-language models — or multilingual models that need real Chinese coverage, not just translated English — the data problem is the bottleneck. Common Crawl gives you the open web. HuggingFace gives you the curated stuff. But the linguistic patterns that matter most for cultural alignment — slang, memes, code-mixed English-Chinese, regional variations, real-time discourse — those live in places Common Crawl barely touches.

Three platforms that matter most for Chinese training corpora in 2026:

  • Weibo (微博) — 580M+ MAU, microblogging, real-time discourse, similar role to X/Twitter
  • Bilibili (哔哩哔哩) — 300M+ MAU, video platform, comments + danmaku give you code-mixed natural language at volume
  • Xiaohongshu / RedNote (小红书) — 300M+ MAU, lifestyle posts with longer-form content, female-skewed register

This post walks through how to build a multi-source pipeline that pulls clean, structured data from all three, normalizes it across platforms, and ships it into your training datasets. With code, schema, and economics.

A note on legal posture: this entire pipeline accesses only publicly visible data — no auth bypass, no captcha solving, no scraping behind login. That matches the standard most AI training teams operate under in 2026, post-NYT-vs-OpenAI. Always consult your legal team for your specific use case and jurisdiction.

Why these three (and not, say, Douyin or Zhihu)

Each platform contributes a different linguistic register:

Weibo posts are short, high-frequency, conversational. Best for:

  • Everyday Mandarin patterns
  • Trending slang and memes (热搜 reflects what's actually viral right now)
  • Public sentiment on news and policy
  • Brand-mention contexts

Bilibili comments and danmaku are unique:

  • Heavy code-mixing English ↔ Chinese (gaming, tech, anime communities)
  • Real-time chat-style language
  • Subculture vocabulary (gaming, fandom, two-dimensional culture / 二次元)
  • Longer thread discussions on long-form videos

RedNote posts lean longer and more curated:

  • Beauty / lifestyle / travel / food vocabulary
  • Product-attribute language (skincare ingredients, fashion descriptors)
  • Female-skewed register and topics
  • Aspirational / descriptive framing

Douyin (Chinese TikTok) and Kuaishou are predominantly video — text data is sparse. Zhihu (Q&A) is great for long-form but dominated by single-author voice. The triad above gives you the best balance of volume, diversity, and accessibility.

Pipeline architecture

The cleanest architecture for an AI training data pipeline:

[Weibo Scraper]    →
[Bilibili Scraper] →  [Normalize]  →  [Dedup + Filter]  →  [JSONL]
[RedNote Scraper]  →

Each scraper outputs platform-native JSON. A normalization layer flattens to a common schema. Deduplication on text hash + filtering by min-length / language detection ships clean data into your training format.

Below: I use Apify-hosted scrapers for the extraction layer (they handle anti-bot, rate limiting, and schema stability so you don't have to). The normalization + dedup is your code — straight Python.

Step 1 — Pulling from Weibo

For training data, the high-value combination is:

  • Hot search topics (real-time trending — what people are talking about right now)
  • Posts under those topics (organic conversation about real issues)
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

def collect_weibo_corpus(target_topics: int = 50, posts_per_topic: int = 100):
    # 1a. Pull current trending topics
    topics_run = client.actor("zhorex/weibo-scraper").call(run_input={
        "mode": "hot_search",
        "maxResults": target_topics,
    })
    topics = list(client.dataset(topics_run["defaultDatasetId"]).iterate_items())

    # 1b. For each topic, pull underlying posts
    corpus = []
    for topic in topics:
        posts_run = client.actor("zhorex/weibo-scraper").call(run_input={
            "mode": "search",
            "searchQuery": topic["title"],
            "maxResults": posts_per_topic,
        })
        for post in client.dataset(posts_run["defaultDatasetId"]).iterate_items():
            corpus.append({
                "platform": "weibo",
                "topic": topic["title"],
                "category": topic.get("category"),
                "text": post.get("text", ""),
                "author": post.get("authorName"),
                "engagement": (post.get("attitudesCount", 0) +
                               post.get("commentsCount", 0) +
                               post.get("repostsCount", 0)),
                "post_url": post.get("postUrl"),
                "scraped_at": post.get("scrapedAt"),
            })
    return corpus

Volume math: 50 topics × 100 posts = 5,000 items per snapshot. At $0.005/item that's $25 per pull. Run daily for a year ≈ $9,125.
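
If you want to sanity-check other cadences before committing, the arithmetic is trivial to wrap in a helper. This is just budgeting math; the function and parameter names are mine, not part of any actor API:

def estimate_pull_cost(topics: int = 50, posts_per_topic: int = 100,
                       price_per_item: float = 0.005,
                       pulls_per_year: int = 365) -> dict:
    # Back-of-envelope budgeting for a scheduled pull.
    items = topics * posts_per_topic
    per_pull = items * price_per_item
    return {
        "items_per_pull": items,                      # 5,000 with the defaults
        "cost_per_pull": per_pull,                    # $25
        "cost_per_year": per_pull * pulls_per_year,   # $9,125 if run daily
    }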

Step 2 — Pulling from Bilibili

Bilibili gives you something the others don't: comments on long-form videos. That's where heavy code-mixing happens (tech tutorials, gaming streams, study-with-me content, drama analysis). For training data, comments are higher-value than video metadata.

def collect_bilibili_comments(category: str = "knowledge",
                               videos: int = 50,
                               comments_per: int = 100):
    # Get popular videos in the category
    popular_run = client.actor("zhorex/bilibili-scraper").call(run_input={
        "mode": "popular",
        "category": category,
        "maxResults": videos,
    })
    items = list(client.dataset(popular_run["defaultDatasetId"]).iterate_items())
    bvids = [v["bvid"] for v in items if v.get("bvid")]

    # Pull comments on each
    corpus = []
    for bvid in bvids:
        comments_run = client.actor("zhorex/bilibili-scraper").call(run_input={
            "mode": "video_comments",
            "videoUrls": [f"https://www.bilibili.com/video/{bvid}"],
            "maxComments": comments_per,
            "sortComments": "hot",
        })
        for c in client.dataset(comments_run["defaultDatasetId"]).iterate_items():
            if c.get("type") != "comment":
                continue
            corpus.append({
                "platform": "bilibili",
                "category": category,
                "text": c.get("text", ""),
                "author": c.get("authorName"),
                "engagement": c.get("likeCount", 0),
                "video_bvid": bvid,
                "scraped_at": c.get("scrapedAt"),
            })
    return corpus

Note: Bilibili throttles comment depth on cloud IPs — top ~3 per video without residential proxies. For training-data scale you don't need every comment, just enough diversity, so the top-N approach is fine and cheaper.

Categories worth pulling for diverse coverage: knowledge, tech, game, life, food, fashion, cars, entertainment.
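
To actually spread coverage, loop the collector over that category list. A minimal sketch reusing collect_bilibili_comments from above (the category identifiers are assumed to match what the actor accepts):

CATEGORIES = ["knowledge", "tech", "game", "life",
              "food", "fashion", "cars", "entertainment"]

def collect_bilibili_multi(videos_per_category: int = 50,
                           comments_per_video: int = 100):
    # Concatenate per-category corpora for topical diversity.
    corpus = []
    for cat in CATEGORIES:
        corpus.extend(collect_bilibili_comments(category=cat,
                                                videos=videos_per_category,
                                                comments_per=comments_per_video))
    return corpus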

Step 3 — Pulling from RedNote

RedNote gives you longer, more curated content — good for training models on aspirational and descriptive Chinese. The seed-query approach lets you control topical distribution, important for avoiding bias toward whatever's trending the day you scrape.

def collect_rednote_corpus(seed_queries: list, posts_per_query: int = 50):
    corpus = []
    for query in seed_queries:
        run = client.actor("zhorex/rednote-xiaohongshu-scraper").call(run_input={
            "mode": "search",
            "searchQuery": query,
            "maxResults": posts_per_query,
        })
        for post in client.dataset(run["defaultDatasetId"]).iterate_items():
            corpus.append({
                "platform": "rednote",
                "topic": query,
                "text": post.get("title", ""),
                "author": (post.get("author") or {}).get("nickname"),
                "engagement": post.get("likes", 0),
                "post_url": post.get("postUrl"),
                "scraped_at": post.get("scrapedAt"),
            })
    return corpus

# Diverse seed queries spread coverage across topics
seeds = [
    "护肤心得",      # skincare experience
    "穿搭",          # outfits
    "美食推荐",      # food recommendations
    "旅行攻略",      # travel guides
    "健身打卡",      # fitness check-in
    "读书笔记",      # reading notes
    "育儿日记",      # parenting diary
    "职场感悟",      # work reflections
]
rednote_data = collect_rednote_corpus(seeds, posts_per_query=100)

For richer body content per post (beyond title), pivot to mode: post_details with the post URLs you want to deep-dive on.
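
A sketch of that pivot, reusing the client from Step 1. The post_details mode is the only thing confirmed above; the postUrls input field name is an assumption, so check the actor's input schema before running:

def enrich_rednote_posts(post_urls: list):
    # Deep-dive on selected posts to pull full body text, not just titles.
    # NOTE: "postUrls" is an assumed input field name.
    run = client.actor("zhorex/rednote-xiaohongshu-scraper").call(run_input={
        "mode": "post_details",
        "postUrls": post_urls,
    })
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())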

Step 4 — Normalization and dedup

All three scrapers produce platform-specific schemas; the per-step code above already brings them to a common shape:

{
    "platform": "weibo" | "bilibili" | "rednote",
    "topic": str,
    "text": str,
    "author": str,
    "engagement": int,
    "scraped_at": ISO8601,
}

Enough to ship into a JSONL training format. For higher quality, layer in filtering:

import hashlib

def filter_corpus(corpus, min_chars: int = 10, max_chars: int = 5000):
    seen = set()
    out = []
    for item in corpus:
        text = (item.get("text") or "").strip()
        if not (min_chars <= len(text) <= max_chars):
            continue
        h = hashlib.md5(text.encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        out.append(item)
    return out

For pretraining-grade quality, also add fastText / langdetect to filter non-Chinese content, and a profanity / PII pass appropriate to your training context.
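
A minimal sketch of that last mile, assuming the langdetect package and the collectors from Steps 1-3; the helper names are mine:

import json
from langdetect import detect  # pip install langdetect

def keep_chinese(corpus):
    # langdetect labels Chinese as "zh-cn" / "zh-tw"; drop everything else.
    out = []
    for item in corpus:
        try:
            if detect(item["text"]).startswith("zh"):
                out.append(item)
        except Exception:
            # Very short or emoji-only strings can fail detection; skip them.
            continue
    return out

def write_jsonl(corpus, path: str):
    with open(path, "w", encoding="utf-8") as f:
        for item in corpus:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

# End-to-end glue using the collectors defined in Steps 1-3
corpus = (collect_weibo_corpus()
          + collect_bilibili_comments()
          + collect_rednote_corpus(seeds))
write_jsonl(keep_chinese(filter_corpus(corpus)), "zh_corpus.jsonl")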

Economics at training-corpus scale

A reasonable Chinese-language pretraining contribution might be 10M items across platforms:

Platform    Items    Cost @ $0.005/item
Weibo       5M       $25,000
Bilibili    3M       $15,000
RedNote     2M       $10,000
Total       10M      $50,000

Apify free tier ($5/month credit) covers ~1,000 items per actor for prototyping.

For comparison, hiring 2 senior engineers to build and maintain DIY Chinese-platform extraction for 6 months: $150K-300K — and you don't even get the data, just the tooling.

For 100M+ items (real pretraining scale), volume pricing or a custom enterprise contract makes sense. See enterprise section below.

When to build vs buy

Build it yourself if:

  • You're scraping 100M+ items per month and have a dedicated team
  • You need real-time streaming below 1-second latency (this pipeline is batch)
  • Your legal team requires you to own the entire data path

Use the hosted scrapers if:

  • You're under 50M items per month per platform
  • You want time-to-data measured in hours, not months
  • You don't want to maintain three platform-specific scrapers as APIs evolve

The actors

All three at $0.005/result. Pure HTTP — no browser, no proxy required for moderate volumes.

Enterprise / training-scale

If you're building actual training corpora (not prototyping), DM me on any actor page or open an Issue with subject "Training data inquiry":

  • Custom output schemas matched to your training pipeline (Parquet / Arrow / your dialect of JSONL)
  • Volume pricing above 1M items/month per platform
  • Dedicated proxy infrastructure for sustained throughput
  • Schema stability SLA so your training runs don't break mid-epoch

Issues typically get a response within 48 hours.

FAQ

Is this legal? Each Actor accesses only publicly visible data — no auth, no captcha bypass, no login walls. The same data any anonymous browser user can see. Standard ToS-compliant scraping posture as of 2026. Consult your legal team for jurisdiction-specific guidance.

What about rate limits? The hosted Actors handle rate-limit responses with exponential backoff. For 1M+ items/day per platform, talk to me about dedicated infrastructure.

Can I get historical data? The Actors return what's currently public. For longitudinal datasets, schedule them via Apify Schedules at the cadence you need (hourly / daily / weekly) and version-control your dataset snapshots.
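
If you drive the cadence from your own scheduler (cron, Airflow, etc.) rather than Apify Schedules, a dated-snapshot pattern keeps each pull versionable. A sketch reusing the Weibo collector and write_jsonl from earlier:

from datetime import datetime, timezone

def daily_snapshot():
    # One dated file per pull; commit or upload these for longitudinal datasets.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    corpus = filter_corpus(collect_weibo_corpus(target_topics=50,
                                                posts_per_topic=100))
    write_jsonl(corpus, f"weibo_snapshot_{stamp}.jsonl")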

Do you offer streaming / real-time? Not currently. The Actors are pull-based. If you need streaming, that's a custom integration.

Other platforms? I also maintain a RedNote Shop Scraper for Xiaohongshu e-commerce listings — useful if your model needs to reason about products, pricing, or commerce vocabulary.


Other relevant work

If you're building Chinese intelligence at scale, the full suite:

If this saved you a quarter of dev time, a 30-second review on any of the Actor pages helps a lot. ⭐

Found a bug or have a feature request? Open an Issue — I usually ship fixes within 48 hours.
