
Building an AI-Native Retail Platform on GCP: Personalization + Multi-Agent Ops + Agentic RAG as One Unified Stack

A shopper searches for rain boots on your storefront. Within 120ms, your personalization engine surfaces the right products. A stock alert fires, and three AI agents coordinate a reorder without a human touching a keyboard. The customer asks a question in chat — the answer comes back grounded in live inventory and your return policy, cited and accurate.

This is not three separate AI projects. It is one unified platform — and this article shows you how to build it on GCP.


🏗️ The Three Layers of an AI-Native Retail Platform

Most retail AI initiatives start with one use case and stop there. What makes a platform is when these three capabilities are designed together, sharing infrastructure and data:

| Layer | What It Does | GCP Services |
|---|---|---|
| Real-Time Personalization | Surfaces relevant products from millions of SKUs in < 120ms | Pub/Sub, Dataflow, Vertex AI Matching Engine, Feature Store, Cloud Run |
| Multi-Agent Operations | Coordinates inventory, pricing, supplier, and customer agents in parallel | Vertex AI Reasoning Engine, Pub/Sub, BigQuery ML, Cloud Run |
| Agentic RAG | Answers complex queries grounded in live data + policy docs | Vertex AI Search, Gemini, BigQuery (as a live tool) |

The key insight: all three layers share the same data backbone — BigQuery as the source of truth, Pub/Sub as the event spine, and Vertex AI as the intelligence layer.


📐 Unified Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        FRONTEND / API GATEWAY                   │
└───────────┬──────────────────┬───────────────────┬─────────────┘
            │                  │                   │
    ┌───────▼──────┐  ┌────────▼───────┐  ┌───────▼──────────┐
    │ PERSONALI-   │  │  MULTI-AGENT   │  │   AGENTIC RAG    │
    │ ZATION       │  │  ORCHESTRATOR  │  │   (Customer Q&A) │
    │ ENGINE       │  │  (Gemini 1.5)  │  │   (Gemini +      │
    │ (Cloud Run)  │  │  (Vertex AI    │  │    Vertex Search) │
    └───────┬──────┘  │   Reasoning)   │  └───────┬──────────┘
            │         └────────┬───────┘          │
            │                  │                  │
            └──────────────────┼──────────────────┘
                               │
              ┌────────────────▼────────────────┐
              │         GOOGLE CLOUD PUB/SUB     │
              │         (Shared Event Spine)      │
              └───┬──────────┬──────────┬────────┘
                  │          │          │
          ┌───────▼──┐ ┌─────▼────┐ ┌──▼────────────┐
          │ Dataflow  │ │Specialist│ │ Vertex AI     │
          │ Streaming │ │ Agents   │ │ Search Index  │
          └───────┬──┘ └─────┬────┘ └──┬────────────┘
                  │          │          │
              ┌───▼──────────▼──────────▼───┐
              │          BIGQUERY            │
              │   (Shared Operational Store) │
              └─────────────────────────────┘

🎯 Layer 1: Real-Time Personalization Engine

The Core Problem

Daily batch recommendations ignore the most powerful signal available: what the user is doing right now. A shopper who just added rain boots to their cart does not want yesterday's trending sneakers.

Design principle: Personalization is a retrieval problem. Given a user and their context right now, find the items most likely to convert — in under 120ms.

The Six-Stage Pipeline

Stage 1 — Event Capture (Pub/Sub)

Every user interaction fires a structured event to Pub/Sub. The client SDK is fire-and-forget — it does not wait for a response.

{
  "event_type": "CART_ADD",
  "user_id": "u_8821",
  "sku_id": "SKU-4471",
  "session_id": "s_992abc",
  "ts": "2026-03-22T14:03:11Z",
  "context": { "device": "mobile", "location": "Atlanta, GA" }
}
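
In code, the fire-and-forget publish looks roughly like this (a sketch assuming the `google-cloud-pubsub` client; project and topic names are placeholders). The payload builder is split out so it can be validated independently of the publish:

```python
import json
from datetime import datetime, timezone

def build_event(event_type: str, user_id: str, sku_id: str,
                session_id: str, **context) -> bytes:
    """Serialize a clickstream event to the JSON schema shown above."""
    event = {
        "event_type": event_type,
        "user_id": user_id,
        "sku_id": sku_id,
        "session_id": session_id,
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "context": context,
    }
    return json.dumps(event).encode("utf-8")

# Fire-and-forget publish (requires google-cloud-pubsub and a real topic;
# we do not wait on the returned future):
# from google.cloud import pubsub_v1
# publisher = pubsub_v1.PublisherClient()
# topic = publisher.topic_path("my-project", "clickstream-events")
# publisher.publish(topic, build_event("CART_ADD", "u_8821", "SKU-4471",
#                                      "s_992abc", device="mobile"))
```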

Stage 2 — Stream Enrichment (Dataflow)

A Dataflow streaming job picks up events, joins with item metadata from BigQuery, and writes two outputs:

  • Session feature update → Vertex AI Feature Store (< 5s latency)
  • Interaction log → BigQuery (for offline model training)
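
The enrichment step is conceptually a stream-table join. Here is the per-event logic as a pure-Python sketch; the Beam/Dataflow wiring, and the idea of holding catalog metadata as a periodically refreshed side input, are assumptions about how you would wire it:

```python
def enrich_event(event: dict, item_metadata: dict) -> dict:
    """Join a raw clickstream event with catalog metadata for its SKU.

    item_metadata maps sku_id -> {"category": ..., "price": ...};
    in the real pipeline this would be a side input refreshed from BigQuery.
    """
    meta = item_metadata.get(event["sku_id"], {})
    return {
        **event,
        "category": meta.get("category", "unknown"),
        "price": meta.get("price"),
    }

catalog = {"SKU-4471": {"category": "footwear", "price": 59.99}}
enriched = enrich_event({"event_type": "CART_ADD", "sku_id": "SKU-4471"}, catalog)
# enriched now carries category/price for both the Feature Store
# and BigQuery sinks
```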

Stage 3 — Feature Assembly (Vertex AI Feature Store)

At query time, three feature groups are fetched in a single low-latency call:

# Sketch — client setup omitted. streaming_read_feature_values accepts a
# list of entity IDs; read_feature_values is the single-entity variant.
feature_store_client.streaming_read_feature_values(
    entity_type="user",
    entity_ids=[user_id],
    feature_selector={
        "id_matcher": {
            "ids": ["purchase_history", "session_clicks", "device_type", "location"]
        }
    }
)

Stage 4 — ANN Retrieval (Vertex AI Matching Engine)

The assembled user context vector is submitted to Matching Engine — Google's managed ANN index. It returns the top 50 candidate SKUs from a catalog of millions in under 10ms.

response = index_endpoint.find_neighbors(
    deployed_index_id="retail_item_embeddings",
    queries=[user_context_vector],
    num_neighbors=50
)

Under the hood: Google's ScaNN algorithm, pre-filtered by in-stock status so the re-ranker never sees unavailable items.
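
Conceptually, pre-filtered retrieval behaves like this brute-force sketch. The real index runs ScaNN over millions of items; the toy cosine scan here only shows why the stock filter applies before top-k selection:

```python
import math

def top_k_in_stock(query_vec, items, k=50):
    """Brute-force stand-in for pre-filtered ANN retrieval: drop
    out-of-stock items *before* scoring, then take top-k by cosine
    similarity, so the re-ranker never sees unavailable items."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    candidates = [it for it in items if it["in_stock"]]
    scored = sorted(candidates,
                    key=lambda it: cosine(query_vec, it["embedding"]),
                    reverse=True)
    return [it["sku"] for it in scored[:k]]

items = [
    {"sku": "A", "embedding": [1.0, 0.0], "in_stock": True},
    {"sku": "B", "embedding": [0.9, 0.1], "in_stock": False},  # filtered out
    {"sku": "C", "embedding": [0.0, 1.0], "in_stock": True},
]
print(top_k_in_stock([1.0, 0.0], items, k=2))  # → ['A', 'C']
```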

Stage 5 — Re-Ranking (Vertex AI Prediction)

A lightweight model re-scores the 50 candidates using signals the embedding index cannot capture:

  • Current inventory level
  • Promotional pricing flag
  • User's price sensitivity segment
  • Real-time trend score
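
A minimal linear re-scorer over those signals might look like this; the weights and field names are illustrative, since production would serve a trained model on Vertex AI Prediction:

```python
def rerank(candidates: list, top_n: int = 10) -> list:
    """Re-score ANN candidates with business signals the embedding
    index cannot see, then keep the top N. Weights are made up."""
    def score(c):
        return (0.5 * c["ann_score"]
                + 0.2 * min(c["inventory"] / 100, 1.0)   # availability
                + 0.2 * (1.0 if c["on_promo"] else 0.0)  # promo flag
                + 0.1 * c["trend_score"])                # real-time trend
    return sorted(candidates, key=score, reverse=True)[:top_n]

candidates = [
    {"sku": "A", "ann_score": 0.9, "inventory": 2,  "on_promo": False, "trend_score": 0.1},
    {"sku": "B", "ann_score": 0.8, "inventory": 80, "on_promo": True,  "trend_score": 0.7},
]
print([c["sku"] for c in rerank(candidates)])  # → ['B', 'A']
```

Note how a slightly weaker embedding match (B) wins once inventory and promotion signals are in play.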

Stage 6 — Serve (Cloud Run)

Top 10 results + display metadata returned to the frontend. End-to-end: < 120ms at p99.

Handling Cold Start

| Scenario | Strategy |
|---|---|
| New user (no history) | Serve contextual top-trending items by device + time + location |
| New item (no interactions) | Content-based embedding from product description + image on ingestion |
| After first click | Session features kick in within 5 seconds |
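
The table above reduces to a simple dispatch; the helper functions here are hypothetical stand-ins for the real retrieval paths:

```python
def recommend(user_history, session_clicks, context):
    """Cold-start dispatch: fall back to contextual trending items
    until any personal signal exists. The *_items helpers are stubs."""
    if not user_history and not session_clicks:
        return trending_items(context["device"], context["hour"], context["location"])
    return personalized_items(user_history, session_clicks)

def trending_items(device, hour, location):
    # Stub: would query a trending-by-context table
    return [f"trend-{device}-{hour}-{location}"]

def personalized_items(history, clicks):
    # Stub: would run the full six-stage pipeline
    return ["personal-pick"]

print(recommend([], [], {"device": "mobile", "hour": 14, "location": "ATL"}))
# → ['trend-mobile-14-ATL']
```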

🤖 Layer 2: Multi-Agent Operations

The Core Problem

A single LLM handling all retail operations hits three walls: context overload, sequential latency, and unmaintainable prompts. When the inventory rule, pricing model, supplier contract, and customer policy all need to fit in one context — reasoning quality degrades.

Design principle: Treat operations like a well-run team. One orchestrator receives requests and coordinates specialists. Each specialist does one thing well.

Agent Architecture

Operator / System Trigger
        │
        ▼
┌─────────────────────────────────┐
│  ORCHESTRATOR AGENT             │
│  Gemini 1.5 Pro                 │
│  Vertex AI Reasoning Engine     │
│  - Decomposes tasks             │
│  - Routes to specialists        │
│  - Synthesizes final response   │
└────┬──────────┬──────────┬──────┘
     │  Pub/Sub │          │          │
     ▼          ▼          ▼          ▼
┌─────────┐ ┌────────┐ ┌──────────┐ ┌──────────┐
│Inventory│ │Pricing │ │Supplier  │ │Customer  │
│Agent    │ │Agent   │ │Agent     │ │Agent     │
│BigQuery │ │BQ ML   │ │Vertex AI │ │Agentic   │
│         │ │        │ │Search    │ │RAG ←────── Layer 3
└─────────┘ └────────┘ └──────────┘ └──────────┘

Notice: the Customer Agent IS Layer 3 — Agentic RAG is not separate, it is the intelligence layer of the Customer Agent. This is where the three layers connect.

A Reorder Request — Traced End-to-End

Input: "Should we reorder SKU-991?"

Step 1 — Decompose: Orchestrator identifies three parallel sub-tasks.

tasks = orchestrator.decompose(query)
# → [
#     {"agent": "inventory", "task": "get_stock_level", "sku": "SKU-991"},
#     {"agent": "supplier",  "task": "get_eta_and_cost", "sku": "SKU-991"},
#     {"agent": "pricing",   "task": "get_reorder_cost", "sku": "SKU-991"}
# ]

Step 2 — Dispatch: All three tasks published to Pub/Sub simultaneously.

Step 3 — Execute in Parallel: Each Cloud Run agent handles its task independently:

# Inventory Agent
stock = bq_client.query("""
    SELECT units_available FROM inventory_snapshot
    WHERE sku_id = 'SKU-991' AND store_id = 'DC-ATL'
""").result()

# Pricing Agent (BigQuery ML) — ML.PREDICT is a table function,
# so it is queried with SELECT * FROM, not as a scalar expression
reorder_cost = bq_client.query("""
    SELECT * FROM ML.PREDICT(MODEL `retail.pricing_model`,
        (SELECT * FROM pricing_signals WHERE sku_id = 'SKU-991'))
""").result()

Step 4 — Synthesize:

Orchestrator → "Reorder 50 units from Vendor A at $4.20/unit, ETA 3 days. 
                Current stock: 8 units (below reorder threshold of 15)." ✅

Total time = max(slowest agent) — not the sum of all three.
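
That latency property falls out of the fan-out. A sketch with stubbed agents and a thread pool (the real system dispatches over Pub/Sub, not threads, but the max-not-sum property is the same):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub specialists — each would be a Cloud Run service in production
AGENTS = {
    "inventory": lambda sku: {"stock": 8},
    "supplier":  lambda sku: {"eta_days": 3, "unit_cost": 4.20},
    "pricing":   lambda sku: {"reorder_cost": 210.0},
}

def dispatch_parallel(sku: str) -> dict:
    """Fan out one task per specialist agent; wall-clock time is
    bounded by the slowest agent, not the sum of all of them."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {name: pool.submit(fn, sku) for name, fn in AGENTS.items()}
        return {name: f.result() for name, f in futures.items()}

results = dispatch_parallel("SKU-991")
print(results["inventory"]["stock"])  # → 8
```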

The Pub/Sub Design — Why It Matters

Three properties you get for free:

  • Loose coupling: agents have no direct dependency on each other, only on topic names
  • Fault tolerance: if an agent crashes, the message is retained and redelivered on recovery
  • Independent scaling: each Cloud Run agent scales on its own Pub/Sub queue depth

Shared Memory: The agent_decision_log Table

Every orchestrated request is fully logged:

CREATE TABLE retail.agent_decision_log (
  request_id      STRING,
  ts              TIMESTAMP,
  agent_called    STRING,
  tools_used      ARRAY<STRING>,
  input_payload   JSON,
  output_payload  JSON,
  latency_ms      INT64,
  confidence      FLOAT64
);

This table powers weekly evaluation reports and feeds back into model fine-tuning — your audit trail is also your training dataset.
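
A sketch of the per-call row builder; `insert_rows_json` is BigQuery's streaming-insert method, but the client setup and table reference are assumed:

```python
import json
import time
import uuid

def decision_log_row(agent: str, tools: list,
                     input_payload: dict, output_payload: dict,
                     latency_ms: int, confidence: float) -> dict:
    """Build one row matching the retail.agent_decision_log schema."""
    return {
        "request_id": str(uuid.uuid4()),
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "agent_called": agent,
        "tools_used": tools,
        "input_payload": json.dumps(input_payload),
        "output_payload": json.dumps(output_payload),
        "latency_ms": latency_ms,
        "confidence": confidence,
    }

row = decision_log_row("inventory", ["bigquery"], {"sku": "SKU-991"},
                       {"stock": 8}, 42, 0.97)
# Streaming insert (requires google-cloud-bigquery and a real table):
# bq_client.insert_rows_json("retail.agent_decision_log", [row])
```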


📚 Layer 3: Agentic RAG for Retail Knowledge

The Core Problem

Standard RAG (embed query → retrieve chunks → generate) falls short in retail for three reasons:

  • A single customer question often spans multiple knowledge domains (policy + inventory + product specs)
  • Inventory data goes stale in minutes — you cannot index it as static documents
  • Retrieval confidence varies — a system that cannot detect low-confidence answers will hallucinate

Design principle: RAG should reason, not just retrieve. The agent decides which source to query, validates the result, and cites its sources.

Three Retrieval Sources

1. Policy & Compliance Index (Vertex AI Search)

Return policies, warranty terms, BOPIS rules, hazmat shipping. Indexed as documents with hybrid retrieval (dense semantic + sparse BM25 keyword).

BM25 matters here: product part numbers and model codes are not well-served by pure vector search. Hybrid retrieval handles both.
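
One common way to merge the dense and sparse result lists is reciprocal-rank fusion; a toy sketch (the constant k = 60 is a conventional default, not a value from this system):

```python
def rrf_fuse(dense: list, sparse: list, k: int = 60) -> list:
    """Reciprocal-rank fusion: score(doc) = sum over lists of 1/(k + rank).
    Docs that rank well in either list (semantic match OR exact part
    number) surface near the top of the fused ranking."""
    scores = {}
    for results in (dense, sparse):
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc-jacket", "doc-boots", "doc-battery"]   # vector ranking
sparse = ["doc-battery", "doc-jacket"]                # BM25 nails the part number
print(rrf_fuse(dense, sparse)[0])  # → 'doc-jacket'
```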

2. Product Catalog Index (Vertex AI Search)

Product descriptions, specs, compatibility notes, sizing guides. Indexed with multimodal embeddings (text + image) so "waterproof jacket similar to this one" works.

3. Live Operational Data (BigQuery as a Tool)

Inventory levels, order status, real-time pricing — not indexed as documents but called as a live tool. This is the key architectural decision that prevents stale answers.

tools = [
    VertexAISearchTool(index="retail_policy_index"),
    VertexAISearchTool(index="retail_product_index"),
    BigQueryTool(query_template=INVENTORY_QUERY)  # live call, not indexed
]

Query Decomposition in Action

Customer query: "Can I return the 40V battery I bought online at a store, and is it in stock at the Cumming, GA location?"

Agent Plan:
  Sub-query A  Policy Index: "online purchase battery return policy in-store"
  Sub-query B  BigQuery Tool: SELECT units_available 
                               FROM inventory_snapshot 
                               WHERE sku_id='SKU-4471' AND store='GA-CUMMING'

Agent validates Sub-query A: relevance score > 0.82 threshold ✅

Agent validates Sub-query B: live data, timestamp 2 minutes ago ✅
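
Those two guards can be sketched as follows; the 0.82 threshold comes from the example above, while the ten-minute freshness window is an assumption:

```python
from datetime import datetime, timedelta, timezone

RELEVANCE_THRESHOLD = 0.82          # from the validation example above
MAX_STALENESS = timedelta(minutes=10)  # assumed freshness window

def validate_retrieval(relevance_score: float) -> bool:
    """Sub-query A guard: reject low-confidence index hits."""
    return relevance_score > RELEVANCE_THRESHOLD

def validate_freshness(data_ts: datetime, now: datetime) -> bool:
    """Sub-query B guard: live data must be recent enough to quote."""
    return now - data_ts <= MAX_STALENESS

now = datetime(2026, 3, 22, 14, 7, tzinfo=timezone.utc)
print(validate_retrieval(0.91))                             # → True
print(validate_freshness(now - timedelta(minutes=2), now))  # → True
```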

Synthesized answer:

"Yes — online purchases can be returned in-store within 90 days (Policy §3.2). 
The 40V battery (SKU-4471) shows 3 units in stock at Cumming, GA 
as of 14:07 EST today."

Every fact is cited. No hallucination. No "please check the website."

The Self-Correction Loop

MAX_RETRIES = 3
original_query = query  # keep the user's wording for the escalation path

for attempt in range(MAX_RETRIES):
    result = vertex_search.retrieve(query, index=index_id)

    if result.confidence_score >= THRESHOLD:
        return result

    # Reformulate: broaden scope, try synonyms, switch retrieval mode
    query = agent.reformulate(query, attempt)

# After max retries: escalate to human agent queue
escalate_to_human(original_query)

This loop means your system knows what it does not know — and routes accordingly.


🔗 How the Three Layers Connect

The platform is unified, not assembled. Here is how data and events flow across all three layers in a single customer session:

1. Customer browses → Pub/Sub event → Personalization Engine 
                      surfaces relevant products (Layer 1)

2. Inventory drops below threshold → Pub/Sub alert → 
   Orchestrator Agent dispatches reorder across 3 specialist 
   agents in parallel (Layer 2)

3. Customer asks: "Is this in stock?" → Customer Agent (Layer 2) 
   → Agentic RAG (Layer 3) queries BigQuery live + policy index
   → grounded, cited answer in < 2s

4. All events → BigQuery agent_decision_log + interaction_log
   → weekly eval reports + model retraining for Layers 1 & 3

The feedback loop is the platform. Every interaction trains the next version of every model.


📊 Observability — One Dashboard, Three Layers

All three layers write to BigQuery. One Looker Studio dashboard covers the full platform:

| Metric | Layer | Source Table |
|---|---|---|
| Recommendation CTR by segment | Personalization | interaction_log |
| ANN retrieval latency p99 | Personalization | serving_metrics |
| Agent task parallelism ratio | Multi-Agent | agent_decision_log |
| Reorder decision accuracy | Multi-Agent | agent_decision_log |
| RAG retrieval precision@5 | Agentic RAG | agent_query_log |
| Re-query rate | Agentic RAG | agent_query_log |

When retrieval precision drops, you know before customers notice.


🚀 Where to Start

Don't try to ship all three layers at once. Here is a proven sequencing:

Week 1–4: Lay the data foundation

  • Set up BigQuery tables: inventory_snapshot, interaction_log, agent_decision_log
  • Stand up Pub/Sub topics and Dataflow streaming job
  • This infrastructure is shared by all three layers — do it once, use it everywhere

Week 5–8: Ship Personalization (Layer 1)

  • Train a two-tower model on BigQuery interaction history
  • Index item embeddings into Vertex AI Matching Engine
  • Wire up Cloud Run serving API
  • Measure: recommendation CTR vs. batch baseline

Week 9–12: Add Multi-Agent Ops (Layer 2)

  • Start with two agents: Inventory + Pricing
  • Orchestrator on Vertex AI Reasoning Engine
  • Add Supplier Agent once the first two are stable

Week 13–16: Add Agentic RAG (Layer 3)

  • Index return policy + product catalog into Vertex AI Search
  • Wire the BigQuery inventory tool into the agent
  • Deploy as the Customer Agent inside your multi-agent system

The Pub/Sub bus means each new layer plugs in without touching what already works.


💡 Key Takeaways

  • Share infrastructure, not code. BigQuery and Pub/Sub serve all three layers. Build them once.
  • The Customer Agent IS Agentic RAG. Don't build these as separate projects.
  • The agent_decision_log is your most valuable table. It is your audit trail, your eval dataset, and your retraining signal.
  • Personalization cold start is solved by context, not history. Device + time + location gets you 80% of the way there for new users.
  • Hybrid retrieval beats pure vector search for retail. BM25 handles part numbers and model codes that semantic search misses.
