Technical Setup for GEO

Content and schema matter only if AI crawlers can actually access, render, and index your site correctly. This page covers the technical configuration layer of GEO: rendering, crawlability, AI bot access, and product manifest discovery.

1. Rendering — Schema Must Be in the Initial HTML

AI crawlers may not execute JavaScript, and even those that do may not wait for async rendering to complete. If your structured data or key content is injected after page load by JavaScript, it may be invisible to AI systems.

Server-Side Rendering (SSR) Requirements

Content Type	Must be SSR	Notes
JSON-LD schema blocks	Yes	Never inject via `useEffect` or async JS
Page title and meta description	Yes	Required for AI snippet extraction
Product name, price, availability	Yes	Core e-commerce data AI systems need
FAQ section content	Yes	Must be in initial HTML for FAQ schema to work
Review and rating content	Yes	Aggregate ratings must be server-rendered
Breadcrumb navigation	Yes	Schema and visible crumbs must match
Open Graph / Twitter Card tags	Yes	Used by social AI and link previews

What AI Crawlers May Skip

Content loaded via lazy loading (IntersectionObserver) if not pre-rendered
Data fetched from APIs on the client side (prices from a headless API, for example)
Content inside <iframe> elements
Content gated behind login or cookie consent walls
Schema injected via Google Tag Manager (GTM runs client-side, so crawlers that skip JavaScript never see it)
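One way to verify the rendering requirement is to inspect the raw HTML response, before any JavaScript runs, and confirm that it already contains JSON-LD. The sketch below is a minimal check using only the Python standard library. The helper names (`JsonLdExtractor`, `jsonld_in_initial_html`, `fetch_raw_html`) and the `geo-audit/0.1` user agent string are assumptions made for illustration, not part of any crawler's behavior.

```python
import json
import urllib.request
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def jsonld_in_initial_html(html):
    """Return the JSON-LD objects present in raw, un-rendered HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    parsed = []
    for block in parser.blocks:
        try:
            parsed.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # a malformed block is treated as absent
    return parsed


def fetch_raw_html(url, user_agent="geo-audit/0.1"):
    """Fetch a page without executing JavaScript (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

If `jsonld_in_initial_html(fetch_raw_html(url))` returns an empty list for a page whose rendered DOM shows schema in the browser, the schema is most likely being injected client-side.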

For React / Headless Platforms

Use SSR or static site generation (SSG) frameworks:

Next.js: use getServerSideProps or getStaticProps (Pages Router), or Server Components (App Router), to render product data on the server
Gatsby: static generation of product pages at build time
Nuxt.js: server-side rendering mode
WooCommerce: standard WordPress PHP rendering is SSR by default
BigCommerce: Stencil themes render server-side; ensure schema is output by the theme templates themselves, not added by post-load JS

2. AI Bot Access — Do Not Block AI Crawlers

Many sites unintentionally block AI crawlers through WAF rules, bot protection settings, or overly aggressive robots.txt configurations. The block is invisible to human visitors, but a crawler that cannot fetch your pages cannot surface them in AI answers.

Whitelist These AI Crawler User Agents

Ensure these user agents are explicitly allowed in your WAF, CDN, and robots.txt:

Bot Name	User Agent	System
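Google's search crawler	`Googlebot`	Google Search, AI Overviews, Shopping Graph
Google's AI control token	`Google-Extended`	Google Gemini training data (robots.txt token only; Google crawls with its regular user agents, so this string never appears in a request)
OpenAI's crawler	`GPTBot`	ChatGPT, OpenAI systems
OpenAI's user-triggered fetcher	`ChatGPT-User`	Pages fetched on behalf of ChatGPT users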
Anthropic's crawler	`ClaudeBot`	Claude AI
Perplexity's crawler	`PerplexityBot`	Perplexity AI
Meta's crawler	`Meta-ExternalAgent`	Meta AI
Common Crawl	`CCBot`	Used by many AI training datasets

`robots.txt` Configuration

User-agent: *
Allow: /

# Explicitly allow all major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://www.example.com/sitemap.xml

# AI resource declarations (nonstandard; parsers that do not recognize these lines ignore them)
LLMs: https://www.example.com/llms.txt
Manifest: https://www.example.com/products-manifest.json
MCP: https://www.example.com/mcp

If your site has an MCP server, also add these lines inside the relevant User-agent group:

Allow: /mcp
Allow: /.well-known/mcp.json

For full MCP discovery setup — including .well-known/mcp.json, llms.txt references, registry submission, and tool descriptions — see the dedicated MCP Discovery page.
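A robots.txt file can be tested offline before deployment. The sketch below uses Python's standard `urllib.robotparser` to report which AI user agents a given robots.txt would block for a URL. The `AI_USER_AGENTS` list mirrors the table above, and the function name `blocked_ai_agents` is an assumption made for illustration. Individual crawlers may interpret edge cases differently from this parser, so treat the result as a first check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# User agent tokens taken from the table on this page.
AI_USER_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Meta-ExternalAgent",
    "CCBot",
]


def blocked_ai_agents(robots_txt, url, agents=AI_USER_AGENTS):
    """Return the agents that this robots.txt content forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    sample = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""
    print(blocked_ai_agents(sample, "https://www.example.com/product/5-senko/"))
```

An empty list means none of the listed agents is blocked by robots.txt. It says nothing about WAF or CDN rules, which need a separate check.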

Cloudflare and WAF Settings

If you use Cloudflare or another WAF:

Cloudflare Bot Fight Mode: if enabled, it may challenge or block legitimate AI crawlers. Audit your Bot Analytics dashboard and create explicit allow rules for AI bot user agents.
Security Level settings: "High" or "I'm Under Attack" modes can block crawlers. Use targeted WAF rules instead of raising the global security level.
Rate limiting rules: exempt the user agents listed above from rate limits.
Challenge pages: AI crawlers cannot solve CAPTCHA or JavaScript challenges, so a challenged request never reaches your content. Create bypass rules for known AI bots.

Cloudflare WAF Rule for AI Bots

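Expression:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "PerplexityBot")

Action: Skip → All WAF Rules

Google-Extended is not part of the expression because it is a robots.txt token only. Google crawls with its regular user agents, so no request carries that string in its User-Agent header.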
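If the list of allowed bots changes often, the rule expression can be generated from a single list rather than edited by hand. The sketch below builds an expression from user agent substrings using the `http.user_agent` field and the `contains` operator shown above. The function name `build_allow_expression` is hypothetical, and the output still has to be pasted into (or sent to) your WAF configuration separately.

```python
def build_allow_expression(user_agent_tokens):
    """Join user agent substrings into one 'contains ... or contains ...' expression."""
    if not user_agent_tokens:
        raise ValueError("at least one user agent token is required")
    clauses = []
    for token in user_agent_tokens:
        # Reject characters that would break out of the quoted string.
        if '"' in token or "\\" in token:
            raise ValueError(f"unsupported character in token: {token!r}")
        clauses.append(f'(http.user_agent contains "{token}")')
    return " or ".join(clauses)


if __name__ == "__main__":
    print(build_allow_expression(["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

User agent strings can be spoofed, so a rule built this way trusts whatever the client claims to be. Where your plan offers verified-bot or IP-based matching, consider combining it with the user agent check.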

3. Product Manifest Discovery

A product manifest is a machine-readable JSON file at a stable URL that lists your entire product catalog with structured metadata. It is an emerging convention (not yet a formal standard) for AI product discovery that ChatGPT Search, Perplexity, and other AI commerce systems are beginning to look for.

Manifest File Structure

Host at: https://www.example.com/products-manifest.json

{
  "version": "1.0",
  "generated": "2026-05-28T00:00:00Z",
  "site": "https://www.example.com",
  "totalProducts": 847,
  "products": [
    {
      "sku": "YAM-5S-GP",
      "gtin": "047925210175",
      "name": "Yamamoto 5\" Senko — Green Pumpkin",
      "brand": "Yamamoto",
      "category": "Soft Plastic Fishing Baits",
      "description": "Salt-impregnated soft plastic stick bait. 5 inches. 10 per pack.",
      "url": "https://www.example.com/product/5-senko/?color=green-pumpkin",
      "canonicalUrl": "https://www.example.com/product/5-senko/",
      "imageUrl": "https://www.example.com/images/5-senko-green-pumpkin.jpg",
      "price": "7.99",
      "currency": "USD",
      "availability": "InStock",
      "color": "Green Pumpkin",
      "aggregateRating": {
        "ratingValue": 4.8,
        "reviewCount": 523
      },
      "additionalProperties": {
        "length": "5 inches",
        "material": "Salt-impregnated soft plastic",
        "targetSpecies": "Largemouth bass, Smallmouth bass",
        "quantityPerPack": 10
      }
    }
  ]
}

High-value fields to include:

Field	GEO Value
`sku`	Entity identifier
`gtin`	Highest-confidence product identifier for Shopping Graph
`brand`	Brand entity association
`category`	Topical classification
`canonicalUrl`	Stable entity URL
`availability`	Real-time commerce data
`aggregateRating`	Trust and recommendation signal
`additionalProperties`	Technical attribute data for comparison queries
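A manifest is easier to keep accurate when it is generated from catalog data rather than edited by hand. The sketch below assembles the top-level structure shown above and derives `totalProducts` from the product list. The `REQUIRED_FIELDS` tuple and the function name `build_manifest` are assumptions chosen for this example; the manifest format itself does not mandate them.

```python
import json
from datetime import datetime, timezone

# Assumed minimum set of fields per product; adjust to your catalog.
REQUIRED_FIELDS = (
    "sku", "name", "brand", "category",
    "canonicalUrl", "price", "currency", "availability",
)


def build_manifest(site, products, generated=None):
    """Build the manifest dict; generated must be a timezone-aware datetime."""
    for index, product in enumerate(products):
        missing = [field for field in REQUIRED_FIELDS if not product.get(field)]
        if missing:
            raise ValueError(f"product {index} is missing: {', '.join(missing)}")
    generated = (generated or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "version": "1.0",
        "generated": generated.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "site": site,
        "totalProducts": len(products),  # always derived, never typed by hand
        "products": list(products),
    }


if __name__ == "__main__":
    senko = {
        "sku": "YAM-5S-GP",
        "name": "Yamamoto 5\" Senko, Green Pumpkin",
        "brand": "Yamamoto",
        "category": "Soft Plastic Fishing Baits",
        "canonicalUrl": "https://www.example.com/product/5-senko/",
        "price": "7.99",
        "currency": "USD",
        "availability": "InStock",
    }
    print(json.dumps(build_manifest("https://www.example.com", [senko]), indent=2))
```

Failing fast on a missing field keeps incomplete products out of the published file, which is usually preferable to shipping entries an AI system cannot match to an entity.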

Making the Manifest Discoverable

Apply all five of the following discovery tactics:

1. Stable public URL with correct MIME type

The manifest must return HTTP 200 with Content-Type: application/json. No authentication. No redirect chains. Target response time under 1 second.
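The requirements above can be checked with a short script. The sketch below separates the decision logic (`evaluate_manifest_response`) from the network call (`check_manifest`) so the rules can be tested without a live server. Both function names are hypothetical. Note one assumption: this version flags any redirect, which is stricter than the "no redirect chains" wording above.

```python
import time
import urllib.error
import urllib.request


def evaluate_manifest_response(status, content_type, requested_url, final_url, elapsed_seconds):
    """Return a list of problems; an empty list means the response meets the targets."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "application/json":
        problems.append(f"expected Content-Type application/json, got {media_type or 'none'}")
    if final_url != requested_url:
        problems.append(f"redirected to {final_url}")
    if elapsed_seconds >= 1.0:
        problems.append(f"response took {elapsed_seconds:.2f}s (target: under 1s)")
    return problems


def check_manifest(url, timeout=10):
    """Fetch the manifest once and evaluate the response."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            elapsed = time.monotonic() - started
            return evaluate_manifest_response(
                response.status,
                response.headers.get("Content-Type"),
                url,
                response.geturl(),
                elapsed,
            )
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx, including 401/403 from auth walls.
        return [f"expected HTTP 200, got {error.code}"]
```

A single timing sample is noisy, so it is worth running the check several times, and from outside your own network, before drawing conclusions about response time.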

2. Homepage <head> link reference

<link rel="alternate"
      type="application/json"
      href="/products-manifest.json"
      title="Product Catalog Manifest">

3. robots.txt manifest declaration

Manifest: https://www.example.com/products-manifest.json

4. XML Sitemap inclusion

<url>
  <loc>https://www.example.com/products-manifest.json</loc>
  <lastmod>2026-05-28</lastmod>
  <changefreq>daily</changefreq>
</url>

Once the manifest URL is in your sitemap, submit the sitemap to Google Search Console so Google indexes the manifest URL directly. Go to: Search Console → Sitemaps → Add a new sitemap → enter your sitemap URL and click Submit. This is the fastest path to getting the manifest picked up by Google's Shopping Graph. If you want to prioritize the manifest URL specifically, you can also submit it as a standalone URL via the URL Inspection tool in Search Console and request indexing.

5. Product page <head> back-reference

Add to every product page <head>:

<link rel="catalog" href="/products-manifest.json">
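Tactics 2 and 5 both depend on a `<link>` element that is easy to lose during a theme update. The sketch below scans a page's HTML for `<link>` elements whose `href` points at the manifest and returns their `rel` values, so a homepage should yield `alternate` and a product page `catalog`. The names `LinkCollector` and `manifest_link_rels` are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records the attributes of every <link> element in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag == "link":
            self.links.append({name: (value or "") for name, value in attrs})


def manifest_link_rels(html, manifest_path="/products-manifest.json"):
    """Return the rel values of <link> elements that reference the manifest."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [
        link.get("rel", "").lower()
        for link in collector.links
        if link.get("href", "").endswith(manifest_path)
    ]


if __name__ == "__main__":
    page = '<head><link rel="catalog" href="/products-manifest.json"></head>'
    print(manifest_link_rels(page))
```

Matching on the end of `href` accepts both relative and absolute manifest URLs, at the cost of also accepting a manifest hosted on another domain.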

AI Discovery Landing Page

Create a dedicated page at /ai-catalog that documents your machine-readable resources for crawlers and integrators. Include:

What the product manifest contains
How often it is refreshed
Links to your sitemap, manifest, and structured data examples
Schema documentation for your product types
Contact information for AI indexing inquiries

<!-- /ai-catalog -->
<h1>AI Product Catalog</h1>
<p>This page provides machine-readable access to the Example Baits product catalog.</p>
<ul>
  <li><a href="/products-manifest.json">Product Manifest (JSON)</a> — Updated daily</li>
  <li><a href="/sitemap.xml">XML Sitemap</a></li>
  <li>Total products: 847</li>
  <li>Schema used: schema.org/ProductGroup, schema.org/Product</li>
</ul>

4. XML Sitemap Requirements

Your sitemap must be current, accurate, and discoverable. Outdated sitemaps cause AI crawlers to miss new pages or waste crawl budget on deleted ones.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/product/5-senko/</loc>
    <lastmod>2026-05-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Best practices:

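Submit to Google Search Console; other crawlers, including AI crawlers, typically find the sitemap through the Sitemap: line in robots.txt
Keep lastmod accurate; incorrect dates reduce crawler trust
Use a sitemap index file if you have more than 50,000 URLs (or a single sitemap would exceed 50 MB uncompressed)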
Include image sitemaps for product photography
Reference in robots.txt: Sitemap: https://www.example.com/sitemap.xml
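For catalogs that exceed the 50,000-URL limit, the URLs are split across several sitemap files and a sitemap index points at each file. The sketch below shows the splitting and the index generation with the standard library. The child sitemap file names used in the example (`sitemap-products-1.xml` and so on) are hypothetical.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into groups that each fit in one sitemap file."""
    return [urls[start:start + size] for start in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls, lastmod):
    """Return sitemap index XML that references each child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for location in sitemap_urls:
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = location
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    product_urls = [f"https://www.example.com/product/{n}/" for n in range(120_001)]
    chunks = chunk_urls(product_urls)
    children = [
        f"https://www.example.com/sitemap-products-{number}.xml"
        for number in range(1, len(chunks) + 1)
    ]
    print(build_sitemap_index(children, "2026-05-20"))
```

The `lastmod` passed to the index should reflect when each child sitemap actually changed; reusing one date for every entry, as the example does, is a simplification.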

5. Core Web Vitals and Page Speed

While CWV are primarily a traditional SEO signal, slow pages harm GEO in two ways:

AI crawlers with time budgets may abandon slow pages mid-crawl
Poor user experience reduces brand sentiment signals that AI systems aggregate

GEO-relevant performance targets:

Metric	Target	Why It Matters for GEO
Largest Contentful Paint (LCP)	< 2.5s	Ensures content is visible before crawlers time out
Time to First Byte (TTFB)	< 600ms	Crawlers measure server responsiveness
Cumulative Layout Shift (CLS)	< 0.1	Content stability affects what the crawler captures
Total Blocking Time (TBT)	< 200ms	Blocked JS delays schema rendering
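The targets in the table can be encoded once and compared against measurements exported from whatever performance tool you already use. The sketch below assumes measurements arrive as a dict keyed by the hypothetical names `lcp_ms`, `ttfb_ms`, `cls`, and `tbt_ms`; it does not collect the measurements itself.

```python
# Targets from the table above; times are in milliseconds, CLS is unitless.
TARGETS = {
    "lcp_ms": 2500,
    "ttfb_ms": 600,
    "cls": 0.1,
    "tbt_ms": 200,
}


def metrics_over_target(measured):
    """Return the names of measured metrics that fail to stay under their target."""
    return sorted(
        name
        for name, limit in TARGETS.items()
        if name in measured and measured[name] >= limit
    )


if __name__ == "__main__":
    sample = {"lcp_ms": 3100, "ttfb_ms": 420, "cls": 0.05, "tbt_ms": 200}
    print(metrics_over_target(sample))
```

Because every target in the table is a strict "less than", a value exactly equal to the limit is reported as failing.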

6. HTTPS and Canonical URLs

AI systems use canonical URLs (the <link rel="canonical"> value, mirrored by the manifest's canonicalUrl field) to deduplicate entities and build authoritative entity graphs.

Requirements:

All pages must be served over HTTPS
<link rel="canonical"> must be present on every page
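The canonical URL must match the url value in your JSON-LD schema exactly, and the @id value up to its fragment (for example #productgroup)
The canonical URL must resolve directly with HTTP 200, with no redirect between it and the page URL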
Consistent trailing slash usage (choose one, never mix)
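The matching requirements above can be spot-checked per page. The sketch below reads the canonical link and the JSON-LD blocks from a page's HTML, then compares the canonical URL with each top-level `url` value and with each `@id` after its fragment is removed. The names `CanonicalAndJsonLd` and `canonical_mismatches` are hypothetical, and nested nodes (for example inside `@graph`) are deliberately not inspected in this minimal version.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urldefrag


class CanonicalAndJsonLd(HTMLParser):
    """Captures the canonical href and the raw text of JSON-LD script blocks."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.jsonld_blocks = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "link" and attributes.get("rel", "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "script" and attributes.get("type", "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def canonical_mismatches(html):
    """Return a list of problems; an empty list means canonical and JSON-LD agree."""
    parser = CanonicalAndJsonLd()
    parser.feed(html)
    parser.close()
    if not parser.canonical:
        return ['missing <link rel="canonical">']
    problems = []
    for raw in parser.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            problems.append("JSON-LD block is not valid JSON")
            continue
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict):
                continue
            # Compare @id without its fragment, e.g. "#productgroup".
            if "@id" in node and urldefrag(node["@id"]).url != parser.canonical:
                problems.append(f"@id {node['@id']} does not match canonical")
            if "url" in node and node["url"] != parser.canonical:
                problems.append(f"url {node['url']} does not match canonical")
    return problems
```

The comparison is an exact string match, so a missing trailing slash or an `http://` scheme is reported as a mismatch, which is the behavior the requirements call for.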

<!-- In <head> -->
<link rel="canonical" href="https://www.example.com/product/5-senko/" />

// In JSON-LD: url must match the canonical exactly; @id is the same URL plus a fragment
"@id": "https://www.example.com/product/5-senko/#productgroup",
"url": "https://www.example.com/product/5-senko/"

Technical GEO Audit Checklist

Run this audit quarterly to verify your technical GEO foundation is sound:

Rendering:

All JSON-LD schema is in the initial HTML response (not injected by JS)
Key product data (name, price, availability) is server-rendered
FAQ content is server-rendered

Bot Access:

robots.txt explicitly allows GPTBot, ClaudeBot, PerplexityBot, Google-Extended
Cloudflare Bot Fight Mode is not blocking AI crawlers
WAF rules do not apply to known AI crawler user agents
Rate limiting exemptions exist for AI crawlers

Manifests and Discovery:

products-manifest.json returns HTTP 200 with Content-Type: application/json
Manifest is referenced from homepage <head>
Manifest is listed in robots.txt
Manifest URL is in the XML sitemap
Sitemap containing the manifest URL has been submitted to Google Search Console
Manifest URL has been inspected and indexing requested in Google Search Console URL Inspection tool
/ai-catalog landing page exists and is current
llms.txt exists and describes AI-accessible resources
robots.txt includes LLMs:, Manifest:, and MCP: directives (where applicable)

MCP Server (if applicable):

MCP endpoint is hosted on the primary domain (not a third-party subdomain)
.well-known/mcp.json returns HTTP 200 with Content-Type: application/json
Homepage <head> includes <link rel="service"> to MCP endpoint
llms.txt references the MCP endpoint with capability descriptions
robots.txt allows /mcp and declares MCP: directive
MCP landing page (/mcp-info or /ai-tools) exists and is crawlable
Submitted to Smithery.ai, Glama, PulseMCP, MCP.so
Every tool has a 50+ word description with example inputs and usage guidance
See full checklist: MCP Discovery

URLs and Canonicalization:

All pages have <link rel="canonical">
Canonical URLs match @id values in JSON-LD
Sitemap lastmod dates are accurate
Sitemap is submitted in Google Search Console
