Technical Setup for GEO

Content and schema matter only if AI crawlers can actually access, render, and index your site correctly. This page covers the technical configuration layer of GEO: rendering, crawlability, AI bot access, and product manifest discovery.


1. Rendering — Schema Must Be in the Initial HTML

AI crawlers may not execute JavaScript, and even those that do may not wait for async rendering to complete. If your structured data or key content is injected after page load by JavaScript, it may be invisible to AI systems.

Server-Side Rendering (SSR) Requirements

| Content Type | Must be SSR | Notes |
| --- | --- | --- |
| JSON-LD schema blocks | Yes | Never inject via useEffect or async JS |
| Page title and meta description | Yes | Required for AI snippet extraction |
| Product name, price, availability | Yes | Core e-commerce data AI systems need |
| FAQ section content | Yes | Must be in initial HTML for FAQ schema to work |
| Review and rating content | Yes | Aggregate ratings must be server-rendered |
| Breadcrumb navigation | Yes | Schema and visible crumbs must match |
| Open Graph / Twitter Card tags | Yes | Used by social AI and link previews |

What AI Crawlers May Skip

  • Content loaded via lazy loading (IntersectionObserver) if not pre-rendered
  • Data fetched from APIs on the client side (prices from a headless API, for example)
  • Content inside <iframe> elements
  • Content gated behind login or cookie consent walls
  • Schema injected via Google Tag Manager (GTM is not reliable for AI crawlers)
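
A quick way to test the points above is to look at the raw HTML response, before any JavaScript runs, and see which JSON-LD blocks are already there. The following is a minimal sketch using only the Python standard library; the class and function names are illustrative, not part of any crawler's tooling.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks from raw HTML."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped rather than reported


def json_ld_types(raw_html):
    """Return the @type of every JSON-LD object found in the unrendered HTML."""
    parser = JsonLdExtractor()
    parser.feed(raw_html)
    parser.close()
    return [block.get("@type") for block in parser.blocks if isinstance(block, dict)]
```

To run it against a live page, fetch the HTML with urllib.request.urlopen(url).read().decode() and pass the result to json_ld_types. An empty list for a product page suggests the schema is being injected client-side.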

For React / Headless Platforms

Use SSR or static site generation (SSG) frameworks:

  • Next.js: use getServerSideProps or getStaticProps for product data (Pages Router), or Server Components (App Router)
  • Gatsby: static generation of product pages at build time
  • Nuxt.js: server-side rendering mode
  • WooCommerce: standard WordPress PHP rendering is SSR by default
  • BigCommerce: Stencil themes render server-side; ensure schema is in {{inject}} calls, not post-load JS
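
Whatever the framework, the goal is the same: the schema is serialized on the server and written into the HTML string before the response is sent. The sketch below shows that idea in plain Python rather than in any of the frameworks listed; render_json_ld and render_product_page are hypothetical helpers, and the product dictionary shape is an assumption.

```python
import html
import json


def render_json_ld(data):
    """Serialize a schema object into a JSON-LD script tag on the server."""
    payload = json.dumps(data, ensure_ascii=False)
    # Escape "</" so a value such as "</script>" cannot close the tag early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


def render_product_page(product):
    """Build the full HTML response with schema and key content already present."""
    schema = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": product["name"],
        "sku": product["sku"],
        "offers": {
            "@type": "Offer",
            "price": product["price"],
            "priceCurrency": product["currency"],
            "availability": "https://schema.org/" + product["availability"],
        },
    }
    name = html.escape(product["name"])
    return (
        "<!doctype html><html><head>"
        f"<title>{name}</title>"
        f"{render_json_ld(schema)}"
        "</head><body>"
        f"<h1>{name}</h1>"
        "</body></html>"
    )
```

Because the script tag is part of the returned string, a crawler that never executes JavaScript still receives the schema.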

2. AI Bot Access — Do Not Block AI Crawlers

Many sites unintentionally block AI crawlers through WAF rules, bot protection settings, or overly aggressive robots.txt configurations. Human visitors never notice the block, but a crawler that is turned away cannot read the site at all, which removes it from GEO consideration for that system.

Whitelist These AI Crawler User Agents

Ensure these user agents are explicitly allowed in your WAF, CDN, and robots.txt:

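| Bot Name | User Agent | System |
| --- | --- | --- |
| Google's search crawler | Googlebot | Google Search, AI Overviews, Shopping Graph |
| Google's AI control token | Google-Extended | Controls use of content for Google Gemini; a robots.txt token only, not a separate HTTP user agent |
| OpenAI's crawler | GPTBot | ChatGPT, OpenAI systems |
| OpenAI's user-triggered fetcher | ChatGPT-User | Page fetches made on behalf of ChatGPT users |
| Anthropic's crawler | ClaudeBot | Claude AI |
| Perplexity's crawler | PerplexityBot | Perplexity AI |
| Meta's crawler | Meta-ExternalAgent | Meta AI |
| Common Crawl | CCBot | Used by many AI training datasets |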

robots.txt Configuration

User-agent: *
Allow: /

# Explicitly allow all major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://www.example.com/sitemap.xml

# AI resource declarations (emerging conventions, not part of the robots.txt
# standard; parsers that do not recognize these lines ignore them)
LLMs: https://www.example.com/llms.txt
Manifest: https://www.example.com/products-manifest.json
MCP: https://www.example.com/mcp

If your site has an MCP server, also add the following lines inside each User-agent group that should be able to reach it:

Allow: /mcp
Allow: /.well-known/mcp.json

For full MCP discovery setup — including .well-known/mcp.json, llms.txt references, registry submission, and tool descriptions — see the dedicated MCP Discovery page.
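
To confirm that a robots.txt file really permits the agents listed above, the rules can be evaluated offline with Python's standard urllib.robotparser. This is a minimal sketch; blocked_agents and the AI_AGENTS list are illustrative names. Python's parser applies rules in file order, so its verdict may differ from crawlers that use longest-match precedence on complex files.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the table above; extend as needed.
AI_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Meta-ExternalAgent",
    "CCBot",
]


def blocked_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Return the agents that the given robots.txt text forbids from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]
```

Pass in the text of your live robots.txt and a representative product path. An empty list means none of the listed agents is disallowed for that path.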

Cloudflare and WAF Settings

If you use Cloudflare or another WAF:

  1. Cloudflare Bot Fight Mode: if enabled, it may challenge or block legitimate AI crawlers. Audit your Bot Analytics dashboard and create explicit allow rules for AI bot user agents.
  2. Security Level settings: "High" or "I'm Under Attack" modes can block crawlers. Use targeted WAF rules instead of raising the global security level.
  3. Rate limiting rules: do not apply rate limits to the user agents listed above.
  4. Challenge pages: AI crawlers cannot solve CAPTCHA or JS challenges, so a challenged request never reaches your content. Create bypass rules for known AI bots.
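
Alongside the dashboard audit, origin access logs can show how AI crawlers are being answered. The sketch below assumes logs in the common "combined" format and counts response codes per bot; the function name and token list are illustrative. Requests blocked at the CDN edge never reach the origin, so a bot that is missing from the result entirely is also worth investigating.

```python
import re
from collections import Counter, defaultdict

AI_BOT_TOKENS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Meta-ExternalAgent",
    "CCBot",
]

# Matches: "METHOD path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def status_counts_by_bot(log_lines):
    """Map each AI bot token to a Counter of HTTP status codes it received."""
    counts = defaultdict(Counter)
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # line is not in combined log format
        agent = match.group("agent").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in agent:
                counts[token][int(match.group("status"))] += 1
                break
    return counts
```

A high share of 403 or 429 responses for one token points to a WAF or rate limiting rule that needs an exemption.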

Cloudflare WAF Rule for AI Bots

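(http.user_agent contains "GPTBot" or
http.user_agent contains "ClaudeBot" or
http.user_agent contains "PerplexityBot") → Skip → All WAF Rules

Google-Extended is left out of the expression because it is a robots.txt token and never appears in a request's User-Agent header. User-agent strings can be spoofed, so where a bot operator publishes its IP ranges, combine this rule with an IP check.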

3. Product Manifest Discovery

A product manifest is a machine-readable JSON file at a stable URL that lists your entire product catalog with structured metadata. It is an emerging convention for AI product discovery rather than a formal standard; the aim is to give AI commerce systems such as ChatGPT Search and Perplexity a single structured source for your catalog.

Manifest File Structure

Host at: https://www.example.com/products-manifest.json

{
  "version": "1.0",
  "generated": "2026-05-28T00:00:00Z",
  "site": "https://www.example.com",
  "totalProducts": 847,
  "products": [
    {
      "sku": "YAM-5S-GP",
      "gtin": "047925210175",
      "name": "Yamamoto 5\" Senko — Green Pumpkin",
      "brand": "Yamamoto",
      "category": "Soft Plastic Fishing Baits",
      "description": "Salt-impregnated soft plastic stick bait. 5 inches. 10 per pack.",
      "url": "https://www.example.com/product/5-senko/?color=green-pumpkin",
      "canonicalUrl": "https://www.example.com/product/5-senko/",
      "imageUrl": "https://www.example.com/images/5-senko-green-pumpkin.jpg",
      "price": "7.99",
      "currency": "USD",
      "availability": "InStock",
      "color": "Green Pumpkin",
      "aggregateRating": {
        "ratingValue": 4.8,
        "reviewCount": 523
      },
      "additionalProperties": {
        "length": "5 inches",
        "material": "Salt-impregnated soft plastic",
        "targetSpecies": "Largemouth bass, Smallmouth bass",
        "quantityPerPack": 10
      }
    }
  ]
}

High-value fields to include:

| Field | GEO Value |
| --- | --- |
| sku | Entity identifier |
| gtin | Highest-confidence product identifier for Shopping Graph |
| brand | Brand entity association |
| category | Topical classification |
| canonicalUrl | Stable entity URL |
| availability | Real-time commerce data |
| aggregateRating | Trust and recommendation signal |
| additionalProperties | Technical attribute data for comparison queries |
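
Before publishing, it can help to check every product entry against the fields above. The following is a minimal validation sketch. The split into required fields is an assumption made for illustration (the manifest format described here has no formal schema), and manifest_problems is a hypothetical helper name.

```python
# Assumed for this sketch: which fields are treated as mandatory.
REQUIRED_FIELDS = [
    "sku",
    "name",
    "brand",
    "category",
    "canonicalUrl",
    "price",
    "currency",
    "availability",
]


def manifest_problems(manifest):
    """Return human-readable problems found in a parsed manifest dictionary."""
    problems = []
    seen_skus = set()
    for index, product in enumerate(manifest.get("products", [])):
        label = product.get("sku") or f"products[{index}]"
        for field in REQUIRED_FIELDS:
            if product.get(field) in (None, "", {}):
                problems.append(f"{label}: missing {field}")
        sku = product.get("sku")
        if sku and sku in seen_skus:
            problems.append(f"{label}: duplicate sku")
        seen_skus.add(sku)
        canonical = product.get("canonicalUrl") or ""
        if canonical and not canonical.startswith("https://"):
            problems.append(f"{label}: canonicalUrl is not HTTPS")
    return problems
```

Load the file with json.load and pass the result in. An empty list means every entry carries the assumed required fields.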

Making the Manifest Discoverable

Apply all five of the following discovery tactics:

1. Stable public URL with correct MIME type

The manifest must return HTTP 200 with Content-Type: application/json. No authentication. No redirect chains. Target response time under 1 second.
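
These requirements can be checked with a short script. The sketch below uses only the standard library; evaluate_manifest_response holds the rules from the paragraph above, and check_manifest is a hypothetical wrapper that performs the request. urllib follows redirects silently, so a redirect is detected by comparing the final URL with the requested one.

```python
import time
import urllib.request


def evaluate_manifest_response(requested_url, final_url, status, content_type, seconds):
    """Apply the manifest hosting rules and return a list of violations."""
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    if final_url != requested_url:
        problems.append(f"redirected to {final_url}")
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json":
        problems.append(f"Content-Type is {media_type or 'missing'}")
    if seconds > 1.0:
        problems.append(f"response took {seconds:.2f}s, target is under 1s")
    return problems


def check_manifest(url):
    """Fetch the manifest and evaluate it. Raises HTTPError on 4xx/5xx responses."""
    started = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
        seconds = time.monotonic() - started
        return evaluate_manifest_response(
            url,
            response.geturl(),
            response.status,
            response.headers.get("Content-Type", ""),
            seconds,
        )
```

Calling check_manifest("https://www.example.com/products-manifest.json") returns an empty list when all four conditions hold.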

2. Homepage <head> link reference

<link rel="alternate"
type="application/json"
href="/products-manifest.json"
title="Product Catalog Manifest">

3. robots.txt manifest declaration

Manifest: https://www.example.com/products-manifest.json

4. XML Sitemap inclusion

<url>
<loc>https://www.example.com/products-manifest.json</loc>
<lastmod>2026-05-28</lastmod>
<changefreq>daily</changefreq>
</url>

Once the manifest URL is in your sitemap, submit the sitemap to Google Search Console so that Google discovers the manifest URL: Search Console → Sitemaps → Add a new sitemap → enter your sitemap URL and click Submit. This is the most direct way to bring the manifest URL to Google's attention, although submission does not by itself guarantee indexing. To prioritize the manifest URL specifically, you can also inspect it with the URL Inspection tool in Search Console and request indexing.

5. Product page <head> back-reference

Add to every product page <head>:

<link rel="catalog" href="/products-manifest.json">

AI Discovery Landing Page

Create a dedicated page at /ai-catalog that serves as a machine-readable control portal. Include:

  • What the product manifest contains
  • How often it is refreshed
  • Links to your sitemap, manifest, and structured data examples
  • Schema documentation for your product types
  • Contact information for AI indexing inquiries

<!-- /ai-catalog -->
<h1>AI Product Catalog</h1>
<p>This page provides machine-readable access to the Example Baits product catalog.</p>
<ul>
<li><a href="/products-manifest.json">Product Manifest (JSON)</a> — Updated daily</li>
<li><a href="/sitemap.xml">XML Sitemap</a></li>
<li>Total products: 847</li>
<li>Schema used: schema.org/ProductGroup, schema.org/Product</li>
</ul>

4. XML Sitemap Requirements

Your sitemap must be current, accurate, and discoverable. Outdated sitemaps cause AI crawlers to miss new pages or waste crawl budget on deleted ones.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/product/5-senko/</loc>
<lastmod>2026-05-20</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>

Best practices:

  • Submit the sitemap to Google Search Console; other crawlers, including AI crawlers, discover it through the Sitemap: line in robots.txt
  • Keep lastmod accurate, because incorrect dates reduce crawler trust
  • Use a sitemap index file if you have more than 50,000 URLs
  • Include image sitemaps for product photography
  • Reference in robots.txt: Sitemap: https://www.example.com/sitemap.xml
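
Generating the sitemap from your product data, rather than editing it by hand, is one way to keep lastmod accurate. The sketch below builds a urlset with the standard library; build_sitemap is a hypothetical helper, and the entries are assumed to be (URL, last-modified date) pairs from your own catalog.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000


def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod) pairs, where lastmod is a date."""
    if len(entries) > MAX_URLS_PER_SITEMAP:
        raise ValueError("too many URLs for one file; split and use a sitemap index")
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap(
    [("https://www.example.com/product/5-senko/", date(2026, 5, 20))]
)
```

Taking lastmod from the record's real modification date, not the build time, keeps the value meaningful to crawlers.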

5. Core Web Vitals and Page Speed

While CWV are primarily a traditional SEO signal, slow pages harm GEO in two ways:

  1. AI crawlers with time budgets may abandon slow pages mid-crawl
  2. Poor user experience reduces brand sentiment signals that AI systems aggregate

GEO-relevant performance targets:

| Metric | Target | Why It Matters for GEO |
| --- | --- | --- |
| Largest Contentful Paint (LCP) | < 2.5s | Ensures content is visible before crawlers time out |
| Time to First Byte (TTFB) | < 600ms | Crawlers measure server responsiveness |
| Cumulative Layout Shift (CLS) | < 0.1 | Content stability affects what the crawler captures |
| Total Blocking Time (TBT) | < 200ms | Blocked JS delays schema rendering |

6. HTTPS and Canonical URLs

AI systems use canonical URLs (the rel="canonical" link on each page and the canonicalUrl field in the manifest) to deduplicate entities and build authoritative entity graphs.

Requirements:

  • All pages must be served over HTTPS
  • <link rel="canonical"> must be present on every page
  • The url value in your JSON-LD schema must match the canonical URL exactly, and each @id must use the canonical URL as its base (a fragment such as #productgroup may follow)
  • The canonical URL must resolve directly with HTTP 200, with no redirect between it and the page URL
  • Consistent trailing slash usage (choose one, never mix)

<!-- In <head> -->
<link rel="canonical" href="https://www.example.com/product/5-senko/" />

// In JSON-LD: url matches exactly; @id adds only a fragment
"@id": "https://www.example.com/product/5-senko/#productgroup",
"url": "https://www.example.com/product/5-senko/"

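The canonical and @id requirement above can be verified per page by comparing the canonical link with each JSON-LD @id once its fragment is removed. This is a minimal sketch with the standard library; the class and function names are illustrative, and it only inspects top-level @id values.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urldefrag


class CanonicalAndSchema(HTMLParser):
    """Collects the canonical link and top-level JSON-LD @id values from HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.schema_ids = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                data = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed JSON-LD is ignored in this sketch
            if isinstance(data, dict) and "@id" in data:
                self.schema_ids.append(data["@id"])


def canonical_mismatches(raw_html):
    """Return @id values whose URL, minus fragment, differs from the canonical."""
    parser = CanonicalAndSchema()
    parser.feed(raw_html)
    parser.close()
    return [i for i in parser.schema_ids if urldefrag(i).url != parser.canonical]
```

A non-empty result usually comes down to a trailing slash, protocol, or host difference between the two values.
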
Technical GEO Audit Checklist

Run this audit quarterly to verify your technical GEO foundation is sound:

Rendering:

  • All JSON-LD schema is in the initial HTML response (not injected by JS)
  • Key product data (name, price, availability) is server-rendered
  • FAQ content is server-rendered

Bot Access:

  • robots.txt explicitly allows GPTBot, ClaudeBot, PerplexityBot, Google-Extended
  • Cloudflare Bot Fight Mode is not blocking AI crawlers
  • WAF rules do not apply to known AI crawler user agents
  • Rate limiting exemptions exist for AI crawlers

Manifests and Discovery:

  • products-manifest.json returns HTTP 200 with Content-Type: application/json
  • Manifest is referenced from homepage <head>
  • Manifest is listed in robots.txt
  • Manifest URL is in the XML sitemap
  • Sitemap containing the manifest URL has been submitted to Google Search Console
  • Manifest URL has been inspected and indexing requested in Google Search Console URL Inspection tool
  • /ai-catalog landing page exists and is current
  • llms.txt exists and describes AI-accessible resources
  • robots.txt includes LLMs:, Manifest:, and MCP: directives (where applicable)

MCP Server (if applicable):

  • MCP endpoint is hosted on the primary domain (not a third-party subdomain)
  • .well-known/mcp.json returns HTTP 200 with Content-Type: application/json
  • Homepage <head> includes <link rel="service"> to MCP endpoint
  • llms.txt references the MCP endpoint with capability descriptions
  • robots.txt allows /mcp and declares MCP: directive
  • MCP landing page (/mcp-info or /ai-tools) exists and is crawlable
  • Submitted to Smithery.ai, Glama, PulseMCP, MCP.so
  • Every tool has a 50+ word description with example inputs and usage guidance
  • See full checklist: MCP Discovery

URLs and Canonicalization:

  • All pages have <link rel="canonical">
  • Canonical URLs match the url values and the base of the @id values in JSON-LD
  • Sitemap lastmod dates are accurate
  • Sitemap is submitted in Google Search Console