Technical Setup for GEO
Content and schema matter only if AI crawlers can actually access, render, and index your site correctly. This page covers the technical configuration layer of GEO: rendering, crawlability, AI bot access, and product manifest discovery.
1. Rendering — Schema Must Be in the Initial HTML
AI crawlers may not execute JavaScript, and even those that do may not wait for async rendering to complete. If your structured data or key content is injected after page load by JavaScript, it may be invisible to AI systems.
Server-Side Rendering (SSR) Requirements
| Content Type | Must be SSR | Notes |
|---|---|---|
| JSON-LD schema blocks | Yes | Never inject via useEffect or async JS |
| Page title and meta description | Yes | Required for AI snippet extraction |
| Product name, price, availability | Yes | Core e-commerce data AI systems need |
| FAQ section content | Yes | Must be in initial HTML for FAQ schema to work |
| Review and rating content | Yes | Aggregate ratings must be server-rendered |
| Breadcrumb navigation | Yes | Schema and visible crumbs must match |
| Open Graph / Twitter Card tags | Yes | Used by social AI and link previews |
What AI Crawlers May Skip
- Content loaded via lazy loading (IntersectionObserver) if not pre-rendered
- Data fetched from APIs on the client side (prices from a headless API, for example)
- Content inside <iframe> elements
- Content gated behind login or cookie consent walls
- Schema injected via Google Tag Manager (GTM is not reliable for AI crawlers)
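One way to see what a non-rendering crawler sees is to fetch the page without executing any JavaScript and look for JSON-LD in the raw response. The sketch below uses only the Python standard library; the function names and the `geo-audit/0.1` user agent are illustrative assumptions, not part of any crawler's behavior.

```python
import json
import urllib.request
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append(json.loads("".join(self._buffer)))


def json_ld_types(raw_html):
    """Return the @type of each top-level JSON-LD object in un-rendered HTML."""
    parser = JsonLdExtractor()
    parser.feed(raw_html)
    parser.close()
    return [b.get("@type") for b in parser.blocks if isinstance(b, dict)]


def fetch_raw_html(url, user_agent="geo-audit/0.1"):
    """Fetch a page the way a non-rendering crawler would: no JavaScript is run."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

If `json_ld_types(fetch_raw_html(url))` comes back empty for a product page that shows schema in the browser's developer tools, the schema is most likely being injected client-side.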
For React / Headless Platforms
Use SSR or static site generation (SSG) frameworks:
- Next.js: Use getServerSideProps or getStaticProps for product data
- Gatsby: Static generation of product pages at build time
- Nuxt.js: Server-side rendering mode
- WooCommerce: Standard WordPress PHP rendering is SSR by default
- BigCommerce: Stencil themes render server-side; ensure schema is in {{inject}} calls, not post-load JS
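Whatever the framework, the underlying pattern is the same: the server builds the schema block and writes it into the HTML response before it is sent. The following is a framework-agnostic sketch in Python, not the API of any platform listed above; `render_json_ld`, `render_product_page`, and the product dictionary keys are hypothetical names chosen for illustration.

```python
import html
import json


def render_json_ld(data):
    """Serialize a dict as a JSON-LD <script> tag for server-rendered HTML."""
    payload = json.dumps(data, ensure_ascii=False)
    # Escape "<" so a value containing "</script>" cannot close the tag early.
    payload = payload.replace("<", "\\u003c")
    return f'<script type="application/ld+json">{payload}</script>'


def render_product_page(product):
    """Build the complete HTML response on the server, schema included."""
    schema = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": product["name"],
        "offers": {
            "@type": "Offer",
            "price": product["price"],
            "priceCurrency": product["currency"],
        },
    }
    name = html.escape(product["name"])
    return (
        "<!doctype html><html><head>"
        f"<title>{name}</title>"
        f"{render_json_ld(schema)}"
        "</head><body>"
        f"<h1>{name}</h1>"
        "</body></html>"
    )
```

Because the schema is part of the string returned by `render_product_page`, it is present in the first byte stream a crawler receives, with no dependency on client-side execution.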
2. AI Bot Access — Do Not Block AI Crawlers
A significant percentage of sites unintentionally block AI crawlers through WAF rules, bot protection settings, or overly aggressive robots.txt configurations. This is invisible to humans but eliminates GEO potential entirely.
Whitelist These AI Crawler User Agents
Ensure these user agents are explicitly allowed in your WAF, CDN, and robots.txt:
| Bot Name | User Agent | System |
|---|---|---|
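| Google's search crawler | Googlebot | Google Search, AI Overviews, Shopping Graph |
| Google's AI usage control | Google-Extended | Gemini model training (robots.txt token only; Google crawls with its regular user agents) |
| OpenAI's crawler | GPTBot | ChatGPT, OpenAI systems |
| OpenAI's search crawler | OAI-SearchBot | ChatGPT search |
| OpenAI's user-triggered fetcher | ChatGPT-User | Page fetches made on behalf of ChatGPT users |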
| Anthropic's crawler | ClaudeBot | Claude AI |
| Perplexity's crawler | PerplexityBot | Perplexity AI |
| Meta's crawler | Meta-ExternalAgent | Meta AI |
| Common Crawl | CCBot | Used by many AI training datasets |
robots.txt Configuration
User-agent: *
Allow: /

# Explicitly allow all major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://www.example.com/sitemap.xml

# AI resource declarations (emerging conventions, not part of the robots.txt standard;
# parsers that do not recognize these lines ignore them)
LLMs: https://www.example.com/llms.txt
Manifest: https://www.example.com/products-manifest.json
MCP: https://www.example.com/mcp
If your site has an MCP server, also add these lines to each User-agent group that should reach it:
Allow: /mcp
Allow: /.well-known/mcp.json
For full MCP discovery setup — including .well-known/mcp.json, llms.txt references, registry submission, and tool descriptions — see the dedicated MCP Discovery page.
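To confirm that a robots.txt file really admits the crawlers listed earlier, it can be parsed with Python's standard `urllib.robotparser`. This is a minimal sketch: `blocked_agents` and the `AI_USER_AGENTS` list are illustrative names, and the standard-library parser's matching may differ in edge cases from how an individual crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# User agent tokens taken from the table in this section.
AI_USER_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Meta-ExternalAgent",
    "CCBot",
]


def blocked_agents(robots_txt, url, agents=AI_USER_AGENTS):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; lines with unknown keys
    # (such as LLMs: or Manifest:) are skipped by this parser.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]
```

An empty result for a representative product URL means none of the listed agents is disallowed by robots.txt. It says nothing about WAF or CDN blocks, which have to be checked separately.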
Cloudflare and WAF Settings
If you use Cloudflare or another WAF:
- Cloudflare Bot Fight Mode: If enabled, may challenge or block legitimate AI crawlers. Audit your Bot Analytics dashboard and create explicit allow rules for AI bot user agents.
- Security Level settings: "High" or "I'm Under Attack" modes can block crawlers. Use WAF rules instead of global security level increases.
- Rate limiting rules: Do not apply rate limits to the user agents listed above.
- Challenge pages: AI crawlers cannot solve CAPTCHA or JS challenges, so they receive the challenge page instead of your content. Create bypass rules for known AI bots.
Cloudflare WAF Rule for AI Bots
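(http.user_agent contains "GPTBot" or
http.user_agent contains "ClaudeBot" or
http.user_agent contains "PerplexityBot") → Skip → All WAF Rules
Google-Extended is left out of this rule because it is a robots.txt token only: Google crawls with its regular Googlebot user agents, so the string never appears in the User-Agent header. User-Agent strings are also easy to spoof, so pair this rule with verified-bot or IP-range checks where your plan supports them.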
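A quick external check for WAF interference is to request a page with an AI crawler's User-Agent string and inspect the response. The sketch below is a heuristic only. The status codes treated as blocks are an assumption, and so is the reliance on Cloudflare's `cf-mitigated: challenge` response header to mark challenge pages. A request from your own IP will also not be treated as a verified bot, so results can differ from what the real crawler receives.

```python
import urllib.error
import urllib.request

# Assumption: these statuses usually indicate a block, challenge, or rate limit.
BLOCK_STATUSES = {401, 403, 429, 503}


def looks_blocked(status, headers):
    """Heuristic: does this response look like a WAF block or challenge page?"""
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    if normalized.get("cf-mitigated") == "challenge":
        return True
    return status in BLOCK_STATUSES


def probe(url, user_agent):
    """Request url with the given User-Agent and return (status, blocked)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            status, headers = response.status, dict(response.headers)
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the error object still carries the response.
        status, headers = error.code, dict(error.headers)
    return status, looks_blocked(status, headers)
```

Running `probe` once with a browser User-Agent and once with a string containing `GPTBot` gives a rough before-and-after comparison when tuning WAF rules.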
3. Product Manifest Discovery
A product manifest is a machine-readable JSON file at a stable URL that lists your entire product catalog with structured metadata. It is an emerging standard for AI product discovery that ChatGPT Search, Perplexity, and other AI commerce systems are beginning to actively look for.
Manifest File Structure
Host at: https://www.example.com/products-manifest.json
{
  "version": "1.0",
  "generated": "2026-05-28T00:00:00Z",
  "site": "https://www.example.com",
  "totalProducts": 847,
  "products": [
    {
      "sku": "YAM-5S-GP",
      "gtin": "047925210175",
      "name": "Yamamoto 5\" Senko — Green Pumpkin",
      "brand": "Yamamoto",
      "category": "Soft Plastic Fishing Baits",
      "description": "Salt-impregnated soft plastic stick bait. 5 inches. 10 per pack.",
      "url": "https://www.example.com/product/5-senko/?color=green-pumpkin",
      "canonicalUrl": "https://www.example.com/product/5-senko/",
      "imageUrl": "https://www.example.com/images/5-senko-green-pumpkin.jpg",
      "price": "7.99",
      "currency": "USD",
      "availability": "InStock",
      "color": "Green Pumpkin",
      "aggregateRating": {
        "ratingValue": 4.8,
        "reviewCount": 523
      },
      "additionalProperties": {
        "length": "5 inches",
        "material": "Salt-impregnated soft plastic",
        "targetSpecies": "Largemouth bass, Smallmouth bass",
        "quantityPerPack": 10
      }
    }
  ]
}
High-value fields to include:
| Field | GEO Value |
|---|---|
| sku | Entity identifier |
| gtin | Highest-confidence product identifier for Shopping Graph |
| brand | Brand entity association |
| category | Topical classification |
| canonicalUrl | Stable entity URL |
| availability | Real-time commerce data |
| aggregateRating | Trust and recommendation signal |
| additionalProperties | Technical attribute data for comparison queries |
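A manifest in this shape is usually generated from catalog data rather than written by hand. The sketch below shows one way that could look. `build_manifest` and the `REQUIRED_FIELDS` tuple are assumptions made for illustration, since the manifest format does not define which fields are mandatory.

```python
import json
from datetime import datetime, timezone

# Assumption: the minimum fields a record needs before it is published.
REQUIRED_FIELDS = (
    "sku", "name", "brand", "canonicalUrl", "price", "currency", "availability",
)


def build_manifest(site, products, generated=None):
    """Assemble a manifest dict in the structure shown above.

    Raises ValueError if any product lacks one of REQUIRED_FIELDS.
    generated should be a timezone-aware datetime; it defaults to now (UTC).
    """
    for product in products:
        missing = [field for field in REQUIRED_FIELDS if not product.get(field)]
        if missing:
            sku = product.get("sku", "<no sku>")
            raise ValueError(f"{sku}: missing {', '.join(missing)}")
    generated = generated or datetime.now(timezone.utc)
    return {
        "version": "1.0",
        "generated": generated.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "site": site,
        "totalProducts": len(products),
        "products": list(products),
    }


def write_manifest(manifest, path):
    """Write the manifest as UTF-8 JSON, keeping non-ASCII product names readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, ensure_ascii=False, indent=2)
```

Deriving `totalProducts` from the list length keeps the count and the contents from drifting apart, and rejecting incomplete records at build time avoids publishing entries that lack a price or canonical URL.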
Making the Manifest Discoverable
Apply all five of the following discovery tactics:
1. Stable public URL with correct MIME type
The manifest must return HTTP 200 with Content-Type: application/json. No authentication. No redirect chains. Target response time under 1 second.
2. Homepage <head> link reference
<link rel="alternate"
type="application/json"
href="/products-manifest.json"
title="Product Catalog Manifest">
3. robots.txt manifest declaration
Manifest: https://www.example.com/products-manifest.json
4. XML Sitemap inclusion
<url>
  <loc>https://www.example.com/products-manifest.json</loc>
  <lastmod>2026-05-28</lastmod>
  <changefreq>daily</changefreq>
</url>
Once the manifest URL is in your sitemap, submit the sitemap to Google Search Console so Google indexes the manifest URL directly. Go to: Search Console → Sitemaps → Add a new sitemap → enter your sitemap URL and click Submit. This is the fastest path to getting the manifest picked up by Google's Shopping Graph. If you want to prioritize the manifest URL specifically, you can also submit it as a standalone URL via the URL Inspection tool in Search Console and request indexing.
5. Product page <head> back-reference
Add to every product page <head>:
<link rel="catalog" href="/products-manifest.json">
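The two `<head>` references (tactics 2 and 5) are easy to lose during template changes, so it can help to verify them automatically. The sketch below parses a page's HTML and reports which `<link>` tags point at the manifest. `LinkCollector` and `manifest_rels` are hypothetical helper names, and the HTML is assumed to have been fetched separately.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Records (rel, absolute href) for every <link> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("href"):
            rel = (attributes.get("rel") or "").lower()
            # Resolve relative hrefs such as /products-manifest.json.
            self.links.append((rel, urljoin(self.base_url, attributes["href"])))


def manifest_rels(page_html, page_url, manifest_url):
    """Return the rel values of <link> tags on the page that point at manifest_url."""
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    collector.close()
    return [rel for rel, href in collector.links if href == manifest_url]
```

For the homepage the expected result contains `alternate`, and for a product page it contains `catalog`. An empty list means the reference is missing from the served HTML.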
AI Discovery Landing Page
Create a dedicated page at /ai-catalog that serves as a machine-readable control portal. Include:
- What the product manifest contains
- How often it is refreshed
- Links to your sitemap, manifest, and structured data examples
- Schema documentation for your product types
- Contact information for AI indexing inquiries
<!-- /ai-catalog -->
<h1>AI Product Catalog</h1>
<p>This page provides machine-readable access to the Example Baits product catalog.</p>
<ul>
<li><a href="/products-manifest.json">Product Manifest (JSON)</a> — Updated daily</li>
<li><a href="/sitemap.xml">XML Sitemap</a></li>
<li>Total products: 847</li>
<li>Schema used: schema.org/ProductGroup, schema.org/Product</li>
</ul>
4. XML Sitemap Requirements
Your sitemap must be current, accurate, and discoverable. Outdated sitemaps cause AI crawlers to miss new pages or waste crawl budget on deleted ones.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/product/5-senko/</loc>
    <lastmod>2026-05-20</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Best practices:
- Submit to Google Search Console so Google picks it up; other AI crawlers typically discover the sitemap through the Sitemap: line in robots.txt
- Keep lastmod accurate: incorrect dates reduce crawler trust
- Use a sitemap index file if you have more than 50,000 URLs
- Include image sitemaps for product photography
- Reference in robots.txt: Sitemap: https://www.example.com/sitemap.xml
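Some lastmod problems can be caught with a simple script. The sketch below flags entries whose lastmod is missing, unparseable, or dated in the future. It cannot tell whether a plausible date is actually true, and it assumes full `YYYY-MM-DD` dates, so the shorter year-only or year-month forms would be reported as malformed. The function name `suspicious_lastmod` is illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def suspicious_lastmod(sitemap_xml, today):
    """Return (loc, reason) pairs for entries with a questionable lastmod."""
    problems = []
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not lastmod:
            problems.append((loc, "missing lastmod"))
            continue
        try:
            # The first 10 characters hold the date part of a full timestamp.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            problems.append((loc, "malformed lastmod"))
            continue
        if modified > today:
            problems.append((loc, "lastmod in the future"))
    return problems
```

Passing `date.today()` as `today` checks a live sitemap; taking the date as a parameter keeps the function deterministic for testing.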
5. Core Web Vitals and Page Speed
While CWV are primarily a traditional SEO signal, slow pages harm GEO in two ways:
- AI crawlers with time budgets may abandon slow pages mid-crawl
- Poor user experience reduces brand sentiment signals that AI systems aggregate
GEO-relevant performance targets:
| Metric | Target | Why It Matters for GEO |
|---|---|---|
| Largest Contentful Paint (LCP) | < 2.5s | Ensures content is visible before crawlers time out |
| Time to First Byte (TTFB) | < 600ms | Crawlers measure server responsiveness |
| Cumulative Layout Shift (CLS) | < 0.1 | Content stability affects what the crawler captures |
| Total Blocking Time (TBT) | < 200ms | Blocked JS delays schema rendering |
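The targets in the table can be encoded as a simple budget check. In the sketch below, the metric key names (`lcp_ms`, `ttfb_ms`, `cls`, `tbt_ms`) are assumptions, and `measure_ttfb_ms` is only a rough approximation: it times a single request from your own machine, including DNS and TLS setup, which is not the same as field data or a crawler's own measurement.

```python
import time
import urllib.request

# Targets from the table above; times in milliseconds, CLS is unitless.
TARGETS = {"lcp_ms": 2500, "ttfb_ms": 600, "cls": 0.1, "tbt_ms": 200}


def over_budget(measured):
    """Return the metrics in measured that meet or exceed their target."""
    return {
        name: value
        for name, value in measured.items()
        if name in TARGETS and value >= TARGETS[name]
    }


def measure_ttfb_ms(url, user_agent="geo-audit/0.1"):
    """Approximate time to first byte for one request, in milliseconds."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read(1)  # wait for the first byte of the body
    return (time.perf_counter() - start) * 1000
```

Since every target in the table is a strict "less than", a value exactly on the threshold is reported as over budget.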
6. HTTPS and Canonical URLs
AI systems use canonical URLs to deduplicate entities and build authoritative entity graphs.
Requirements:
- All pages must be served over HTTPS
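- <link rel="canonical"> must be present on every page
- The canonical URL must match the url value in your JSON-LD schema exactly, and @id values should be built on the same URL (for example, with a #fragment appended)
- No redirects between the canonical URL and the page URL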
- Consistent trailing slash usage (choose one, never mix)
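The requirements above can be checked mechanically for a single page. The sketch below compares the canonical link against the `url` and `@id` values of top-level JSON-LD objects. `CanonicalAudit` and `canonical_mismatches` are hypothetical names, and nested or `@graph`-wrapped schema would need extra handling that is omitted here.

```python
import json
from html.parser import HTMLParser


class CanonicalAudit(HTMLParser):
    """Collects the canonical href plus url and @id values from JSON-LD blocks."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.schema_urls = []
        self.schema_ids = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            block = json.loads("".join(self._buffer))
            if isinstance(block, dict):
                if "url" in block:
                    self.schema_urls.append(block["url"])
                if "@id" in block:
                    self.schema_ids.append(block["@id"])


def canonical_mismatches(page_html):
    """Return a list of problems; an empty list means the page is consistent."""
    audit = CanonicalAudit()
    audit.feed(page_html)
    audit.close()
    if audit.canonical is None:
        return ["missing canonical link"]
    problems = []
    if not audit.canonical.startswith("https://"):
        problems.append("canonical is not HTTPS")
    for url in audit.schema_urls:
        if url != audit.canonical:
            problems.append(f"url mismatch: {url}")
    for node_id in audit.schema_ids:
        # Compare the part of @id before any #fragment.
        if node_id.split("#", 1)[0] != audit.canonical:
            problems.append(f"@id mismatch: {node_id}")
    return problems
```

The comparison is deliberately exact, so a missing trailing slash in the schema `url` is reported as a mismatch, which matches the rule about consistent trailing slash usage.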
<!-- In <head> -->
<link rel="canonical" href="https://www.example.com/product/5-senko/" />
// In JSON-LD: url must match the canonical exactly; @id is the same URL plus a fragment
"@id": "https://www.example.com/product/5-senko/#productgroup",
"url": "https://www.example.com/product/5-senko/"
Technical GEO Audit Checklist
Run this audit quarterly to verify your technical GEO foundation is sound:
Rendering:
- All JSON-LD schema is in the initial HTML response (not injected by JS)
- Key product data (name, price, availability) is server-rendered
- FAQ content is server-rendered
Bot Access:
- robots.txt explicitly allows GPTBot, ClaudeBot, PerplexityBot, Google-Extended
- Cloudflare Bot Fight Mode is not blocking AI crawlers
- WAF rules do not apply to known AI crawler user agents
- Rate limiting exemptions exist for AI crawlers
Manifests and Discovery:
- products-manifest.json returns HTTP 200 with Content-Type: application/json
- Manifest is referenced from homepage <head>
- Manifest is listed in robots.txt
- Manifest URL is in the XML sitemap
- Sitemap containing the manifest URL has been submitted to Google Search Console
- Manifest URL has been inspected and indexing requested in Google Search Console URL Inspection tool
- /ai-catalog landing page exists and is current
- llms.txt exists and describes AI-accessible resources
- robots.txt includes LLMs:, Manifest:, and MCP: directives (where applicable)
MCP Server (if applicable):
- MCP endpoint is hosted on the primary domain (not a third-party subdomain)
- .well-known/mcp.json returns HTTP 200 with Content-Type: application/json
- Homepage <head> includes <link rel="service"> to MCP endpoint
- llms.txt references the MCP endpoint with capability descriptions
- robots.txt allows /mcp and declares MCP: directive
- MCP landing page (/mcp-info or /ai-tools) exists and is crawlable
- Submitted to Smithery.ai, Glama, PulseMCP, MCP.so
- Every tool has a 50+ word description with example inputs and usage guidance
- See full checklist: MCP Discovery
URLs and Canonicalization:
- All pages have <link rel="canonical">
- Canonical URLs match @id values in JSON-LD
- Sitemap lastmod dates are accurate
- Sitemap is submitted in Google Search Console