Tracking Non-HTML Files
AI crawlers almost always fetch robots.txt and sitemap.xml before they ever touch an HTML page.
The standard JavaScript snippet can't run inside those files — so here's how to capture those visits with a server-side tracking call instead.
01 Why non-HTML tracking matters
When GPTBot or PerplexityBot discovers your site, its first request is typically to robots.txt, before it crawls a single page. sitemap.xml usually comes next. If you only have the JavaScript pixel on your HTML pages, you are missing the first and most consistent signal of an AI crawl.
Files like llms.txt and ai.txt are read specifically by LLM crawlers as instructions. Knowing when an AI model hits those files tells you the model is actively assessing your site for ingestion, which is a high-value signal.
02 How server-side tracking works
Instead of an HTML <img> tag, your server fires a background HTTP GET request to the tracking endpoint whenever the target file is served. The request carries the same information (page path, User-Agent, IP), so the visit appears in your dashboard exactly like a standard pixel hit.
The endpoint to call:
https://aeofix.com/api/bot-pixel?page=PATH&site=YOUR_SITE_ID
1. Replace PATH with the file being served, e.g. /robots.txt
2. Replace YOUR_SITE_ID with your site ID from the dashboard (e.g. beta-0a5b80b228bd)
3. Forward the visitor's User-Agent header in the tracking request; this is how the bot is identified
4. Fire the request in the background (non-blocking) so it doesn't slow down the file response
Always forward the visitor's User-Agent header. If you send your server's own UA, the bot won't be detected, because the endpoint identifies bots by their UA string.
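The four steps above can be sketched in a few lines of TypeScript. This is a minimal illustration, not an official client: only the endpoint URL and its page and site parameters come from this page, while the names buildPixelUrl, trackFile and the injectable send parameter are hypothetical. It assumes a runtime with a global fetch (Node 18+, Workers, Deno).

```typescript
// Endpoint and query parameters are taken from the text above; every
// function and type name below is a hypothetical helper for illustration.
const PIXEL_ENDPOINT = "https://aeofix.com/api/bot-pixel";

// Anything that can perform the GET; injectable so the logic can be unit-tested.
type Sender = (url: string, headers: Record<string, string>) => Promise<unknown>;

/** Steps 1 and 2: build the tracking URL for the file being served. */
function buildPixelUrl(path: string, siteId: string): string {
  return `${PIXEL_ENDPOINT}?page=${encodeURIComponent(path)}&site=${encodeURIComponent(siteId)}`;
}

/** Default sender: a plain GET using the runtime's global fetch. */
const fetchSender: Sender = (url, headers) => fetch(url, { headers });

/**
 * Steps 3 and 4: forward the visitor's User-Agent and start the request
 * without making the caller wait. The returned promise never rejects, so it
 * can be ignored or handed to a waitUntil-style hook on edge runtimes.
 */
function trackFile(
  path: string,
  siteId: string,
  visitorUserAgent: string,
  send: Sender = fetchSender,
): Promise<void> {
  const url = buildPixelUrl(path, siteId);
  return send(url, { "user-agent": visitorUserAgent }).then(
    () => undefined,
    () => undefined, // swallow network errors: tracking must never break the file response
  );
}
```

A caller serving robots.txt would invoke trackFile('/robots.txt', siteId, visitorUa) and send the file immediately, without awaiting the returned promise.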
03 Which files to track
| File | Why it matters | Priority |
|---|---|---|
| /robots.txt | First file every bot fetches. Tells you the crawl started. | Critical |
| /sitemap.xml | Bots use this to discover your pages. Hit = active crawl session. | Critical |
| /llms.txt | LLM-specific instructions file. Only AI crawlers read this. | Critical |
| /ai.txt | Alternative AI instructions file. Same signal as llms.txt. | High |
| /feed.xml / /rss.xml | Content feeds. AI training crawlers frequently ingest RSS. | High |
| /sitemap_index.xml | Root sitemap index on large sites. | High |
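The table translates directly into a lookup that a server can consult before firing the tracking call. The sketch below is an assumption about how you might encode it; the names TRACKED_FILES and trackingPriority are hypothetical, and only the paths and priorities come from the table.

```typescript
type Priority = "Critical" | "High";

// Paths and priorities copied from the table above.
const TRACKED_FILES = new Map<string, Priority>([
  ["/robots.txt", "Critical"],
  ["/sitemap.xml", "Critical"],
  ["/llms.txt", "Critical"],
  ["/ai.txt", "High"],
  ["/feed.xml", "High"],
  ["/rss.xml", "High"],
  ["/sitemap_index.xml", "High"],
]);

/** Return the priority for a request URL, or null if the file is not tracked. */
function trackingPriority(requestUrl: string): Priority | null {
  // Drop the query string and fragment so "/robots.txt?x=1" still matches.
  const path = requestUrl.split(/[?#]/, 1)[0] ?? "";
  return TRACKED_FILES.get(path) ?? null;
}
```

Matching on the path alone, rather than the raw request URI, avoids missing hits when a crawler appends a query string.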
04 WordPress
Add this to your theme's functions.php or a site-specific plugin.
It hooks into WordPress's robots.txt and sitemap filters and fires a background tracking request.
// AEOfix Bot Tracker: non-HTML file tracking
// Add to functions.php or a custom plugin
define('AEOFIX_SITE_ID', 'YOUR_SITE_ID'); // ← replace this

function aeofix_track_file($path) {
    $ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $url = 'https://aeofix.com/api/bot-pixel?page='
         . urlencode($path)
         . '&site=' . urlencode(AEOFIX_SITE_ID);
    // Fire and forget (non-blocking), forwarding the visitor's User-Agent
    wp_remote_get($url, [
        'timeout'     => 1,
        'blocking'    => false,
        'user-agent'  => $ua,
        'httpversion' => '1.1',
    ]);
}

// Request path without the query string
function aeofix_request_path() {
    return (string) strtok($_SERVER['REQUEST_URI'] ?? '', '?');
}

// Track robots.txt
add_filter('robots_txt', function($output) {
    aeofix_track_file('/robots.txt');
    return $output;
});

// Track sitemap requests (Yoast, Rank Math, WP core)
add_action('template_redirect', function() {
    $path = aeofix_request_path();
    // str_contains / str_ends_with need PHP 8, or WordPress 5.9+ which polyfills them
    if (str_contains($path, 'sitemap') || str_ends_with($path, '.xml')) {
        aeofix_track_file($path);
    }
});

// Track llms.txt / ai.txt / feeds when the request reaches WordPress.
// Static files served directly by the web server never run this hook.
add_action('init', function() {
    $path = aeofix_request_path();
    if (in_array($path, ['/llms.txt', '/ai.txt', '/feed/', '/feed/rss2/'], true)) {
        aeofix_track_file($path);
    }
});
05 Cloudflare Workers
If your site is proxied through Cloudflare, a Worker intercepts every request at the edge — no server access needed. Create a Worker in the Cloudflare dashboard and add a route for your domain.
const SITE_ID = 'YOUR_SITE_ID'; // ← replace this

const TRACKED_PATHS = new Set([
  '/robots.txt',
  '/sitemap.xml',
  '/sitemap_index.xml',
  '/llms.txt',
  '/ai.txt',
  '/feed.xml',
  '/rss.xml',
]);

export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    const path = url.pathname;

    // Fire tracking in the background so it doesn't slow down the response
    if (TRACKED_PATHS.has(path) || path.includes('sitemap')) {
      const pixelUrl = `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${SITE_ID}`;
      ctx.waitUntil(
        fetch(pixelUrl, {
          headers: {
            'user-agent': request.headers.get('user-agent') || '',
            'x-forwarded-for': request.headers.get('cf-connecting-ip') || '',
          },
        }).catch(() => {})
      );
    }

    // Always pass through to origin
    return fetch(request);
  },
};
1. Go to Workers & Pages in your Cloudflare dashboard
2. Create a new Worker, paste the code above, and replace YOUR_SITE_ID
3. Add a route: yourdomain.com/* → point to this Worker
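06 Nginx
Use post_action to fire a subrequest after the file is served. Add this inside your server {} block. Note that post_action is not covered in the official nginx documentation, so test it on your build before relying on it.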
# Internal proxy location, not publicly accessible
location = /_aeofix_track {
    internal;
    proxy_pass https://aeofix.com/api/bot-pixel;
    proxy_set_header Host aeofix.com;
    proxy_set_header User-Agent $http_user_agent;
    proxy_set_header X-Forwarded-For $remote_addr;
    # Send SNI on the TLS handshake (off by default); many HTTPS hosts need it
    proxy_ssl_server_name on;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}

# robots.txt
location = /robots.txt {
    try_files $uri =404;
    post_action /_aeofix_track?page=/robots.txt&site=YOUR_SITE_ID;
}

# sitemap.xml
location = /sitemap.xml {
    try_files $uri =404;
    post_action /_aeofix_track?page=/sitemap.xml&site=YOUR_SITE_ID;
}

# llms.txt
location = /llms.txt {
    try_files $uri =404;
    post_action /_aeofix_track?page=/llms.txt&site=YOUR_SITE_ID;
}
Replace YOUR_SITE_ID in each post_action line. Run nginx -t && nginx -s reload after editing.
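The config above covers three files, while section 03 lists several more. Because each location block follows the same pattern, the rest can be generated rather than typed by hand. The following TypeScript is a rough sketch of such a generator; renderLocation and renderLocations are hypothetical helper names, and the emitted directives simply mirror the blocks shown above.

```typescript
/** Render one nginx location block that serves a static file and fires the tracker. */
function renderLocation(path: string, siteId: string): string {
  return [
    `location = ${path} {`,
    `    try_files $uri =404;`,
    `    post_action /_aeofix_track?page=${path}&site=${siteId};`,
    `}`,
  ].join("\n");
}

/** Render blocks for several files, separated by blank lines. */
function renderLocations(paths: string[], siteId: string): string {
  return paths.map((p) => renderLocation(p, siteId)).join("\n\n");
}

// Example: blocks for the files not covered by the hand-written config.
const extraBlocks = renderLocations(
  ["/ai.txt", "/feed.xml", "/rss.xml", "/sitemap_index.xml"],
  "YOUR_SITE_ID", // replace with your site ID
);
```

Paste the generated text into the same server {} block and validate it with nginx -t before reloading.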
07 Apache / .htaccess
Apache doesn't have a native background-request feature, so a tiny PHP proxy file serves the tracked file and makes the tracking call.
Step 1 Create the proxy file
<?php
// Place this file in your site root as aeofix-track.php.
// It serves the requested file, then fires the tracking call.
$site = 'YOUR_SITE_ID'; // ← replace this
$page = $_GET['page'] ?? '';

// Only serve the files that the rewrite rule routes here
if (!preg_match('#^/(robots\.txt|llms\.txt|sitemap[\w.-]*\.xml)$#', $page)
    || !is_file(__DIR__ . $page)) {
    http_response_code(404);
    exit;
}

// Serve the real file first
$type = substr($page, -4) === '.xml' ? 'application/xml' : 'text/plain';
header('Content-Type: ' . $type . '; charset=utf-8');
readfile(__DIR__ . $page);

// Forward the visitor's User-Agent, stripped of CR/LF
$ua  = str_replace(["\r", "\n"], '', $_SERVER['HTTP_USER_AGENT'] ?? '');
$url = 'https://aeofix.com/api/bot-pixel?page='
     . urlencode($page) . '&site=' . urlencode($site);
$ctx = stream_context_create(['http' => [
    'method'        => 'GET',
    'header'        => 'User-Agent: ' . $ua . "\r\n",
    'timeout'       => 2,
    'ignore_errors' => true,
]]);
@file_get_contents($url, false, $ctx);
exit;
Step 2 Add rewrite rules to .htaccess
# AEOfix non-HTML bot tracking
RewriteEngine On
# Route robots.txt, llms.txt and sitemap*.xml through the proxy,
# which serves the file and fires the tracker
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(robots\.txt|llms\.txt|sitemap[\w.-]*\.xml)$ /aeofix-track.php?page=/$1 [L]
08 Node.js / Next.js
For Next.js App Router, add this to middleware.ts in your project root.
import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';

const SITE_ID = 'YOUR_SITE_ID'; // ← replace this

const TRACKED = new Set([
  '/robots.txt', '/sitemap.xml', '/sitemap_index.xml',
  '/llms.txt', '/ai.txt', '/feed.xml',
]);

export function middleware(request: NextRequest, event: NextFetchEvent) {
  const path = request.nextUrl.pathname;
  if (TRACKED.has(path) || path.includes('sitemap')) {
    const ua = request.headers.get('user-agent') || '';
    const ip = request.headers.get('x-forwarded-for') || '';
    const url = `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${SITE_ID}`;
    // Non-blocking: waitUntil lets the request finish after the response is returned,
    // instead of being dropped when the middleware exits
    event.waitUntil(
      fetch(url, {
        headers: { 'user-agent': ua, 'x-forwarded-for': ip },
      }).catch(() => {})
    );
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/robots.txt', '/sitemap:path*', '/llms.txt', '/ai.txt', '/feed.xml'],
};
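For a plain Node.js server without Next.js, the same idea can be sketched as a helper called at the top of the request handler. This is an illustration under stated assumptions: trackNodeRequest, the NodeRequestLike type and the injectable send parameter are hypothetical names, the request shape mirrors what Node's http.IncomingMessage provides, and a global fetch (Node 18+) is assumed.

```typescript
// Minimal structural type so the sketch does not depend on @types/node;
// Node's http.IncomingMessage has these fields.
interface NodeRequestLike {
  url?: string;
  headers: Record<string, string | string[] | undefined>;
  socket?: { remoteAddress?: string };
}
type Sender = (url: string, headers: Record<string, string>) => Promise<unknown>;

const SITE_ID = "YOUR_SITE_ID"; // replace with your site ID
const TRACKED = new Set([
  "/robots.txt", "/sitemap.xml", "/sitemap_index.xml",
  "/llms.txt", "/ai.txt", "/feed.xml",
]);

/** Node may expose a header as a string, an array, or undefined. */
function firstValue(h: string | string[] | undefined): string {
  return Array.isArray(h) ? (h[0] ?? "") : (h ?? "");
}

/** Fire the tracking call for tracked paths; returns true if a call was started. */
function trackNodeRequest(
  req: NodeRequestLike,
  send: Sender = (url, headers) => fetch(url, { headers }),
): boolean {
  const path = (req.url ?? "/").split("?", 1)[0] ?? "/";
  if (!TRACKED.has(path) && path.indexOf("sitemap") === -1) return false;

  const ip = firstValue(req.headers["x-forwarded-for"]) || req.socket?.remoteAddress || "";
  const url = `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${encodeURIComponent(SITE_ID)}`;
  // Not awaited: the file response should never wait on the tracker.
  void send(url, {
    "user-agent": firstValue(req.headers["user-agent"]),
    "x-forwarded-for": ip,
  }).catch(() => {});
  return true;
}

// Usage inside a node:http server (shown as a comment to keep the sketch dependency-free):
//   createServer((req, res) => { trackNodeRequest(req); /* ...then serve the file... */ });
```

In a long-lived Node process an un-awaited fetch completes normally, which is why this variant does not need waitUntil.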
09 PHP (generic)
For any PHP site, add one line at the top of the file that serves robots.txt or sitemap.xml.
<?php
// Drop this at the top of robots.php, sitemap.php, feed.php, etc.
aeofix_track('/robots.txt');

function aeofix_track($page) {
    $ua  = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $url = 'https://aeofix.com/api/bot-pixel?page='
         . urlencode($page) . '&site=YOUR_SITE_ID';
    $ctx = stream_context_create(['http' => [
        'method'        => 'GET',
        'header'        => 'User-Agent: ' . $ua,
        'timeout'       => 1,
        'ignore_errors' => true,
    ]]);
    @file_get_contents($url, false, $ctx);
}
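10 Vercel (middleware.js)
For Vercel-hosted sites, add tracking in middleware.js at your project root. The example imports from next/server, so it assumes a Next.js project, and it uses event.waitUntil so the tracking request finishes after the response is sent.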
import { NextResponse } from 'next/server';

const SITE_ID = 'YOUR_SITE_ID'; // ← replace this

const TRACKED = new Set([
  '/robots.txt', '/sitemap.xml', '/llms.txt', '/ai.txt',
]);

export default async function middleware(request, event) {
  const path = new URL(request.url).pathname;
  if (TRACKED.has(path)) {
    const pixelUrl =
      `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${SITE_ID}`;
    event.waitUntil(
      fetch(pixelUrl, {
        headers: {
          'user-agent': request.headers.get('user-agent') || '',
          'x-forwarded-for': request.headers.get('x-forwarded-for') || '',
        },
      }).catch(() => {})
    );
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/robots.txt', '/sitemap.xml', '/llms.txt', '/ai.txt'],
};
✓ Quick reference
| Platform | Method | Blocking? |
|---|---|---|
| WordPress | wp_remote_get with 'blocking' => false | No |
| Cloudflare | ctx.waitUntil(fetch(...)) | No |
| Nginx | post_action subrequest | No |
| Apache | PHP proxy with timeout=2 | ~2s max |
| Next.js | Background fetch(...) in middleware | No |
| PHP | file_get_contents with timeout=1 | ~1s max |
| Vercel | event.waitUntil(fetch(...)) | No |