Tracking Non-HTML Files

AI crawlers almost always fetch robots.txt and sitemap.xml before they ever touch an HTML page. The standard JavaScript snippet can't run inside those files — so here's how to capture those visits with a server-side tracking call instead.

01 Why non-HTML tracking matters

When GPTBot or PerplexityBot discovers your site, its first request is typically to robots.txt, before it crawls a single page. sitemap.xml usually comes next. If the JavaScript pixel is only on your HTML pages, you miss the earliest and most consistent signal of an AI crawl.

Files like llms.txt and ai.txt are read specifically by LLM crawlers as instructions. Knowing when an AI model hits those files tells you the model is actively assessing your site for ingestion — that is a high-value signal.

Bottom line: The JavaScript snippet covers HTML pages. Server-side tracking covers everything else — and everything else is where AI crawlers announce themselves first.

02 How server-side tracking works

Instead of an HTML <img> tag, your server fires a background HTTP GET request to the tracking endpoint whenever the target file is served. The request carries the same information — page path, User-Agent, IP — so the visit appears in your dashboard exactly like a standard pixel hit.

The endpoint to call:

URL
https://aeofix.com/api/bot-pixel?page=PATH&site=YOUR_SITE_ID
  1. Replace PATH with the file being served, e.g. /robots.txt
  2. Replace YOUR_SITE_ID with your site ID from the dashboard (e.g. beta-0a5b80b228bd)
  3. Forward the visitor's User-Agent header in the tracking request; this is how the bot is identified
  4. Fire the request in the background (non-blocking) so it doesn't slow down the file response
Important: Always forward the original visitor's User-Agent header. If you send your server's own UA, the bot won't be detected — the endpoint identifies bots by their UA string.
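
The steps above can be sketched as one small helper. This is an illustration only, not an official client: the name `buildPixelRequest` and its return shape are assumptions made for this example. Only the endpoint URL, the `page` and `site` parameters, and the forwarded User-Agent come from the steps above.

```typescript
// Hypothetical helper (not part of any AEOfix SDK): builds the tracking URL
// and headers described in the steps above.
const PIXEL_ENDPOINT = 'https://aeofix.com/api/bot-pixel';

interface PixelRequest {
  url: string;
  headers: Record<string, string>;
}

function buildPixelRequest(
  path: string,
  siteId: string,
  visitorUserAgent: string,
): PixelRequest {
  // Percent-encode both values so a path such as "/robots.txt" is safe in a query string.
  const query = `page=${encodeURIComponent(path)}&site=${encodeURIComponent(siteId)}`;
  return {
    url: `${PIXEL_ENDPOINT}?${query}`,
    // Forward the visitor's User-Agent, never the server's own.
    headers: { 'user-agent': visitorUserAgent },
  };
}
```

A caller would pass the result to its HTTP client without awaiting it, for example `fetch(req.url, { headers: req.headers }).catch(() => {})`, so the file response is not delayed.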

03 Which files to track

| File | Why it matters | Priority |
|---|---|---|
| /robots.txt | First file every bot fetches. Tells you the crawl started. | Critical |
| /sitemap.xml | Bots use this to discover your pages. Hit = active crawl session. | Critical |
| /llms.txt | LLM-specific instructions file. Only AI crawlers read this. | Critical |
| /ai.txt | Alternative AI instructions file. Same signal as llms.txt. | High |
| /feed.xml, /rss.xml | Content feeds. AI training crawlers frequently ingest RSS. | High |
| /sitemap_index.xml | Root sitemap index on large sites. | High |
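
The table above can be turned into a lookup that a server or middleware consults before firing a tracking call. This is a minimal sketch: `trackingPriority` and `TRACKED_FILES` are names invented for this example, and the priorities are copied from the table rather than from any API.

```typescript
type Priority = 'Critical' | 'High';

// Paths and priorities copied from the table above.
const TRACKED_FILES = new Map<string, Priority>([
  ['/robots.txt', 'Critical'],
  ['/sitemap.xml', 'Critical'],
  ['/llms.txt', 'Critical'],
  ['/ai.txt', 'High'],
  ['/feed.xml', 'High'],
  ['/rss.xml', 'High'],
  ['/sitemap_index.xml', 'High'],
]);

// Returns the priority of a tracked file, or null when the path is not tracked.
function trackingPriority(requestUri: string): Priority | null {
  // Drop any query string or fragment so "/robots.txt?x=1" still matches.
  const path = requestUri.split(/[?#]/)[0];
  return TRACKED_FILES.get(path) ?? null;
}
```

A Map is used instead of a plain object so that paths such as `constructor` cannot accidentally match inherited properties.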

04 WordPress

Add this to your theme's functions.php or a site-specific plugin. It hooks into WordPress's robots_txt filter and the template_redirect and init actions, and fires a background tracking request for each matching file.

PHP — functions.php
// AEOfix Bot Tracker — non-HTML file tracking
// Add to functions.php or a custom plugin

define('AEOFIX_SITE_ID', 'YOUR_SITE_ID'); // ← replace this

function aeofix_track_file($path) {
    $ua  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $url = 'https://aeofix.com/api/bot-pixel?page='
           . urlencode($path)
           . '&site=' . urlencode(AEOFIX_SITE_ID);

    // Fire and forget — non-blocking
    wp_remote_get($url, [
        'timeout'     => 1,
        'blocking'    => false,
        'user-agent'  => $ua,
        'httpversion' => '1.1',
    ]);
}

// Track robots.txt
add_filter('robots_txt', function($output) {
    aeofix_track_file('/robots.txt');
    return $output;
});

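// Track sitemap requests (Yoast, Rank Math, WP core)
// str_contains / str_ends_with need PHP 8 or the polyfills shipped with WordPress 5.9+
add_action('template_redirect', function() {
    // Match on the path only, so a query string cannot defeat the checks
    $path = explode('?', $_SERVER['REQUEST_URI'] ?? '', 2)[0];
    if (str_contains($path, 'sitemap') || str_ends_with($path, '.xml')) {
        aeofix_track_file($path);
    }
});

// Track llms.txt / ai.txt and feeds when WordPress itself handles the request.
// Physical files in the web root are served by the web server and never reach PHP;
// track those with the Nginx or Apache setups instead.
add_action('init', function() {
    $path = explode('?', $_SERVER['REQUEST_URI'] ?? '', 2)[0];
    if (in_array($path, ['/llms.txt', '/ai.txt', '/feed/', '/feed/rss2/'], true)) {
        aeofix_track_file($path);
    }
});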

05 Cloudflare Workers

If your site is proxied through Cloudflare, a Worker intercepts every request at the edge — no server access needed. Create a Worker in the Cloudflare dashboard and add a route for your domain.

JavaScript — Cloudflare Worker
const SITE_ID = 'YOUR_SITE_ID'; // ← replace this

const TRACKED_PATHS = new Set([
  '/robots.txt',
  '/sitemap.xml',
  '/sitemap_index.xml',
  '/llms.txt',
  '/ai.txt',
  '/feed.xml',
  '/rss.xml',
]);

export default {
  async fetch(request, env, ctx) {
    const url  = new URL(request.url);
    const path = url.pathname;

    // Fire tracking in background — doesn't slow down the response
    if (TRACKED_PATHS.has(path) || path.includes('sitemap')) {
      const pixelUrl = `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${SITE_ID}`;
      ctx.waitUntil(
        fetch(pixelUrl, {
          headers: {
            'user-agent': request.headers.get('user-agent') || '',
            'x-forwarded-for': request.headers.get('cf-connecting-ip') || '',
          },
        }).catch(() => {})
      );
    }

    // Always pass through to origin
    return fetch(request);
  },
};

  1. Go to Workers & Pages in your Cloudflare dashboard
  2. Create a new Worker, paste the code above, replace YOUR_SITE_ID
  3. Add a route: yourdomain.com/* → point to this Worker

06 Nginx

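Use post_action to fire a subrequest after the file is served. Add this inside your server {} block. Note that post_action is an undocumented Nginx directive, so test this configuration and confirm the hits reach your dashboard before relying on it.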

Nginx — nginx.conf
# Internal proxy location (not publicly accessible)
location = /_aeofix_track {
    internal;
    proxy_pass          https://aeofix.com/api/bot-pixel;
    # Send SNI so the TLS handshake reaches the right virtual host
    proxy_ssl_server_name on;
    proxy_set_header    Host aeofix.com;
    proxy_set_header    User-Agent $http_user_agent;
    proxy_set_header    X-Forwarded-For $remote_addr;
    proxy_pass_request_body off;
    proxy_set_header    Content-Length "";
}

# robots.txt
location = /robots.txt {
    try_files $uri =404;
    post_action /_aeofix_track?page=/robots.txt&site=YOUR_SITE_ID;
}

# sitemap.xml
location = /sitemap.xml {
    try_files $uri =404;
    post_action /_aeofix_track?page=/sitemap.xml&site=YOUR_SITE_ID;
}

# llms.txt
location = /llms.txt {
    try_files $uri =404;
    post_action /_aeofix_track?page=/llms.txt&site=YOUR_SITE_ID;
}

Replace YOUR_SITE_ID in each post_action line. Run nginx -t && nginx -s reload after editing.

07 Apache / .htaccess

Apache has no native way to fire a background request from .htaccess, and a rewrite rule can either serve a file or hand the request to a script, not both. A small PHP proxy therefore handles the tracked files: it sends the tracking call, then outputs the requested file itself. This assumes the tracked files exist as static files in your site root.

Step 1 Create the proxy file

PHP — /aeofix-track.php
<?php
// Place this file in your site root as aeofix-track.php
$site = 'YOUR_SITE_ID'; // ← replace this
$page = '/' . ltrim($_GET['page'] ?? '', '/');

// Only handle the expected file names, so ?page= cannot be used to read other files
if (!preg_match('#^/(robots\.txt|llms\.txt|sitemap[\w.-]*\.xml)$#', $page)
    || !is_file(__DIR__ . $page)) {
    http_response_code(404);
    exit;
}

// The rewrite is internal, so the visitor's original User-Agent is still available.
// Strip CR/LF so the value cannot inject extra headers.
$ua  = str_replace(["\r", "\n"], '', $_SERVER['HTTP_USER_AGENT'] ?? '');
$url = 'https://aeofix.com/api/bot-pixel?page='
       . urlencode($page) . '&site=' . urlencode($site);

$ctx = stream_context_create(['http' => [
    'method'        => 'GET',
    'header'        => 'User-Agent: ' . $ua . "\r\n",
    'timeout'       => 2,
    'ignore_errors' => true,
]]);
@file_get_contents($url, false, $ctx);

// Serve the requested file
$type = substr($page, -4) === '.xml' ? 'application/xml' : 'text/plain';
header('Content-Type: ' . $type . '; charset=utf-8');
readfile(__DIR__ . $page);
exit;

Step 2 Add rewrite rules to .htaccess

Apache — .htaccess
# AEOfix non-HTML bot tracking
RewriteEngine On

# Route robots.txt, llms.txt and sitemap files through the proxy,
# which fires the tracker and then serves the file
RewriteRule ^(robots\.txt|llms\.txt|sitemap[\w.-]*\.xml)$ /aeofix-track.php?page=/$1 [L]

08 Node.js / Next.js

For Next.js App Router, add this to middleware.ts in your project root.

TypeScript — middleware.ts (Next.js)
import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';

const SITE_ID = 'YOUR_SITE_ID'; // ← replace this

const TRACKED = new Set([
  '/robots.txt', '/sitemap.xml', '/sitemap_index.xml',
  '/llms.txt', '/ai.txt', '/feed.xml',
]);

export function middleware(request: NextRequest, event: NextFetchEvent) {
  const path = request.nextUrl.pathname;

  if (TRACKED.has(path) || path.includes('sitemap')) {
    const ua  = request.headers.get('user-agent') || '';
    const ip  = request.headers.get('x-forwarded-for') || '';
    const url = `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${SITE_ID}`;

    // Not awaited, so the response is not delayed. waitUntil keeps the
    // request alive after the response has been returned.
    event.waitUntil(
      fetch(url, {
        headers: { 'user-agent': ua, 'x-forwarded-for': ip },
      }).catch(() => {})
    );
  }

  return NextResponse.next();
}

export const config = {
  matcher: ['/robots.txt', '/sitemap:path*', '/llms.txt', '/ai.txt', '/feed.xml'],
};

09 PHP (generic)

For any PHP site, add this snippet at the top of the script that serves robots.txt or sitemap.xml, and pass the path that script serves. The call blocks for at most the 1-second timeout.

PHP — any file
<?php
// Drop this at the top of robots.php, sitemap.php, feed.php, etc.
aeofix_track('/robots.txt');

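function aeofix_track($page) {
    // Strip CR/LF so the forwarded value cannot inject extra headers
    $ua  = str_replace(["\r", "\n"], '', $_SERVER['HTTP_USER_AGENT'] ?? '');
    $url = 'https://aeofix.com/api/bot-pixel?page='
           . urlencode($page) . '&site=YOUR_SITE_ID';
    $ctx = stream_context_create(['http' => [
        'method'        => 'GET',
        'header'        => 'User-Agent: ' . $ua . "\r\n",
        'timeout'       => 1,
        'ignore_errors' => true,
    ]]);
    // Errors are suppressed so a failed tracking call never breaks the file response
    @file_get_contents($url, false, $ctx);
}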

10 Vercel (middleware.js)

For Vercel-hosted sites, add tracking in middleware.js at your project root. It uses event.waitUntil so the tracking request can finish after the response is returned.

JavaScript — middleware.js (Vercel Edge)
import { NextResponse } from 'next/server';

const SITE_ID  = 'YOUR_SITE_ID'; // ← replace this
const TRACKED = new Set([
  '/robots.txt', '/sitemap.xml', '/llms.txt', '/ai.txt',
]);

export default async function middleware(request, event) {
  const path = new URL(request.url).pathname;

  if (TRACKED.has(path)) {
    const pixelUrl =
      `https://aeofix.com/api/bot-pixel?page=${encodeURIComponent(path)}&site=${SITE_ID}`;

    event.waitUntil(
      fetch(pixelUrl, {
        headers: {
          'user-agent':       request.headers.get('user-agent') || '',
          'x-forwarded-for':  request.headers.get('x-forwarded-for') || '',
        },
      }).catch(() => {})
    );
  }

  return NextResponse.next();
}

export const config = {
  matcher: ['/robots.txt', '/sitemap.xml', '/llms.txt', '/ai.txt'],
};

Quick reference

| Platform | Method | Blocking? |
|---|---|---|
| WordPress | wp_remote_get with 'blocking' => false | No |
| Cloudflare | ctx.waitUntil(fetch(...)) | No |
| Nginx | post_action subrequest | No |
| Apache | PHP proxy with timeout=2 | ~2s max |
| Next.js | background fetch(...), not awaited | No |
| PHP | file_get_contents with timeout=1 | ~1s max |
| Vercel | event.waitUntil(fetch(...)) | No |
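
The "Blocking?" column comes down to whether the handler waits for the tracking call before it responds. The sketch below shows the non-blocking pattern with the HTTP call injected, so it can be exercised without a network. `fireAndForget` and `Sender` are names invented for this example and are not part of any platform API.

```typescript
// A function that starts the tracking request and returns its promise.
type Sender = () => Promise<unknown>;

// Starts the tracking call without waiting for it, and makes sure that
// neither a rejected promise nor a synchronous throw reaches the caller.
function fireAndForget(send: Sender): void {
  try {
    // Not awaited: the caller continues immediately. Async failures are swallowed.
    send().catch(() => {});
  } catch {
    // A sender that throws synchronously must not break the file response either.
  }
}
```

In a long-running Node.js server this could be called as `fireAndForget(() => fetch(pixelUrl, { headers }))`. On edge runtimes, pass the promise to waitUntil as in the Cloudflare and Vercel sections, since work that is not awaited may be cut off once the response is returned.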