My Web Crawler Crashed a Server at Los Alamos

In 1993, I wrote a spider that took down a national laboratory server. Now billion-dollar companies do it to the entire internet.


In 1993, I wrote a web crawler that took down a server at Los Alamos National Laboratory. The sysadmin was furious. Thirty years later, companies worth billions are doing the same thing to the entire internet. On purpose.

TL;DR

Check your robots.txt. Then check your server logs. The AI crawlers ignoring it follow the same pattern as my reckless 1993 spider, except these have billion-dollar budgets and legal teams.

Updated February 2026: Added links to related articles on AI content farms and vendor practices.

I was twenty-two, running this thing on some underpowered box I had lying around. The web was small enough that you could almost touch the edges. Maybe a few hundred websites. I'd been programming since the late seventies, had been a BBS sysop, had run servers from my house. When the World Wide Web showed up, the first thing I wanted to do was map it. So I wrote a crawler.

The logic was simple: start with a URL, fetch the page, extract every link, follow each one, repeat. Depth-first, no delay between requests, no rate limiting. I didn't know those concepts existed yet. Nobody did. Matthew Gray's World Wide Web Wanderer, created at MIT that same year, was doing roughly the same thing. We were all figuring it out as we went.

The Los Alamos Incident

My crawler hit Los Alamos because they were one of the early institutions on the web. National laboratories, universities, CERN. That was the web in 1993. My spider followed a link to their server and started doing what spiders do: pulling every page, following every link, requesting every resource it found.

The problem was recursion. Their server had some kind of dynamic path structure where links generated more links. My crawler didn't track where it had been. It didn't have a politeness delay. It just kept requesting, faster than a human ever would, following an expanding tree of URLs that may have been functionally infinite.

Their server couldn't handle it. Whatever machine they were running, it buckled under the load of my crawler hammering it without pause. We're talking about 1993. Most web servers were running on hardware that would embarrass a modern thermostat.

The angry email came fast. They'd traced my IP address from their server logs—in 1993, that was all it took. No proxies, no VPNs, no rotating residential IPs. Just a raw IP pointing straight back to my machine. Called me a neophyte. Said I didn't know what the hell I was doing. That my machine was flooding their server and if I didn't stop it immediately, they'd block my entire subnet.

I stopped it immediately. Wrote back a profuse apology. Promised I'd add a rate limiter and a visited-URL set before I ever ran the thing again. I meant it. I figured that was the end of it, just a twenty-two-year-old getting chewed out by a sysadmin who had every right to be furious.

It wasn't the end of it.

The Call That Changed Everything

Months later, the Los Alamos admin did an interview with a tech magazine. Mentioned the incident. Mentioned me. The web was small enough that crashing a server at a national laboratory was apparently newsworthy. And suddenly I was getting calls. Companies wanted to talk to the guy who'd written a crawler aggressive enough to take down Los Alamos. Not because I'd done something smart, but because I'd done something interesting when almost nobody was doing it at all.

One of those calls was from a manager at Microsoft. That call turned into a job offer. The kid who got called a neophyte by a furious sysadmin ended up building a CMS at MSNBC. I've thought about that sequence a lot over the years. The web was so new, so small, that even screwing up spectacularly put you on someone's radar. There weren't ten thousand people writing crawlers. Maybe a dozen. And most of them had crashed somebody's server too.

But beyond the lucky break, I'd learned something fundamental about the relationship between a crawler and a server. One badly behaved client could take down a host. The web had no immune system.

We Were All Writing Crawlers

I wasn't the only one causing problems. In early 1994, a programmer named Charles Stross wrote a crawler called Websnarf that overwhelmed a server run by Martijn Koster. According to Koster's own account, this incident was one of the catalysts for creating the Robots Exclusion Protocol. What we now call robots.txt.

Think about that. The entire system of web crawling etiquette, the protocol that every search engine and well-behaved bot has respected for three decades, exists because some programmer's spider crashed someone else's server. Just like mine crashed Los Alamos.

The origin of robots.txt: A programmer's crawler crashes a server. The server admin creates a protocol asking crawlers to behave. The protocol becomes a gentleman's agreement that holds for 30 years. Then AI companies show up.

Koster's solution was elegant in its simplicity. Put a text file at the root of your website. List the bots you don't want crawling certain paths. Any well-intentioned crawler checks for that file first and respects the rules. No enforcement mechanism. No authentication. Just a polite request: please don't go here.

It worked because in 1994, the web was a community. The same etiquette that governed BBS culture carried over. You didn't trash someone else's server for the same reason you didn't spam someone else's bulletin board. It was rude. And the community was small enough that being rude had consequences.

What My Crawler Taught Me About Scale

After the Los Alamos incident, I rewrote my crawler. Added a visited-URL set so it wouldn't chase infinite loops. Added delays between requests. Added respect for server response codes. Basic stuff, but none of it was obvious until I'd already caused damage.

Here's roughly what my crawler was doing, reconstructed from memory. 1993-era C, BSD sockets, HTTP/1.0:

/* crawler.c - what I was running in 1993 (every line is a lesson) */
void crawl(const char *url) {
    char hostname[256], path[512];
    split_url(url, hostname, path);  /* parse "http://host/path" (elided) */

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    /* BUG: gethostbyname() result is never checked for NULL */
    struct hostent *host = gethostbyname(hostname);
    struct sockaddr_in addr;
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    /* BUG: trusts h_length blindly — if it exceeds sizeof(struct in_addr),
       this smashes the stack. Always verify: host->h_length == 4 */
    memcpy(&addr.sin_addr, host->h_addr, host->h_length);

    connect(sock, (struct sockaddr *)&addr, sizeof(addr));

    char req[512];
    /* BUG: sprintf with no bounds check. If path + hostname > 512 bytes,
       buffer overflow. Use snprintf(req, sizeof(req), ...) always. */
    sprintf(req, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, hostname);
    send(sock, req, strlen(req), 0);

    char buf[8192];
    /* BUG: one recv() for the whole response, and buf is never
       NUL-terminated before the strstr() scan below */
    int n = recv(sock, buf, sizeof(buf), 0);
    close(sock);

    /* extract every href, crawl each one immediately */
    char *p = buf;
    while ((p = strstr(p, "href=\"")) != NULL) {
        p += 6;
        char *end = strchr(p, '"');
        if (end) {
            *end = '\0';
            crawl(p);   /* recursive, no delay, no visited check */
            p = end + 1;
        }
    }
}

No sleep(). No visited set. No robots.txt check. Not even a User-Agent header. No bounds checking. Recursive descent into every link at wire speed. The sprintf alone could smash the stack if a URL was long enough. On a server with 32MB of RAM, this was a death sentence.

Here's what it should have been. Same era, same C, but with O(1) URL deduplication, bounds-checked I/O, and per-host politeness that doesn't block the entire process:

/* polite_crawler.c - what I rewrote it to */
#define CRAWL_DELAY_US  1000000  /* 1 second between same-host requests */
#define HASH_BUCKETS    65536    /* power of 2 for fast modulo */

/* --- O(1) average-case visited-URL set, DJB2-style (xor-variant) hash --- */
typedef struct visited_node {
    char *url;
    struct visited_node *next;
} visited_node;

static visited_node *visited[HASH_BUCKETS];

static unsigned hash_url(const char *s) {
    unsigned h = 5381;
    while (*s) h = ((h << 5) + h) ^ *s++;
    return h & (HASH_BUCKETS - 1);
}

/* returns 1 if url was seen before; otherwise records it and returns 0 */
int already_visited(const char *url) {
    unsigned h = hash_url(url);
    for (visited_node *n = visited[h]; n; n = n->next)
        if (strcmp(n->url, url) == 0) return 1;
    visited_node *n = malloc(sizeof(*n));
    n->url = strdup(url);
    n->next = visited[h];
    visited[h] = n;
    return 0;
}

/* --- per-host queue: only the target host sleeps, not the worker --- */
void crawl_loop(void) {
    while (has_pending_urls()) {
        host_queue *hq = next_ready_host();  /* host whose delay has elapsed */
        if (!hq) {
            /* all hosts cooling down — use select() to wait efficiently */
            wait_until_next_ready();
            continue;
        }
        const char *url = dequeue(hq);
        if (already_visited(url)) continue;
        if (!check_robots_txt(hq->hostname)) continue;

        fetch_and_enqueue_links(url);
        hq->next_allowed = now_usec() + CRAWL_DELAY_US;  /* per-host timer */
    }
}

Why the Server Actually Died

The Los Alamos server was likely a Sun SPARCstation running NCSA HTTPd—the standard setup for institutional web servers in 1993. Here's what happened at the systems level:

File descriptor exhaustion: SunOS typically limited processes to ~256 file descriptors. Each TCP connection held one. My crawler opened connections faster than the server could close them.

Fork bomb effect: NCSA HTTPd forked a child process per request. Fifty concurrent forks on a machine with 32MB of RAM meant the OS was swapping to disk on every new connection. Eventually fork() returned ENOMEM.

NIC buffer saturation: With 16KB kernel socket buffers and a 10Mbps Ethernet link, sustained rapid requests starved the interrupt handler. Legitimate connections got ETIMEDOUT.

Client (my crawler)        Server (SPARCstation)
    |--- GET /page1 -------->|  fork() -> child1 (fd 3)
    |--- GET /page2 -------->|  fork() -> child2 (fd 4)
    |      ...               |       ...
    |--- GET /page254 ------>|  fork() fails: ENOMEM
    |--- GET /page255 ------>|  [connection refused]

The server didn't crash from one bad request. It drowned under hundreds of simultaneous connections from a client that never paused, never checked if the server was struggling, and never stopped following links into an expanding tree.

Reckless Crawler (1993) — What Actually Happened

 t=0.00s  GET /index.html → 200 OK (12 links)
 t=0.01s  GET /page1      → fork() child (fd 3)
 t=0.02s  GET /page2      → fork() child (fd 4)
 t=0.03s  GET /page3      → fork() child (fd 5)
          ...recursive expansion: 12 → 144 → 1728 URLs
 t=0.80s  GET /page254    → fork() fails: ENOMEM (32MB gone)
 t=0.81s  GET /page255    → [connection refused]
          Server dead. All clients affected.

Polite Crawler: How It Should Work

 t=0.00s  GET /robots.txt       → Crawl-delay: 2
 t=0.10s  GET /index.html       → 200 OK (12 links queued)
          Check other hosts for pending work...
 t=2.10s  GET /page1            → 200 OK (skip visited)
 t=4.10s  GET /page2            → 200 OK
 t=6.10s  GET /page3            → 200 OK
          Server healthy. Normal load throughout.

The technical lesson was straightforward: a single client making requests at machine speed can overwhelm a server designed for human-speed browsing. A crawler that fires requests as fast as the network allows is doing something no human visitor would ever do, and the server isn't built for it.

But the real lesson was about responsibility. The web was built on a set of informal agreements. You could crawl, but the expectation was thoughtfulness. Identify yourself. Back off when asked. These weren't laws. They were the social contract of a shared resource.

The Gentleman's Agreement Held. Until It Didn't.

For nearly thirty years, robots.txt worked. Not because it was technically enforceable, but because the major players had incentives to comply. Google crawled your site and sent you traffic in return. Bing did the same. The exchange was clear: you let us index your content, we send you visitors. Both sides benefited.

Then AI happened.

Starting in 2023, a new generation of crawlers appeared. GPTBot. ClaudeBot. Google-Extended. These weren't indexing content to send traffic back. They were vacuuming content to train language models. The exchange that made robots.txt work for decades was gone. The crawlers took everything and sent nothing back.

The backlash was swift. Reuters Institute research found that by end of 2023, 48% of top news websites had blocked OpenAI's crawlers. In the US, 79%. Search Engine Land reported 26 of the top 100 websites blocking GPTBot, with the number surging 250% in a single month.

The BBC. The New York Times. The Guardian. The Washington Post. NPR. One by one, publishers added GPTBot and ClaudeBot to their robots.txt files. Not to keep their content private. To stop companies from profiting off their work without permission, without payment, without even sending traffic back.

The Difference Between Stupid and Malicious

When I crashed the Los Alamos server, I was ignorant. I didn't know better. The web was new, crawler etiquette didn't exist yet, and I was a kid experimenting with a technology that had been public for maybe two years. I apologized, added a rate limiter, and never crashed another server again.

When AI companies send crawlers that ignore robots.txt, they know exactly what they're doing. These are companies with thousands of engineers, legal teams, and billions in funding. They've read robots.txt. They understand the protocol. Some of them have made public commitments to respect it and then been caught not doing so.

Some AI companies have been accused of ignoring robots.txt outright. Others use third-party scraping services that rotate residential proxies and headless browsers to mimic human traffic, bypassing IP blocks and rendering robots.txt technically moot. The data still ends up in the training set. The website owner still gets nothing.

My twenty-two-year-old self crashing a server out of ignorance is forgivable. Billion-dollar companies doing it strategically is a different thing entirely.

When the Commons Gets Enclosed

Here's what bothers me most. The web was built as a shared resource. Tim Berners-Lee made it open on purpose. Hyperlinks worked because anyone could link to anything. Crawlers worked because the content was accessible. The early internet inherited the culture of BBSs and Usenet, where sharing was the default and hoarding was the exception.

The gentleman's agreement of robots.txt encoded that culture into protocol. It said: this is a commons, and we'll maintain it through mutual respect.

What AI companies are doing is classic enclosure of the commons. They're taking a shared resource, built by millions of people over decades, and converting it into proprietary training data. Content creators get nothing. Hosting platforms get nothing. OpenAI's valuation hit $80 billion in early 2024, much of it built on content scraped from the open web. Zero dollars paid to the people who created it.

And the irony cuts deep. These models then generate content that competes with the very sources they were trained on. AI content farms are already polluting search results with machine-generated articles derived from human-written originals. The snake is eating its own tail.

What robots.txt Can't Fix

The fundamental problem is that robots.txt was designed for a different era. It assumed good faith. It assumed that the major crawlers had an incentive to comply because the relationship was mutually beneficial. It assumed a community small enough to enforce norms through reputation.

None of those assumptions hold anymore.

Why robots.txt is breaking down:
  1. No enforcement mechanism. It's a suggestion, not a wall. Any bot can ignore it.
  2. No reciprocal value. Search crawlers sent traffic back. AI crawlers send nothing.
  3. Proxy scraping. Companies use intermediaries to collect data, bypassing blocks.
  4. Scale mismatch. Millions of websites vs. a handful of AI companies with infinite budgets.
  5. Legal ambiguity. Copyright law hasn't caught up. "Fair use" for training data is unsettled.

The Robots Exclusion Protocol was formally standardized as RFC 9309 in September 2022, but standardization doesn't create enforcement. It just documents the agreement that's already being broken.

Some have proposed technical solutions. ai.txt as a separate protocol. Machine-readable licensing metadata. Cryptographic verification of crawler identity. These might help at the margins. But they're all still voluntary. And the companies with the most to gain from ignoring them have the least incentive to comply.

The Files That Define Your Relationship With Crawlers

Most site owners have never looked at the files that govern how machines interact with their site. If you're running anything on the web, you should know what these are and what they actually do.

robots.txt: The Gentleman's Agreement

This file lives at your site root. It's the first thing any well-behaved crawler checks. Here's what a modern one looks like when you're trying to block AI scrapers while still allowing search engines:

# Allow search engines (they send traffic back)
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI training crawlers (they send nothing back)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Block everything else by default
User-agent: *
Disallow: /admin/
Crawl-delay: 10

# Point crawlers to your sitemap
Sitemap: https://example.com/sitemap.xml

Notice the asymmetry. Googlebot gets a green light because Google sends you traffic. GPTBot gets a wall because OpenAI sends you nothing. That distinction is the entire story of what went wrong with the web's social contract.

sitemap.xml: The Map You Hand to Crawlers

If robots.txt tells crawlers where not to go, sitemap.xml tells them where to go—a structured list of every page you want indexed, with priority hints and modification dates. The irony: AI crawlers can also use your sitemap as a convenient shopping list. You're handing them the map to the vault.
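For reference, a minimal sitemap.xml looks like this (URLs, dates, and priorities here are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/articles/crawler-history</loc>
    <lastmod>2026-01-15</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```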

Crawling is still the right answer when it provides value back. The Internet Archive preserves human knowledge. Search engines make the web navigable. If your crawler extracts value without returning anything, you're not crawling. You're strip-mining.

The Pattern That Never Changes

Having watched technology cycles for decades, I recognize this pattern. A shared resource gets built through community effort. It works because everyone follows informal rules. Then someone figures out how to extract enormous value by breaking those rules. The commons gets enclosed. The community scrambles to create formal protections. By the time protections arrive, the damage is done.

It happened with email. Open protocol, built on trust, destroyed by spam. It happened with social media platforms that built their value on user content and then locked it behind walled gardens. Now it's happening with the web itself.

I crashed one server in 1993. I felt bad about it. I learned from it. The lesson was simple: shared resources require shared responsibility.

The companies training AI models on the open web learned a different lesson: shared resources are free inventory.

Defense in Depth: What You Can Do Now

Robots.txt won't save you from adversarial crawlers. But that doesn't mean you're helpless. Here's what actually works in production: real configs, not bullet points.

Nginx rate limiting with AI bot detection:

# /etc/nginx/conf.d/ai-bot-protection.conf

# Map AI crawler user agents to a flag
map $http_user_agent $is_ai_bot {
    default          0;
    "~*GPTBot"       1;
    "~*ClaudeBot"    1;
    "~*CCBot"        1;
    "~*Bytespider"   1;
    "~*Google-Extended" 1;
    "~*anthropic-ai" 1;
    "~*Amazonbot"    1;
}

# Shared rate-limit zone keyed by client IP: 1 request/second
limit_req_zone $binary_remote_addr zone=ai_bots:10m rate=1r/s;

server {
    # Block bots that ignore robots.txt
    if ($is_ai_bot) {
        return 403;
    }

    # Rate limit everything else aggressively
    location / {
        limit_req zone=ai_bots burst=5 nodelay;
        # ... your normal config
    }
}

If you're on Cloudflare, you can do this at the edge before traffic hits your origin:

# Cloudflare WAF Custom Rule (Dashboard → Security → WAF)
Rule name: Block AI Training Crawlers
Expression:
  (http.user_agent contains "GPTBot") or
  (http.user_agent contains "ClaudeBot") or
  (http.user_agent contains "CCBot") or
  (http.user_agent contains "Bytespider") or
  (http.user_agent contains "Google-Extended")
Action: Block
Deployed → blocks at edge, zero origin load

None of this stops a scraper using residential proxies and headless Chrome. But it stops the lazy ones, and it establishes legal standing. When you're in court, you want proof you said no in every way available to you.

The Bottom Line

The web's original social contract was simple. Crawlers index content, send traffic back, everyone benefits. Robots.txt encoded that contract into a gentleman's agreement that held for thirty years. AI companies broke it. They take content without permission, without compensation, and without sending anything back. Then their models generate content that competes with the sources they scraped.

The technical solution isn't robots.txt. It was never designed for adversarial actors with billion-dollar incentives to ignore it. The real solution is legal and economic: clear copyright frameworks for training data, mandatory licensing, and consequences for companies that treat the open web as their private dataset.

I crashed a server at Los Alamos because I was young and didn't know any better. What's their excuse?


Is Your Infrastructure Ready for AI Crawlers?

I'll audit your web infrastructure and crawler defenses before the next wave of AI bots finds the gaps.

Book Infrastructure Review

Disagree? Have a War Story?

I read every reply. If you've seen this pattern play out differently, or have a counter-example that breaks my argument, I want to hear it.

Send a Reply →