Perplexity Caught Impersonating Google: Cloudflare Sets Trap to Expose AI Scraper Tactics

A fly lies in a Venus flytrap. Reuters/DPA/Picture Alliance

In the escalating battle over AI data rights, one of the industry's rising stars just stepped on a digital landmine, and the blast was loud.

Cloudflare, the internet infrastructure giant that powers and protects roughly 20% of the web, has publicly accused AI startup Perplexity of evading web scraping protocols, impersonating Google, and engaging in deceptive data harvesting practices. The company’s sting operation, revealed in a detailed Monday blog post, has triggered shockwaves across the tech industry and raised serious ethical concerns about how AI firms collect training data.

The Setup: A Digital Honeytrap

It all started with website operators reporting suspicious activity. Despite explicitly opting out of bot access in their robots.txt files (the long-standing web convention for telling crawlers which content they may access), these sites were still seeing Perplexity's crawlers show up and access their data.
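An opt-out of this kind takes only a few lines in a site's robots.txt file. The sketch below is illustrative of such a rule, using PerplexityBot, the crawler name Perplexity publicly documents; a real site's file may list other agents as well:

```
# robots.txt — block a named crawler while allowing everything else
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

The convention is voluntary: robots.txt expresses the operator's wishes, and nothing technically stops a crawler from ignoring it, which is exactly what Cloudflare set out to test.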

So Cloudflare built a trap.

The company created fake, unpublished websites with zero public links, zero search engine indexing, and clear instructions for all bots, Perplexity's included, to stay out. These digital decoys were, in effect, invisible to legitimate web crawlers.

Yet somehow, Perplexity's AI still returned information from these hidden pages when queried. The only way that could happen? Perplexity had accessed and scraped the content despite being explicitly told not to.

The Mask: Impersonating Google Chrome

How did Perplexity get in?

After being blocked through conventional methods, Cloudflare says Perplexity disguised its crawlers as ordinary web traffic. Instead of using its known bot identifiers, it sent generic user-agent strings mimicking Google's Chrome browser on macOS, making it appear as though a human was simply browsing the page.

Requests were routed through unknown or frequently rotated IP addresses and Autonomous System Numbers (ASNs), avoiding detection by traditional bot filters.

Cloudflare CEO Matthew Prince didn't mince words. "Some supposedly 'reputable' AI companies act more like North Korean hackers," he said on X. "Time to name, shame, and hard block them."

A Stark Contrast to OpenAI

To highlight the difference in ethical standards, Cloudflare pointed to OpenAI, the company behind ChatGPT, as a model for responsible behavior.

OpenAI’s crawlers, according to Cloudflare, respect robots.txt files. When blocked, they stop crawling. No tricks. No disguises. No workarounds.
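What "respecting robots.txt" means in practice can be sketched in a few lines of Python using the standard library's `robotparser`. The bot names and rules below are illustrative, not taken from any real crawler's configuration; a compliant crawler performs this check before every request and backs off when denied:

```python
from urllib import robotparser

# Illustrative robots.txt: one named bot is banned, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(user_agent: str, url: str) -> bool:
    """Return True only if robots.txt permits this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

# A well-behaved crawler consults the rules and stops when told to.
print(may_fetch("ExampleBot", "https://example.com/page"))  # False: explicitly disallowed
print(may_fetch("OtherBot", "https://example.com/page"))    # True: no rule blocks it
```

The accusation against Perplexity is precisely that this check was skipped, or that the crawler presented a user-agent the rules could not match.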

By contrast, Perplexity's covert scraping tactics, designed to sidestep consent, represent a growing concern for content creators, publishers, and internet infrastructure firms. In a world where AI models depend on data, the way it's collected matters more than ever.

Why It Matters: The Fight for Data Control

AI startups need massive amounts of data to train their models, but web content isn't free just because it's visible. Respecting data permissions, especially robots.txt, is foundational to the open web.

What Perplexity did, Cloudflare argues, isn’t just an ethical breach; it breaks the informal contract that allows the internet to function openly and cooperatively.

Perplexity did not respond to Truth Sider’s request for comment. Google, whose Chrome browser was impersonated in the crawling process, also declined to weigh in.

Consequences and Fallout

Cloudflare has now removed Perplexity from its list of verified bots, a designation that helps identify trustworthy crawlers, and rolled out new detection tools to automatically block stealth scraping attempts across its global network.

For web publishers, the message is clear: more tools are becoming available to help enforce data boundaries. For AI companies, the message is even louder: obey the rules, or face public accountability and digital isolation.

As Matthew Prince put it: “This is about the future of the web.”
