The SEO Black Hole: Fixing Indexing Issues on Cloudflare Worker Proxied Blogs

By Ewan Mak | Tenten.co Team
You’ve done everything by the book. You set up Google Search Console, submitted a pristine sitemap.xml, and cleared your robots.txt. You’re publishing killer, high-depth content—like the pieces we’ve been working on for Neteon detailing Edge AI and industrial computing—but when you check Google, your /blog/ subdirectory is a ghost town. Zero indexed pages.
Every day those pages sit unindexed is a day you're handing high-intent, enterprise traffic straight to your competitors. If you are running a reverse proxy via Cloudflare Workers to serve your blog (like Ghost, Webflow, or WordPress) under a subdirectory, you might have accidentally built a beautiful, invisible site.
At Tenten, we partner with teams to deploy these modern, headless architectures all the time. While proxying is fantastic for user experience and keeping your root domain clean, it is notorious for causing silent SEO roadblocks.
Here is exactly why your proxied site is getting ignored by Googlebot, and how to fix it before you lose out on any more organic traction.
The Architecture: Why Proxies Confuse Crawlers
When you route traffic through yourdomain.com/blog/ using a Cloudflare Worker, the origin server generating the content usually has no idea it’s being proxied. It still thinks it lives on your-staging-url.webflow.io or blog-origin.ghost.io.
Because of this identity crisis, the origin server hands the Cloudflare Worker a bunch of mixed signals, which the Worker blindly passes on to Google. Here are the four primary culprits.
1. The Canonical URL Mismatch (The #1 Offender)
Even though users are hitting your clean subdirectory, the origin CMS is injecting its own <link rel="canonical"> tag into the <head> of the HTML.
The Problem: The tag reads href="https://[ORIGIN-DOMAIN]/post-name". Googlebot crawls your clean proxy URL, reads the canonical tag, assumes your production site is just a duplicate, and drops it from the index entirely to honor the origin domain.
The Fix: Your Cloudflare Worker must intercept the HTML response on the fly. Use Cloudflare's HTMLRewriter API to find any instance of the origin URL in the head tags and dynamically rewrite it to your production /blog/ path before it hits the browser.
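A minimal sketch of that rewrite. The element handler below is plain JavaScript; the ORIGIN_HOST and PROD_PREFIX values are placeholders you would swap for your own domains, and the HTMLRewriter wiring (commented out) only runs inside the Workers runtime:

```javascript
// Hypothetical hosts -- substitute your actual origin and production URLs.
const ORIGIN_HOST = "blog-origin.ghost.io";
const PROD_PREFIX = "https://yourdomain.com/blog";

// Handler passed to HTMLRewriter: rewrites canonical hrefs that still
// point at the hidden origin so Google sees the production URL instead.
class CanonicalRewriter {
  element(el) {
    const href = el.getAttribute("href");
    if (href && href.includes(ORIGIN_HOST)) {
      el.setAttribute("href", href.replace(`https://${ORIGIN_HOST}`, PROD_PREFIX));
    }
  }
}

// Inside the Worker's fetch handler (HTMLRewriter is provided by the
// Workers runtime, so this part is illustrative only):
// const rewritten = new HTMLRewriter()
//   .on('link[rel="canonical"]', new CanonicalRewriter())
//   .transform(originResponse);
// return rewritten;
```

Because the handler is an ordinary class, you can unit-test the URL logic outside the Workers runtime before deploying.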
2. Contaminated XML Sitemaps
If you submitted your sitemap to GSC and walked away, you might want to look closer at the URLs inside it.
The Problem: The proxy is likely just fetching the raw XML directly from the origin CMS. This means all the <loc> tags inside the sitemap are hardcoded to the origin domain. Google sees a sitemap on your main domain pointing entirely to a different server, invalidates it, and moves on.
The Fix: Your Worker needs a specific route just for sitemap.xml. It must fetch the file from the origin, run a global string replacement to swap the origin domain for your live domain, and then return the modified XML to Googlebot.
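The fetch-replace-return flow can be sketched as below. The domain constants are placeholders, and the Worker route handling (commented out) assumes the blog sitemap lives at /blog/sitemap.xml on your domain:

```javascript
// Placeholder domains -- replace with your origin CMS and live site.
const ORIGIN = "https://blog-origin.ghost.io";
const PROD = "https://yourdomain.com/blog";

// Global string replacement: swap every origin URL in the sitemap
// (including inside <loc> tags) for the production /blog/ path.
function rewriteSitemap(xml) {
  return xml.split(ORIGIN).join(PROD);
}

// In the Worker's fetch handler (sketch):
// const url = new URL(request.url);
// if (url.pathname === "/blog/sitemap.xml") {
//   const originResp = await fetch(`${ORIGIN}/sitemap.xml`);
//   const body = rewriteSitemap(await originResp.text());
//   return new Response(body, {
//     headers: { "content-type": "application/xml" },
//   });
// }
```

A split/join replacement avoids regex-escaping the dots in the domain and handles every occurrence in one pass.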
3. Rogue X-Robots-Tag HTTP Headers
Because origin servers are usually meant to stay hidden behind the main site, developers often configure them to discourage search engines (so the origin doesn't accidentally outrank the main site).
The Problem: The origin server might be passing a hidden HTTP header:
X-Robots-Tag: noindex. When the Worker proxies the request, it passes all origin headers straight through. Googlebot reads this invisible header—even if the visible HTML is perfectly optimized—and refuses to index the page.The Fix: Inside your Worker script, intercept the
Responseobject and explicitly delete or overwrite theX-Robots-Tagheader before passing the response to the client.
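One way to do that deletion, sketched below. Headers on a fetched Response are immutable in the Workers runtime, so the usual pattern is to clone them, strip the rogue tag, and build a fresh Response (the Worker wiring is shown commented out):

```javascript
// Clone the origin's headers and drop the rogue noindex directive.
// Header names are case-insensitive, so "x-robots-tag" matches
// "X-Robots-Tag" as sent by the origin.
function stripRobotsHeader(headers) {
  const clean = new Headers(headers);
  clean.delete("x-robots-tag");
  return clean;
}

// In the Worker's fetch handler (sketch):
// const originResp = await fetch(originUrl);
// return new Response(originResp.body, {
//   status: originResp.status,
//   headers: stripRobotsHeader(originResp.headers),
// });
```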
4. Overzealous Bot Fight Mode
Cloudflare's security features are top-tier, but they can sometimes be a little too aggressive.
The Problem: If you have "Bot Fight Mode" or strict WAF rules enabled, Cloudflare might occasionally serve a JavaScript challenge to Googlebot instead of your blog's HTML. If Googlebot hits a captcha, it can't read your content.
The Fix: Review your WAF rules to ensure known, verified bots (like Googlebot) are explicitly bypassed or allowed through the specific route your Worker operates on.
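As a rough illustration, a custom WAF rule along these lines lets verified crawlers through the proxied path (this is a dashboard-configured rule, not Worker code, and the /blog/ path is an assumption based on the setup above):

```
# Cloudflare dashboard → Security → WAF → Custom rules
# Expression: verified bots (Googlebot, Bingbot, etc.) hitting the blog route
(cf.client.bot and starts_with(http.request.uri.path, "/blog/"))
# Action: Skip — bypass Bot Fight Mode / challenge rules for this match
```

The cf.client.bot field matches only bots Cloudflare has verified by IP, so this does not open the door to crawlers merely spoofing a Googlebot user agent.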
Stop Missing Out on Organic Traffic
Having a technically sound proxy isn't just a nice-to-have; it's the gatekeeper to your site's visibility. Search engines won't wait around while your origin server and proxy figure out how to talk to each other.
By implementing HTMLRewriter for your canonicals, sanitizing your sitemaps on the fly, and stripping rogue headers, you can open the floodgates and get those high-value pages indexed.