I wanted to cache everything behind Cloudflare and stop worrying about PHP, MySQL, and plugin drift on my backup site cosmin.gq. I tried one-click export plugins (both the free and paid Simply Static), but I couldn’t get a clean export that behaved like the live site. So I rolled up my sleeves and wrote a Bash script that:
- crawls the site using the official sitemaps (including .gz children),
- mirrors HTML + assets incrementally,
- cleans up quirks like ?p= permalinks,
- generates a small search index (no dependencies, no bs4),
- adds a lightweight lightbox for images,
- and serves it all from Cloudflare Workers Static Assets with a smart router that understands /path, /path/, /path.html, and /path/index.html.
Below is the story of how each section of the script came to be, what went wrong, what I learned, and why each piece is there.
Why not “just use a plugin”?
I started with Simply Static (free) and then bought the pro version. It got me a partial archive, but:
- Some internal links still pointed back to dynamic WordPress routes (?p=123 or search params).
- A lot of assets didn’t come along for the ride.
- “Clean URLs” versus index.html handling wasn’t consistent once deployed behind a CDN.
After a few rounds of chasing broken links, I decided to build my own exporter so I could control each step. That’s when this script was born.
Config (override via flags or env)
I wanted a single file that I could run anywhere and reuse for staging vs. production. The very top sets the defaults and lets me flip them via flags or environment variables:
- SITE_ROOT – where to crawl from (e.g., https://cosmin.us)
- OUTDIR – where to put the static bundle (public)
- SERVICE_NAME, COMPAT_DATE – Cloudflare Worker config
- EXTRA_SEEDS_DEFAULT – manual seeds for archive pages /page/2/ … /page/14/
# =========================
# Config (override via flags or env)
# =========================
SITE_ROOT="${SITE_ROOT:-https://cosmin.us}"
OUTDIR="${OUTDIR:-public}"
SERVICE_NAME="${SERVICE_NAME:-cosmin-us-worker}"   # Worker service name
COMPAT_DATE="${COMPAT_DATE:-2025-08-21}"
Export jobs are long-running, and I didn’t want to hardcode anything. The archive seeds ensure the crawler still reaches older posts even when the sitemaps don’t list every paginated archive page.
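For example, a staging export might be kicked off like this. The script filename and the staging values are placeholders of mine, not something from the real setup; everything can also be left at the defaults:
# Placeholder filename and values; every variable falls back to its default.
SITE_ROOT="https://staging.cosmin.us" \
OUTDIR="public-staging" \
SERVICE_NAME="cosmin-us-staging" \
./make-static.sh --clean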
Dependency check
# ===============
# Dependency check
# ===============
need() { command -v "$1" >/dev/null 2>&1 || { echo "Missing dependency: $1"; exit 1; }; }
need curl
need python3
need wget
need gunzip
need node
need npx
need perl
I got tired of “command not found” halfway through. The script fails fast with a friendly message if a tool isn’t present.
Fresh workspace (optional)
--clean
# ==================
# Fresh workspace (optional)
# ==================
if [[ $CLEAN -eq 1 ]]; then
echo "==> CLEAN mode: removing $OUTDIR and temp files"
rm -rf "$OUTDIR"
fi
If I pass --clean, it wipes public/ and starts fresh.
Incremental runs are great, but sometimes you want to start over (e.g., after changing permalink logic).
Build urls.txt with a robust Python parser (handles sitemap index + .gz)
WordPress often exposes a sitemap index that points at multiple sitemaps (sometimes compressed). The Python snippet:
- fetches …/sitemap_index.xml or …/sitemap.xml,
- if it’s an index, it follows every child sitemap (plain or gzipped),
- extracts every <loc>,
- filters out off-domain URLs, feeds, AMP, embeds, and binary files,
- and adds extra archive pages (/page/2/…/page/14/) so nothing gets missed.
# ==================
# Build urls.txt with robust Python parser (handles sitemap index + .gz)
# ==================
echo "==> Parsing sitemap(s) to build urls.txt"
python3 - "$SITE_ROOT" > urls.txt <<'PY'
import sys, io, gzip, re, urllib.request, xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITE_ROOT = sys.argv[1].rstrip("/")
DOMAIN = urlparse(SITE_ROOT).netloc

def fetch(url: str) -> bytes:
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (StaticMirror)"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

def xml_urls_from_bytes(data: bytes):
    urls = []
    try:
        it = ET.iterparse(io.BytesIO(data))
        for _, el in it:
            if el.tag.endswith("loc") and el.text:
                urls.append(el.text.strip())
    except Exception:
        urls += re.findall(rb"<loc>\s*([^<\s]+)\s*</loc>", data, flags=re.I)
        urls = [u.decode("utf-8", "ignore") for u in urls]
    return urls

def expand_sitemaps(root_index_url: str):
    out = set()
    try_urls = [root_index_url, f"{SITE_ROOT}/sitemap.xml"]
    index_data = None
    for u in try_urls:
        try:
            index_data = fetch(u); break
        except Exception:
            continue
    if not index_data:
        sys.exit(2)
    text = index_data[:2000].lower()
    if b"<sitemapindex" in text:
        for sm in xml_urls_from_bytes(index_data):
            try:
                data = fetch(sm)
            except Exception:
                try:
                    data = gzip.decompress(fetch(sm))
                except Exception:
                    continue
            if data[:2] == b"\x1f\x8b":  # gzipped child sitemap: decompress before parsing
                try:
                    data = gzip.decompress(data)
                except Exception:
                    continue
            out.update(xml_urls_from_bytes(data))
    else:
        out.update(xml_urls_from_bytes(index_data))
    return sorted(out)

urls = expand_sitemaps(f"{SITE_ROOT}/sitemap_index.xml")

# Filter to same-domain, canonical pages (no query, feed, amp, embed, skip typical binaries)
filtered = []
for u in urls:
    try:
        pu = urlparse(u)
        if pu.netloc != DOMAIN:
            continue
        if pu.query:
            continue
        path = pu.path or "/"
        if re.search(r"/(feed|amp|embed)/?$", path, flags=re.I):
            continue
        if re.search(r"\.(jpg|jpeg|png|gif|webp|svg|pdf|mp4|mov|mp3|zip)$", path, flags=re.I):
            continue
        filtered.append(u)
    except Exception:
        pass

# Add deep archive seeds up to /page/14/
extra = [f"{SITE_ROOT}/"] + [f"{SITE_ROOT}/page/{i}/" for i in range(2, 15)]
for e in extra:
    if e not in filtered:
        filtered.append(e)

for u in sorted(set(filtered)):
    print(u)
PY

TOTAL_URLS=$(wc -l < urls.txt | tr -d ' ')
echo "==> Total canonical URLs to fetch: $TOTAL_URLS"
if [[ "$TOTAL_URLS" -eq 0 ]]; then
  echo "!! No URLs produced from sitemap(s); aborting."
  exit 1
fi
echo "==> Sample of urls.txt:"
head -n 10 urls.txt || true
echo "..."
Earlier attempts only mirrored the homepage and recent posts. The sitemap index + .gz expansion ensures a complete URL list that represents your real content.
Download pages + assets (resilient; non-fatal failures)
This was a big one. My first approach used wget with set -e, which meant one flaky URL could stop the whole run. I wrapped the download step like this:
- Temporarily disable set -e around wget so we never abort the script.
- Use --page-requisites + --convert-links + --adjust-extension for a faithful offline copy.
- Allow common CDN hosts via --span-hosts and --domains=….
- Add retries and timeouts.
- Log to public/wget.log and show a quick error summary at the end.
# ==================
# Download pages + assets (resilient; non-fatal failures)
# ==================
echo "==> Download pages + assets (incremental with timestamping)"
WGET_LOG="$OUTDIR/wget.log"
# Temporarily disable -e so wget errors don't abort the script
set +e
wget -e robots=off \
  --directory-prefix="$OUTDIR" \
  --page-requisites \
  --convert-links \
  --adjust-extension \
  --no-host-directories \
  --trust-server-names \
  --input-file=urls.txt \
  --reject-regex '.*\?(p|replytocom|s)=.*' \
  --user-agent="Mozilla/5.0 (StaticMirror via wget)" \
  --timestamping \
  --span-hosts \
  --domains="$DOMAIN,i0.wp.com,code.jquery.com,cdnjs.cloudflare.com,cdn.jsdelivr.net,unpkg.com,fonts.googleapis.com,fonts.gstatic.com,ajax.googleapis.com,stackpath.bootstrapcdn.com" \
  --continue \
  --retry-connrefused \
  --tries=3 \
  --waitretry=5 \
  --timeout=10 \
  -nv -o "$WGET_LOG"
WGET_EXIT=$?
set -e
if [[ $WGET_EXIT -ne 0 ]]; then
  echo "!! wget returned exit code $WGET_EXIT — continuing (non-fatal). See $WGET_LOG for details."
fi
The internet is messy. A single CDN hiccup shouldn’t kill a 1-hour export.
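The “quick error summary” mentioned above isn’t part of the snippet; it amounts to grepping the log. A rough sketch of what that step can look like (my approximation, matching the log format wget -nv produces):
# Sketch of the error summary step: surface HTTP errors and failures from the log.
if [[ -s "$WGET_LOG" ]]; then
  echo "==> Errors reported in $WGET_LOG:"
  grep -Ei 'ERROR [0-9]{3}|failed|unable to' "$WGET_LOG" | tail -n 20 || true
fi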
Safety: remove any weird files containing ?p=
# Safety: remove any weird files containing '?p='
echo "==> Cleanup any query-style pages"
find "$OUTDIR" -type f -name '*\?p=*' -print -delete || true
Legacy WordPress permalinks sometimes appear as query URLs. We don’t want those saved as literal filenames because they can’t be routed cleanly by a static server.
Build expected HTML manifest from URLs
The sitemap gives us canonical URLs; the mirror produces a bunch of files. This block maps URLs to files we expect to exist:
- / → public/index.html
- /foo/ → public/foo/index.html
- /foo → public/foo.html
# ==================
# Build expected HTML manifest from URLs
# ==================
echo "==> Building expected HTML manifest"
MANIFEST="$OUTDIR/.manifest.txt"
> "$MANIFEST"
while read -r url; do
path="${url#"$SITE_ROOT"}"
path="${path#/}"
if [[ -z "$path" ]]; then
echo "$OUTDIR/index.html" >> "$MANIFEST"
else
if [[ "$path" =~ /$ ]]; then
echo "$OUTDIR/${path%/}/index.html" >> "$MANIFEST"
else
echo "$OUTDIR/${path}.html" >> "$MANIFEST"
fi
fi
done < urls.txt
It then writes a manifest file we can use to spot “stale” HTML files.
Mirrors are noisy. I wanted a deterministic list of what should exist so I could prune anything that no longer maps to a published URL.
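The same manifest also makes it easy to spot gaps in the other direction. This little check is my own addition, not part of the script:
# My addition: list pages the sitemap promised that the mirror never produced.
comm -13 <(find "$OUTDIR" -type f -name '*.html' | sort -u) \
         <(grep '\.html$' "$MANIFEST" | sort -u) | sed 's/^/missing: /'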
Always keep 404 + control files
I make sure 404.html exists and add _headers if missing.
# Always keep 404 + control files
echo "$OUTDIR/404.html" >> "$MANIFEST"
[[ -f "$OUTDIR/_headers" ]] && echo "$OUTDIR/_headers" >> "$MANIFEST"
[[ -f "$OUTDIR/_redirects" ]] && echo "$OUTDIR/_redirects" >> "$MANIFEST"
sort -u "$MANIFEST" -o "$MANIFEST"
# cross-platform sed inline
sed -i.bak '/^[[:space:]]*$/d' "$MANIFEST" 2>/dev/null || true
rm -f "$MANIFEST.bak"
Why this exists: Cloudflare Workers Static Assets can serve a custom 404. A small _headers file helps with caching hints. The 404 page also doubles as a friendly “oops” if a route slips through.
Create 404 + headers if missing
# ==================
# Create 404 + headers if missing
# ==================
if [[ ! -f "$OUTDIR/404.html" ]]; then
echo "==> Creating $OUTDIR/404.html"
cat > "$OUTDIR/404.html" <<'HTML'
<!doctype html><meta charset="utf-8">
<title>404</title>
<style>
body{font:16px/1.4 system-ui,-apple-system,Segoe UI,Roboto,Arial,sans-serif;padding:4rem;max-width:60ch;margin:auto;}
h1{font-size:2rem;margin-bottom:.25rem}
a{color:inherit}
</style>
<h1>Not found</h1>
<p>Sorry, we couldn’t find that page. <a href="/">Go home →</a></p>
HTML
fi
if [[ ! -f "$OUTDIR/_headers" ]]; then
echo "==> Creating $OUTDIR/_headers (basic caching)"
cat > "$OUTDIR/_headers" <<'H'
/*
  Cache-Control: public, max-age=3600
H
fi
If you’ve never run the script, it auto-creates a minimal 404.html and _headers.
Running this on a fresh machine should “just work” without a pile of prerequisite files.
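If you later want finer-grained caching, the same _headers mechanism extends naturally. This variant is a suggestion of mine, not what the script writes; adjust the max-age values to taste:
# Suggested (not generated by the script): longer cache for the injected assets,
# shorter cache for the search index, in the same _headers format.
cat > "$OUTDIR/_headers" <<'H'
/*
  Cache-Control: public, max-age=3600
/_lightbox/*
  Cache-Control: public, max-age=86400
/search-index.json
  Cache-Control: public, max-age=600
H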
Prune removed HTML pages (keep assets to be safe)
# ==================
# Prune removed HTML pages (keep assets to be safe)
# ==================
echo "==> Pruning HTML files that disappeared from the sitemap"
mapfile -t ALL_HTML < <(find "$OUTDIR" -type f -name '*.html' | sort -u)
if [[ ${#ALL_HTML[@]} -gt 0 ]]; then
printf "%s\n" "${ALL_HTML[@]}" | { grep -vxFf "$MANIFEST" || true; } | while read -r stale; do
[[ -z "$stale" ]] && continue
echo "Deleting stale: $stale"
rm -f "$stale"
# Remove empty parent dirs up to OUTDIR
dir="$(dirname "$stale")"
while [[ "$dir" != "$OUTDIR" && "$dir" != "/" ]]; do
rmdir "$dir" 2>/dev/null || break
dir="$(dirname "$dir")"
done
done
fi
Using the manifest, we delete HTML pages that no longer appear in the sitemap (but we don’t delete images/CSS/JS to avoid breaking older links in posts). We also remove empty directories after deleting.
The site changes. Pruning avoids shipping dead pages forever while keeping assets conservative and safe.
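If you’d rather preview what pruning will remove before it happens, the same manifest comparison works as a dry run (again, my addition rather than part of the script):
# My addition: print stale HTML files without deleting anything.
find "$OUTDIR" -type f -name '*.html' | sort -u \
  | { grep -vxFf "$MANIFEST" || true; } | sed 's/^/would delete: /'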
Worker source + config (Workers Static Assets)
I switched to Cloudflare Workers instead of Pages, and bound the static bundle as an asset directory:
# Wrangler config for Workers + Static Assets
cat > wrangler.toml <<TOML
name = "${SERVICE_NAME}"
main = "src/worker.mjs"
assets = { directory = "${OUTDIR}", binding = "ASSETS" }
compatibility_date = "${COMPAT_DATE}"
TOML
Workers gives me a programmable edge (router, fallbacks) while letting the CDN serve my static files fast.
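The deploy step itself isn’t shown in this excerpt, but with that wrangler.toml in place it comes down to the standard Wrangler command (which is presumably what the script runs via npx):
# Deploy the Worker and upload the assets directory declared in wrangler.toml.
npx wrangler deploy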
Enhanced router: resolves /path, /path/, /path.html, /path/index.html
The Worker code is small but crucial. It tries a list of candidates in order:
- exact path,
- add trailing slash,
- append .html,
- try …/index.html,
- and vice-versa (if you requested .html, also try the clean versions).
cat > src/worker.mjs <<'JS'
export default {
async fetch(request, env) {
const orig = new URL(request.url);
const tryPaths = async (paths) => {
for (const p of paths) {
const url = new URL(request.url);
url.pathname = p;
const res = await env.ASSETS.fetch(new Request(url, request));
if (res.status !== 404) return res;
}
return new Response(null, { status: 404 });
};
const p = orig.pathname;
const hasDot = p.includes(".");
const endsSlash = p.endsWith("/");
const candidates = [];
candidates.push(p); // exact
if (!hasDot) { // /foo
if (!endsSlash) candidates.push(p + "/"); // /foo/
candidates.push(p + ".html"); // /foo.html
candidates.push((endsSlash ? p : p + "/") + "index.html"); // /foo/index.html
}
if (endsSlash) { // /foo/
candidates.push(p + "index.html");
}
if (p.endsWith(".html")) { // /foo.html or /foo/index.html
const base = p.replace(/\/index\.html$/i, "/").replace(/\.html$/i, "/");
candidates.push(base);
candidates.push(base + "index.html");
}
if (p.endsWith("/index.html")) { // /foo/index.html
const clean = p.replace(/\/index\.html$/i, "/");
candidates.push(clean);
}
const res = await tryPaths(candidates);
if (res.status === 404) {
return env.ASSETS.fetch(new Request(new URL("/404.html", orig), request));
}
return res;
}
};
JS
Finally, it falls back to 404.html.
A static export will naturally produce both foo.html and foo/index.html patterns depending on how links were written. The router makes them interchangeable, which eliminates 404s when a theme mixes formats.
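After deploying, a quick curl loop confirms that the four spellings of a path resolve to the same page. The hostname and path below are placeholders, not my real deployment:
# Placeholder hostname and path; all four variants should return the same status.
BASE="https://cosmin-us-worker.example.workers.dev"
for p in /about /about/ /about.html /about/index.html; do
  printf '%-22s %s\n' "$p" "$(curl -s -o /dev/null -w '%{http_code}' "$BASE$p")"
done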
Lightweight lightbox
I wanted to keep the “click image to expand” behavior without depending on jQuery or a big gallery library. The lightbox:
- injects a tiny CSS + JS pair into all HTML,
- recognizes links to images or bare <img> tags,
- shows them in an overlay with Prev/Next,
- supports keyboard navigation and ESC to close.
# ==================
# Lightweight lightbox (uses $OUTDIR)
# ==================
echo "==> Installing lightweight lightbox"
mkdir -p "$OUTDIR/_lightbox"

# CSS
cat > "$OUTDIR/_lightbox/lightbox.css" <<'CSS'
.lb-overlay{position:fixed;inset:0;background:rgba(0,0,0,.9);display:none;align-items:center;justify-content:center;z-index:9999}
.lb-overlay.open{display:flex}
.lb-img{max-width:96vw;max-height:92vh;box-shadow:0 10px 40px rgba(0,0,0,.6)}
.lb-close{position:fixed;top:14px;right:16px;font:600 16px/1 system-ui,-apple-system,Segoe UI,Roboto,Arial,sans-serif;color:#fff;background:rgba(255,255,255,.15);border:0;border-radius:6px;padding:10px 12px;cursor:pointer}
.lb-nav{position:fixed;top:50%;transform:translateY(-50%);display:flex;gap:6px}
.lb-prev,.lb-next{font:600 16px/1 system-ui,-apple-system,Segoe UI,Roboto,Arial,sans-serif;color:#fff;background:rgba(255,255,255,.15);border:0;border-radius:6px;padding:10px 12px;cursor:pointer}
.lb-prev{left:16px;position:fixed}
.lb-next{right:16px;position:fixed}
CSS

# JS
cat > "$OUTDIR/_lightbox/lightbox.js" <<'JS'
(() => {
  const IMG_RE = /\.(png|jpe?g|gif|webp|bmp|svg)(\?.*)?$/i;
  const overlay = document.createElement('div');
  overlay.className = 'lb-overlay';
  overlay.innerHTML = `
    <button class="lb-close" aria-label="Close">Close ✕</button>
    <div class="lb-nav">
      <button class="lb-prev" aria-label="Previous">‹ Prev</button>
      <button class="lb-next" aria-label="Next">Next ›</button>
    </div>
    <img class="lb-img" alt="">
  `;
  document.documentElement.appendChild(overlay);

  const imgEl = overlay.querySelector('.lb-img');
  const btnClose = overlay.querySelector('.lb-close');
  const btnPrev = overlay.querySelector('.lb-prev');
  const btnNext = overlay.querySelector('.lb-next');

  const candidates = Array.from(document.querySelectorAll('a[href], img'));
  const items = [];
  const indexByEl = new Map();
  candidates.forEach(el => {
    if (el.tagName === 'A' && IMG_RE.test(el.getAttribute('href')||'')) {
      items.push({el, src: el.href});
      indexByEl.set(el, items.length-1);
    } else if (el.tagName === 'IMG' && IMG_RE.test(el.getAttribute('src')||'')) {
      if (!el.closest('a[href]')) {
        items.push({el, src: el.src});
        indexByEl.set(el, items.length-1);
      }
    }
  });

  let idx = -1;
  function openAt(i){
    if(i<0||i>=items.length) return;
    idx=i;
    imgEl.src=items[idx].src;
    overlay.classList.add('open');
    document.body.style.overflow='hidden';
    const multi = items.length>1;
    btnPrev.style.display = multi?'':'none';
    btnNext.style.display = multi?'':'none';
  }
  function close(){
    overlay.classList.remove('open');
    document.body.style.overflow='';
    imgEl.removeAttribute('src');
  }
  function next(){ if(items.length>0) openAt((idx+1)%items.length); }
  function prev(){ if(items.length>0) openAt((idx-1+items.length)%items.length); }

  document.addEventListener('click', (e) => {
    if (overlay.contains(e.target)) return; // ignore clicks inside the lightbox itself
    const a = e.target.closest('a[href]');
    if (a && IMG_RE.test(a.getAttribute('href')||'')) {
      e.preventDefault();
      openAt(indexByEl.get(a));
    } else if (e.target.tagName === 'IMG' && IMG_RE.test(e.target.getAttribute('src')||'') && !e.target.closest('a[href]')) {
      e.preventDefault();
      openAt(indexByEl.get(e.target));
    }
  });
  overlay.addEventListener('click', (e) => { if (e.target === overlay) close(); });
  btnClose.addEventListener('click', close);
  btnNext.addEventListener('click', next);
  btnPrev.addEventListener('click', prev);
  document.addEventListener('keydown', (e) => {
    if (!overlay.classList.contains('open')) return;
    if (e.key === 'Escape') close();
    if (e.key === 'ArrowRight') next();
    if (e.key === 'ArrowLeft') prev();
  });
})();
JS

echo "==> Injecting lightbox tags into HTML"
find "$OUTDIR" -type f -name '*.html' | while read -r f; do
  if grep -q '_lightbox/lightbox.js' "$f"; then continue; fi
  if grep -qi '</head>' "$f"; then
    perl -0777 -pe 's#</head># <link rel="stylesheet" href="/_lightbox/lightbox.css">\n <script src="/_lightbox/lightbox.js" defer></script>\n</head>#i' -i "$f"
  else
    printf '\n<link rel="stylesheet" href="/_lightbox/lightbox.css">\n<script src="/_lightbox/lightbox.js" defer></script>\n' >> "$f"
  fi
done
Some of my posts rely on bigger images. A tiny dependency-free lightbox preserves that UX while keeping the static bundle small.
Client-side search (no bs4)
I still wanted a search box that felt native. Serverless means client-side:
- A Python stdlib script walks every HTML file, strips scripts/styles, extracts the <title> and text, and writes search-index.json (no bs4).
- A small /search/ page loads that JSON and does a straightforward token match:
  - title hits score 5×,
  - content hits score 1×,
  - requires that all query tokens appear at least once,
  - shows top 50 with highlighted snippets.
- A tiny “search bootstrap” script rewires any WP search forms to submit to /search/?s=….
# ==================
# Client-side search (no bs4)
# ==================
echo "==> Creating client-side search assets"
mkdir -p "$OUTDIR/_search" "$OUTDIR/search"

# Rewire WP-style search forms to /search/?s=...
cat > "$OUTDIR/_search/search-boot.js" <<'JS'
(() => {
  function hookForms(){
    const forms = document.querySelectorAll('form');
    forms.forEach(f => {
      const q = f.querySelector('input[name="s"], input[type="search"]');
      if (!q) return;
      f.method = 'GET';
      f.action = '/search/';
      q.name = 's';
      if (!q.placeholder) q.placeholder = 'Search...';
    });
  }
  if (document.readyState === 'loading') {
    document.addEventListener('DOMContentLoaded', hookForms);
  } else {
    hookForms();
  }
})();
JS

# /search/ page
cat > "$OUTDIR/search/index.html" <<'HTML'
<!doctype html>
<meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1">
<title>Search</title>
<style>
body{font:16px/1.5 system-ui,-apple-system,Segoe UI,Roboto,Arial,sans-serif;max-width:72ch;margin:2rem auto;padding:0 1rem}
form{display:flex;gap:.5rem;margin-bottom:1rem}
input[type=search]{flex:1;padding:.6rem;border:1px solid #ccc;border-radius:6px}
button{padding:.6rem .9rem;border:1px solid #ccc;border-radius:6px;background:#f7f7f7;cursor:pointer}
.res{padding:1rem 0;border-top:1px solid #eee}
.res h3{margin:.2rem 0 .3rem;font-size:1.05rem}
.res p{margin:.2rem 0;color:#444}
.muted{color:#666;font-size:.9rem}
</style>
<h1>Search</h1>
<form action="/search/" method="GET">
  <input type="search" name="s" placeholder="Search..." />
  <button type="submit">Search</button>
</form>
<div id="status" class="muted">Type a query and press Enter.</div>
<div id="results"></div>
<script>
(function(){
  const $ = s => document.querySelector(s);
  const params = new URLSearchParams(location.search);
  const q = (params.get('s')||'').trim();
  const status = $('#status');
  const resultsEl = $('#results');
  const input = document.querySelector('input[name="s"]');
  if (q) input.value = q;

  const STOP = new Set("a an the and or to of in on for with at from by is are was were be been being as it its not no this that these those you your me my we our they their i he she them his her".split(' '));
  const tok = t => (t||'').toLowerCase().replace(/[^\p{L}\p{N}\s]/gu,' ').split(/\s+/).filter(w => w && !STOP.has(w));

  async function run(){
    if (!q){ status.textContent='Type a query and press Enter.'; return; }
    status.textContent='Searching…';
    const qtok = tok(q);
    if (!qtok.length){ status.textContent='Try a longer query.'; return; }
    let idx;
    try {
      const r = await fetch('/search-index.json', {cache:'no-store'});
      idx = await r.json();
    } catch(e){
      status.textContent='Could not load the search index.';
      return;
    }
    const hit = [];
    for (const d of idx){
      const t = d.title_lc, c = d.content_lc;
      let ok = true, score = 0;
      for (const term of qtok){
        const rg = new RegExp("\\b"+term.replace(/[.*+?^${}()|[\]\\]/g,'\\$&')+"\\b","g");
        const tc = (t.match(rg)||[]).length, cc = (c.match(rg)||[]).length;
        if ((tc+cc)===0){ ok=false; break; }
        score += tc*5 + cc*1;
      }
      if (ok) hit.push({d,score});
    }
    hit.sort((a,b)=>b.score-a.score);
    const top = hit.slice(0,50);
    resultsEl.innerHTML = '';
    if (!top.length){ status.textContent='No results.'; return; }
    status.textContent = `Found ${hit.length} result${hit.length>1?'s':''}. Showing top ${top.length}.`;
    for (const {d} of top){
      const div = document.createElement('div');
      div.className='res';
      div.innerHTML = `
        <h3><a href="${d.url}">${escapeHTML(d.title||d.url)}</a></h3>
        <div class="muted">${d.url}</div>
        <p>${snippet(d.content, qtok, 220)}</p>`;
      resultsEl.appendChild(div);
    }
    function escapeHTML(s){return (s||'').replace(/[&<>"]/g,m=>({'&':'&amp;','<':'&lt;','>':'&gt;','"':'&quot;'}[m]));}
    function snippet(text, terms, len){
      const l = text.toLowerCase();
      let pos = Infinity;
      for (const t of terms){ const i=l.indexOf(t); if (i>=0) pos = Math.min(pos,i); }
      if (!isFinite(pos)) pos = 0;
      const start = Math.max(0, pos - Math.floor(len/3));
      let sn = text.slice(start, start+len);
      terms.forEach(t=>{
        const r=new RegExp('('+t.replace(/[.*+?^${}()|[\]\\]/g,'\\$&')+')','ig');
        sn = sn.replace(r,'<mark>$1</mark>');
      });
      if (start>0) sn = '…'+sn;
      if ((start+len)<text.length) sn += '…';
      return sn;
    }
  }
  run();
})();
</script>
HTML

echo "==> Building search-index.json"
python3 - <<'PY'
import os, re, json
from html.parser import HTMLParser

OUTDIR = os.environ.get('OUTDIR','public')
SKIP_DIRS = {'_lightbox','_search'}
MAX_CHARS = 8000

class Stripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hide = 0; self.text=[]
    def handle_starttag(self, tag, attrs):
        if tag in ('script','style','noscript','iframe'): self.hide += 1
    def handle_endtag(self, tag):
        if tag in ('script','style','noscript','iframe') and self.hide>0: self.hide -= 1
    def handle_data(self, data):
        if not self.hide and data.strip(): self.text.append(data.strip())

def clean(s):
    return re.sub(r'\s+',' ', s).strip()

docs=[]
for root, dirs, files in os.walk(OUTDIR):
    dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
    for f in files:
        if not f.endswith('.html'): continue
        path = os.path.join(root,f)
        rel = os.path.relpath(path, OUTDIR).replace('\\','/')
        if rel == 'index.html': url = '/'
        elif rel.endswith('/index.html'): url = '/' + rel[:-11] + '/'
        else: url = '/' + rel
        try:
            html = open(path,'r',encoding='utf-8',errors='ignore').read()
        except Exception:
            continue
        m = re.search(r'<title[^>]*>(.*?)</title>', html, flags=re.I|re.S)
        title = clean(m.group(1)) if m else ''
        p = Stripper()
        try:
            p.feed(html)
        except Exception:
            pass
        content = clean(' '.join(p.text))[:MAX_CHARS]
        if not title and not content: continue
        docs.append({"url":url,"title":title,"title_lc":title.lower(),"content_lc":content.lower(),"content":content[:600]})

# de-dup by url (keep longer)
by = {}
for d in docs:
    if (d["url"] not in by) or (len(d["content_lc"])>len(by[d["url"]]["content_lc"])):
        by[d["url"]]=d
final = [by[k] for k in sorted(by.keys())]
open(os.path.join(OUTDIR,'search-index.json'),'w',encoding='utf-8').write(json.dumps(final,ensure_ascii=False,separators=(',',':')))
PY

echo "==> Injecting search bootstrap into HTML"
find "$OUTDIR" -type f -name '*.html' | while read -r f; do
  grep -q '/_search/search-boot.js' "$f" && continue
  if grep -qi '</head>' "$f"; then
    perl -0777 -pe 's#</head># <script src="/_search/search-boot.js" defer></script>\n</head>#i' -i "$f"
  else
    printf '\n<script src="/_search/search-boot.js" defer></script>\n' >> "$f"
  fi
done
I didn’t want to lose search after going static. This approach is fast, privacy-friendly (no third-party API), and works great for a few thousand posts.
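One quick way to sanity-check the index before deploying (my own habit, not part of the script) is to ask python3 how many documents it contains; this assumes the default OUTDIR of public:
# Count indexed documents and peek at the first entry (assumes OUTDIR=public).
python3 - <<'PY'
import json
docs = json.load(open("public/search-index.json", encoding="utf-8"))
print(len(docs), "documents indexed")
if docs:
    print("first entry:", docs[0]["url"], "-", docs[0]["title"][:60])
PY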
What I tried that didn’t work (and why I changed it)
- Plugin exports: got partial results, but internal link hygiene and pagination were inconsistent at scale.
- Strict wget + set -e: a single 404/timeout nuked the run. I switched to tolerant logging with retries.
- Serving only one URL pattern: real-world content had both foo.html and foo/. The Worker router now tries them all.
- Search server-side: not an option in a pure static setup. The client-side index solved it without dependencies.
Takeaways
- Own the crawl: sitemaps + manual archive seeds gave me full coverage.
- Be resilient: network hiccups are normal; don’t let them kill the job.
- Normalize routes: a flexible router is the difference between “almost works” and “rock solid”.
- Keep UX touches: a tiny lightbox and client-side search make a static site feel dynamic without servers.