diff --git a/PLAN.md b/PLAN.md
index 61a9b7c..c871648 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -114,102 +114,39 @@ Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggreg
 
 **Done when:** 100K hosts in the database with valid WARC coordinates.
 
-### Step 1.3: Validate WARC Coordinates
+### Step 1.3: Validate WARC Coordinates [COMPLETED]
 
-Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:
-
-```bash
-# Pick a random row
-psql -c "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;"
-
-# Fetch it with curl byte-range
-curl -r $OFFSET-$((OFFSET + LENGTH - 1)) "https://data.commoncrawl.org/$WARC_FILENAME" | head -c 500
-```
-
-Should see a WARC record header followed by HTTP response headers and HTML.
-
-**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
+Manually fetched WARC records with curl byte-range requests to `data.commoncrawl.org`. Confirmed valid WARC headers, HTTP response, and HTML with `<title>` and `<link rel="icon">` tags.
 
 ---
 
-## Phase 2: WARC Parsing (Stage 2)
+## Phase 2: WARC Parsing (Stage 2) [COMPLETED]
 
-### Step 2.1: Go Project Setup
+### Steps 2.1-2.3 [COMPLETED]
 
-Set up the shared Go module and the WARC parser binary:
+Binary: `pipeline/02_warc_parse/` (6 files: main.go, warc.go, parser.go, process.go, db.go, log.go)
 
-```
-pipeline/02_warc_parse/
-├── main.go        # Entry point, CLI flags, orchestration
-├── warc.go        # WARC record fetching (S3 byte-range)
-├── parser.go      # HTML parsing (title, link rel=icon, iframe headers)
-└── db.go          # Postgres batch read/write
-```
+**Architecture:**
+- Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
+- Parses WARC records with `github.com/nlnwa/gowarc/v3`
+- Parses HTML with `golang.org/x/net/html` tokenizer (lenient, stops at `<body>`)
+- Detects charset via `golang.org/x/net/html/charset` and converts to UTF-8
+- Sanitizes titles with `strings.ToValidUTF8` as final safety net
+- Concurrent goroutine pool with configurable concurrency
+- Per-host log lines to stdout + optional log file
+- Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
+- DB errors tracked and logged with `DB_ERROR:` prefix
 
-Dependencies:
-- `github.com/nlnwa/gowarc/v3` — WARC record parser (actively maintained, v3.1.0, handles record envelope + HTTP response extraction correctly)
-- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
-- `golang.org/x/net/html` — Lenient HTML parser
-- `github.com/schollz/progressbar/v3` — Progress bar with ETA, rate, counters
-- Standard library `net/http` for S3 byte-range requests
+**CLI:** `./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`
 
-CLI flags:
-- `--db` connection string
-- `--batch-size` (default 500)
-- `--concurrency` (default 1000)
-- `--dry-run` (print parsed results, don't write to DB)
-- `--limit` (process at most N rows, for testing)
-
-All Go programs display a live progress bar showing: items processed, items/sec, ETA, error count. On completion, print a summary with total duration.
-
-**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
-
-### Step 2.2: WARC Fetch + Parse Logic
-
-Implement:
-1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
-2. Parse WARC record envelope (find the HTTP response within)
-3. Extract HTTP response headers:
-   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
-   - `Content-Security-Policy` → check for `frame-ancestors` directive
-4. Parse HTML body:
-   - Extract `<title>` content (first title tag, truncate at 512 chars)
-   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
-     - href (resolve relative URLs against `{protocol}://{hostname}/`)
-     - type attribute (if present)
-     - sizes attribute (if present)
-   - Ignore data: URIs, ignore links to other domains' icons for now
-
-**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
-- Titles look reasonable (not empty, not garbage)
-- Icon URLs are well-formed (absolute, correct protocol)
-- iframe_allowed is set correctly (spot-check against real sites)
-
-**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.
-
-### Step 2.3: Batch DB Writes + Full 100K Run
-
-Implement the database write path:
-1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
-2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
-3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
-4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
-
-Run against the full 100K hosts:
-- Monitor throughput (hosts/sec)
-- Watch for errors (log to stderr)
-
-**Validation:**
-- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
-- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
-- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` — expect 90%+
-- `SELECT source, COUNT(*) FROM icons GROUP BY source;` — see the split
-- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` — expect 30-50%
-- Spot-check: pick some hosts, verify title matches the actual site
-
-**Stats:** `stats/02_warc_parse.json`
-
-**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
+**Result (100K hosts, concurrency 500):**
+- Duration: 5m31s (~300 hosts/sec)
+- Titles found: 93,384 (93%)
+- Icons found: 201,780 (~2 per host)
+- Iframe blocked: 17,855 (18%)
+- Fetch errors: 3
+- DB errors: 0
+- Panics: 0
 
 ---
 
@@ -694,3 +631,31 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
 - c5.xlarge (8GB) is tight for this query — uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
 - Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.
+
+### Phase 2 — Completed 2026-05-17
+
+**Changes from original plan:**
+- Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to `data.commoncrawl.org`. The HTTPS endpoint rate-limits at ~100 concurrent connections (429s). S3 has no such limit.
+- Removed progress bar — it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
+- Added `process.go` and `log.go` files (plan had 4 files, we have 6 — cleaner separation).
+- Added charset detection + UTF-8 conversion (`golang.org/x/net/html/charset` + `golang.org/x/text/transform`) for international titles.
+- Added `strings.ToValidUTF8` sanitization as final safety net for titles that still have invalid bytes after charset conversion.
+- Panic recovery per goroutine — logs `PANIC:` prefix, doesn't mark row as parsed (retryable on next run).
+- DB write errors tracked separately (`DB_ERROR:` prefix, counted in summary + stats JSON).
+
+**Lessons learned:**
+- `data.commoncrawl.org` aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
+- Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or `<meta>`. `charset.DetermineEncoding` catches most but not all. `strings.ToValidUTF8` as final sanitization prevents Postgres encoding errors.
+- gowarc's `HttpHeader()` can return nil for malformed records — always nil-check library return values defensively.
+- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
+- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).
+
+---
+
+## Future Improvements
+
+- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
+- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
+- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
+- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
+- **Encoding: investigate remaining garbled titles** — Some titles still show `�` in output (e.g., `BERGSTRANDS BAGERI �...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
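
Note on the wall-clock estimate in the throughput item above: the ~28 hour figure can be reproduced with a rough calculation. This is only a sketch, and it assumes the full 30M-host run sustains the same ~300 hosts/sec measured on the 100K sample, which may not hold given that the bottleneck has not yet been identified.

```latex
\begin{align*}
\text{measured rate} &\approx \frac{100{,}000\ \text{hosts}}{5\,\text{m}\,31\,\text{s}}
  = \frac{100{,}000\ \text{hosts}}{331\ \text{s}} \approx 302\ \text{hosts/s} \\[4pt]
\text{full-run time} &\approx \frac{30{,}000{,}000\ \text{hosts}}{300\ \text{hosts/s}}
  = 100{,}000\ \text{s} \approx 27.8\ \text{h}
\end{align*}
```

Under the same assumption, any throughput gain scales the estimate inversely: doubling the sustained rate to 600 hosts/sec would cut the run to roughly 14 hours.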