Updated PLAN.md; finished Phase 2

2026-05-17 20:37:38 -04:00 · 2026-05-17 20:37:38 -04:00 · 8b5693b5c6
commit 8b5693b5c6
parent f45e4a6034
1 changed file with 52 additions and 87 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -114,102 +114,39 @@ Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggreg

 **Done when:** 100K hosts in the database with valid WARC coordinates.

-### Step 1.3: Validate WARC Coordinates
+### Step 1.3: Validate WARC Coordinates [COMPLETED]

-Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:
-
-```bash
-# Pick a random row
-psql -c "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;"
-
-# Fetch it with curl byte-range
-curl -r $OFFSET-$((OFFSET + LENGTH - 1)) "https://data.commoncrawl.org/$WARC_FILENAME" | head -c 500
-```
-
-Should see a WARC record header followed by HTTP response headers and HTML.
-
-**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
+Manually fetched WARC records with curl byte-range requests to `data.commoncrawl.org`. Confirmed valid WARC headers, HTTP response, and HTML with `<title>` and `<link rel="icon">` tags.

 ---

-## Phase 2: WARC Parsing (Stage 2)
+## Phase 2: WARC Parsing (Stage 2) [COMPLETED]

-### Step 2.1: Go Project Setup
+### Steps 2.1-2.3 [COMPLETED]

-Set up the shared Go module and the WARC parser binary:
+Binary: `pipeline/02_warc_parse/` (6 files: main.go, warc.go, parser.go, process.go, db.go, log.go)

-```
-pipeline/02_warc_parse/
-├── main.go          # Entry point, CLI flags, orchestration
-├── warc.go          # WARC record fetching (S3 byte-range)
-├── parser.go        # HTML parsing (title, link rel=icon, iframe headers)
-└── db.go            # Postgres batch read/write
-```
+**Architecture:**
+- Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
+- Parses WARC records with `github.com/nlnwa/gowarc/v3`
+- Parses HTML with `golang.org/x/net/html` tokenizer (lenient, stops at `<body>`)
+- Detects charset via `golang.org/x/net/html/charset` and converts to UTF-8
+- Sanitizes titles with `strings.ToValidUTF8` as final safety net
+- Concurrent goroutine pool with configurable concurrency
+- Per-host log lines to stdout + optional log file
+- Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
+- DB errors tracked and logged with `DB_ERROR:` prefix

-Dependencies:
- `github.com/nlnwa/gowarc/v3` — WARC record parser (actively maintained, v3.1.0, handles record envelope + HTTP response extraction correctly)
- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
- `golang.org/x/net/html` — Lenient HTML parser
- `github.com/schollz/progressbar/v3` — Progress bar with ETA, rate, counters
- Standard library `net/http` for S3 byte-range requests
+**CLI:** `./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

-CLI flags:
- `--db` connection string
- `--batch-size` (default 500)
- `--concurrency` (default 1000)
- `--dry-run` (print parsed results, don't write to DB)
- `--limit` (process at most N rows, for testing)
-
-All Go programs display a live progress bar showing: items processed, items/sec, ETA, error count. On completion, print a summary with total duration.
-
-**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
-
-### Step 2.2: WARC Fetch + Parse Logic
-
-Implement:
-1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
-2. Parse WARC record envelope (find the HTTP response within)
-3. Extract HTTP response headers:
-   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
-   - `Content-Security-Policy` → check for `frame-ancestors` directive
-4. Parse HTML body:
-   - Extract `<title>` content (first title tag, truncate at 512 chars)
-   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
-     - href (resolve relative URLs against `{protocol}://{hostname}/`)
-     - type attribute (if present)
-     - sizes attribute (if present)
-   - Ignore data: URIs, ignore links to other domains' icons for now
-
-**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
- Titles look reasonable (not empty, not garbage)
- Icon URLs are well-formed (absolute, correct protocol)
- iframe_allowed is set correctly (spot-check against real sites)
-
-**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.
-
-### Step 2.3: Batch DB Writes + Full 100K Run
-
-Implement the database write path:
-1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
-2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
-3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
-4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
-
-Run against the full 100K hosts:
- Monitor throughput (hosts/sec)
- Watch for errors (log to stderr)
-
-**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` — expect 90%+
- `SELECT source, COUNT(*) FROM icons GROUP BY source;` — see the split
- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` — expect 30-50%
- Spot-check: pick some hosts, verify title matches the actual site
-
-**Stats:** `stats/02_warc_parse.json`
-
-**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
+**Result (100K hosts, concurrency 500):**
+- Duration: 5m31s (~300 hosts/sec)
+- Titles found: 93,384 (93%)
+- Icons found: 201,780 (~2 per host)
+- Iframe blocked: 17,855 (18%)
+- Fetch errors: 3
+- DB errors: 0
+- Panics: 0

 ---

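The goroutine pool and per-goroutine panic recovery listed under **Architecture** could look roughly like the sketch below. It is not the code in `process.go`: the names `processAll`, `result`, and the `work` callback are hypothetical, and the sketch assumes `concurrency >= 1`. It only illustrates the behavior described in the plan, where a panicking host is logged with a `PANIC:` prefix and left unparsed so a later run can retry it.

```go
package main

import (
	"fmt"
	"sync"
)

// result records what happened to one host (hypothetical type for this sketch).
type result struct {
	host     string
	parsed   bool
	panicked bool
}

// processAll runs work for every host with at most `concurrency` goroutines
// in flight. A panic inside work is recovered in that goroutine only; the
// host is then left with parsed == false so it stays retryable.
// Assumes concurrency >= 1.
func processAll(hosts []string, concurrency int, work func(host string) error) []result {
	results := make([]result, len(hosts))
	sem := make(chan struct{}, concurrency) // counting semaphore
	var wg sync.WaitGroup

	for i, h := range hosts {
		wg.Add(1)
		sem <- struct{}{} // blocks while `concurrency` workers are busy
		go func(i int, h string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot even after a panic
			results[i].host = h
			defer func() {
				if r := recover(); r != nil {
					fmt.Printf("PANIC: host=%s err=%v\n", h, r)
					results[i].panicked = true
				}
			}()
			if err := work(h); err == nil {
				results[i].parsed = true
			}
		}(i, h)
	}
	wg.Wait()
	return results
}

func main() {
	hosts := []string{"a.example", "b.example", "c.example"}
	res := processAll(hosts, 2, func(h string) error {
		if h == "b.example" {
			panic("malformed record") // simulated parser failure
		}
		return nil
	})
	for _, r := range res {
		fmt.Printf("%s parsed=%t panicked=%t\n", r.host, r.parsed, r.panicked)
	}
}
```

Each goroutine writes only to its own slot in `results`, so no mutex is needed; `wg.Wait()` makes those writes visible to the caller.
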
@ -694,3 +631,31 @@ On completion, each program prints a summary line and writes its stats JSON (wit
 - CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
 - c5.xlarge (8GB) is tight for this query — uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
 - Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.
+
+### Phase 2 — Completed 2026-05-17
+
+**Changes from original plan:**
+- Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to `data.commoncrawl.org`. The HTTPS endpoint rate-limits at ~100 concurrent connections (403/429 responses). S3 showed no such limit at the concurrency levels we ran (up to 500).
+- Removed progress bar — it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
+- Added `process.go` and `log.go` files (plan had 4 files, we have 6 — cleaner separation).
+- Added charset detection + UTF-8 conversion (`golang.org/x/net/html/charset` + `golang.org/x/text/transform`) for international titles.
+- Added `strings.ToValidUTF8` sanitization as final safety net for titles that still have invalid bytes after charset conversion.
+- Panic recovery per goroutine — logs `PANIC:` prefix, doesn't mark row as parsed (retryable on next run).
+- DB write errors tracked separately (`DB_ERROR:` prefix, counted in summary + stats JSON).
+
+**Lessons learned:**
+- `data.commoncrawl.org` aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
+- Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or `<meta>`. `charset.DetermineEncoding` catches most but not all. `strings.ToValidUTF8` as final sanitization prevents Postgres encoding errors.
+- gowarc's `HttpHeader()` can return nil for malformed records — always nil-check library return values defensively.
+- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
+- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).
+
+---
+
+## Future Improvements
+
+- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
+- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
+- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
+- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
+- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.