updated PLAN.md finished with phase 2

Joe Lothan 2026-05-17 20:37:38 -04:00
parent f45e4a6034
commit 8b5693b5c6

PLAN.md

@@ -114,102 +114,39 @@ Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggreg
**Done when:** 100K hosts in the database with valid WARC coordinates.
### Step 1.3: Validate WARC Coordinates
### Step 1.3: Validate WARC Coordinates [COMPLETED]
Quick sanity check — before writing the full WARC parser, confirm we can actually fetch WARC records:
```bash
# Pick a random row
psql -c "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;"
# Fetch it with curl byte-range
curl -r $OFFSET-$((OFFSET + LENGTH - 1)) "https://data.commoncrawl.org/$WARC_FILENAME" | head -c 500
```
Should see a WARC record header followed by HTTP response headers and HTML.
**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
Manually fetched WARC records with curl byte-range requests to `data.commoncrawl.org`. Confirmed valid WARC headers, HTTP response, and HTML with `<title>` and `<link rel="icon">` tags.
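The inclusive byte range behind these requests can be sketched in Go as follows. This is a minimal illustration, not the project's code: `rangeHeader` and `newRangeRequest` are hypothetical names, and the filename, offset, and length in `main` are placeholders rather than real coordinates.

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeHeader builds an HTTP Range header value for one WARC record.
// HTTP byte ranges are inclusive, so the last byte is offset+length-1.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newRangeRequest prepares (but does not send) a GET for a single record.
func newRangeRequest(warcFilename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	return req, nil
}

func main() {
	// Placeholder coordinates; real values come from the hosts table.
	req, err := newRangeRequest("crawl-data/example.warc.gz", 1000, 250)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.String(), req.Header.Get("Range"))
}
```

The `-1` matters: requesting `offset` through `offset+length` would pull one byte of the next record into the response.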
---
## Phase 2: WARC Parsing (Stage 2)
## Phase 2: WARC Parsing (Stage 2) [COMPLETED]
### Step 2.1: Go Project Setup
### Steps 2.1-2.3 [COMPLETED]
Set up the shared Go module and the WARC parser binary:
Binary: `pipeline/02_warc_parse/` (6 files: main.go, warc.go, parser.go, process.go, db.go, log.go)
```
pipeline/02_warc_parse/
├── main.go # Entry point, CLI flags, orchestration
├── warc.go # WARC record fetching (S3 byte-range)
├── parser.go # HTML parsing (title, link rel=icon, iframe headers)
└── db.go # Postgres batch read/write
```
**Architecture:**
- Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
- Parses WARC records with `github.com/nlnwa/gowarc/v3`
- Parses HTML with `golang.org/x/net/html` tokenizer (lenient, stops at `<body>`)
- Detects charset via `golang.org/x/net/html/charset` and converts to UTF-8
- Sanitizes titles with `strings.ToValidUTF8` as final safety net
- Concurrent goroutine pool with configurable concurrency
- Per-host log lines to stdout + optional log file
- Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
- DB errors tracked and logged with `DB_ERROR:` prefix
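The concurrency and panic-recovery points above could look roughly like the sketch below. It assumes a semaphore-bounded pool and a per-host wrapper; `result`, `processSafely`, and `runPool` are hypothetical names, and the real `process.go` may be organized differently.

```go
package main

import (
	"fmt"
	"sync"
)

// result records the outcome for one host (hypothetical type).
type result struct {
	Host     string
	Parsed   bool
	Panicked bool
}

// processSafely runs fn for one host and converts a panic into a result
// with Parsed=false, so the row stays eligible for a later retry.
func processSafely(host string, fn func(string)) (res result) {
	res.Host = host
	defer func() {
		if r := recover(); r != nil {
			fmt.Printf("PANIC: host=%s err=%v\n", host, r)
			res.Parsed = false
			res.Panicked = true
		}
	}()
	fn(host)
	res.Parsed = true
	return res
}

// runPool processes hosts with at most `concurrency` goroutines in flight.
func runPool(hosts []string, concurrency int, fn func(string)) []result {
	results := make([]result, len(hosts))
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, h := range hosts {
		wg.Add(1)
		sem <- struct{}{} // blocks while the pool is full
		go func(i int, h string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = processSafely(h, fn)
		}(i, h)
	}
	wg.Wait()
	return results
}

func main() {
	hosts := []string{"a.example", "b.example", "c.example"}
	results := runPool(hosts, 2, func(h string) {
		if h == "b.example" {
			panic("malformed record")
		}
	})
	for _, r := range results {
		fmt.Printf("%s parsed=%v\n", r.Host, r.Parsed)
	}
}
```

Recovering inside the per-host wrapper, rather than once per worker, keeps one bad record from taking down the goroutine that would otherwise go on to handle other hosts.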
Dependencies:
- `github.com/nlnwa/gowarc/v3` — WARC record parser (actively maintained, v3.1.0, handles record envelope + HTTP response extraction correctly)
- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
- `golang.org/x/net/html` — Lenient HTML parser
- `github.com/schollz/progressbar/v3` — Progress bar with ETA, rate, counters
- Standard library `net/http` for S3 byte-range requests
**CLI:** `./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`
CLI flags:
- `--db` connection string
- `--batch-size` (default 500)
- `--concurrency` (default 1000)
- `--dry-run` (print parsed results, don't write to DB)
- `--limit` (process at most N rows, for testing)
All Go programs display a live progress bar showing: items processed, items/sec, ETA, error count. On completion, print a summary with total duration.
**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
### Step 2.2: WARC Fetch + Parse Logic
Implement:
1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
2. Parse WARC record envelope (find the HTTP response within)
3. Extract HTTP response headers:
- `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
- `Content-Security-Policy` → check for `frame-ancestors` directive
4. Parse HTML body:
- Extract `<title>` content (first title tag, truncate at 512 chars)
- Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
- href (resolve relative URLs against `{protocol}://{hostname}/`)
- type attribute (if present)
- sizes attribute (if present)
- Ignore data: URIs, ignore links to other domains' icons for now
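The header and icon rules in this step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the parser's actual code: `iframeAllowed` and `resolveIconURL` are hypothetical names, and treating any `frame-ancestors` directive without a `*` source as blocking is an assumption, since the plan only says to check for the directive.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// iframeAllowed applies the two header rules from the plan.
// Assumption: frame-ancestors blocks framing unless it lists "*".
func iframeAllowed(h http.Header) bool {
	xfo := strings.TrimSpace(h.Get("X-Frame-Options"))
	if xfo != "" && !strings.EqualFold(xfo, "ALLOWALL") {
		return false
	}
	for _, csp := range h.Values("Content-Security-Policy") {
		for _, directive := range strings.Split(csp, ";") {
			fields := strings.Fields(directive)
			if len(fields) == 0 || !strings.EqualFold(fields[0], "frame-ancestors") {
				continue
			}
			wildcard := false
			for _, src := range fields[1:] {
				if src == "*" {
					wildcard = true
				}
			}
			if !wildcard {
				return false
			}
		}
	}
	return true
}

// resolveIconURL resolves href against {protocol}://{hostname}/ and
// rejects data: URIs and icons hosted on other domains.
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	base, err := url.Parse(protocol + "://" + hostname + "/")
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)
	if !strings.EqualFold(abs.Hostname(), hostname) {
		return "", false
	}
	return abs.String(), true
}

func main() {
	h := http.Header{}
	h.Set("Content-Security-Policy", "default-src 'self'; frame-ancestors 'none'")
	fmt.Println("iframe allowed:", iframeAllowed(h))

	for _, href := range []string{"/favicon.png", "img/icon.png", "//cdn.example.net/i.png"} {
		u, ok := resolveIconURL("https", "example.com", href)
		fmt.Println(href, "->", u, ok)
	}
}
```

Using `url.ResolveReference` covers relative, root-relative, and protocol-relative hrefs in one place, so the cross-domain check only has to compare hostnames afterwards.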
**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
- Titles look reasonable (not empty, not garbage)
- Icon URLs are well-formed (absolute, correct protocol)
- iframe_allowed is set correctly (spot-check against real sites)
**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.
### Step 2.3: Batch DB Writes + Full 100K Run
Implement the database write path:
1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
Run against the full 100K hosts:
- Monitor throughput (hosts/sec)
- Watch for errors (log to stderr)
**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;` — expect 90%+
- `SELECT source, COUNT(*) FROM icons GROUP BY source;` — see the split
- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;` — expect 30-50%
- Spot-check: pick some hosts, verify title matches the actual site
**Stats:** `stats/02_warc_parse.json`
**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
**Result (100K hosts, concurrency 500):**
- Duration: 5m31s (~300 hosts/sec)
- Titles found: 93,384 (93%)
- Icons found: 201,780 (~2 per host)
- Iframe blocked: 17,855 (18%)
- Fetch errors: 3
- DB errors: 0
- Panics: 0
---
@@ -694,3 +631,31 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
- c5.xlarge (8GB) is tight for this query — uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
- Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.
### Phase 2 — Completed 2026-05-17
**Changes from original plan:**
- Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to `data.commoncrawl.org`. The HTTPS endpoint rate-limits at ~100 concurrent connections (429s). S3 showed no such limit at the concurrency levels tested (up to 500).
- Removed progress bar — it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
- Added `process.go` and `log.go` files (plan had 4 files, we have 6 — cleaner separation).
- Added charset detection + UTF-8 conversion (`golang.org/x/net/html/charset` + `golang.org/x/text/transform`) for international titles.
- Added `strings.ToValidUTF8` sanitization as final safety net for titles that still have invalid bytes after charset conversion.
- Panic recovery per goroutine — logs `PANIC:` prefix, doesn't mark row as parsed (retryable on next run).
- DB write errors tracked separately (`DB_ERROR:` prefix, counted in summary + stats JSON).
**Lessons learned:**
- `data.commoncrawl.org` aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
- Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or `<meta>`. `charset.DetermineEncoding` catches most but not all. `strings.ToValidUTF8` as final sanitization prevents Postgres encoding errors.
- gowarc's `HttpHeader()` can return nil for malformed records — always nil-check library return values defensively.
- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).
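The final title sanitization mentioned in the encoding lesson can be sketched as below. `sanitizeTitle` is a hypothetical helper; using U+FFFD as the replacement is an assumption (the notes do not say what invalid bytes are replaced with), and the 512-character cap comes from the original Step 2.2 plan.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeTitle replaces invalid UTF-8 byte sequences and caps the length
// in runes, so truncation never splits a multi-byte character.
func sanitizeTitle(raw string, maxRunes int) string {
	// Assumption: invalid sequences become U+FFFD; an empty replacement
	// would drop them instead.
	clean := strings.ToValidUTF8(raw, "\uFFFD")
	runes := []rune(clean)
	if maxRunes >= 0 && len(runes) > maxRunes {
		clean = string(runes[:maxRunes])
	}
	return clean
}

func main() {
	// "\xe9" is Latin-1 for "é" and is not valid UTF-8 on its own.
	fmt.Printf("%q\n", sanitizeTitle("Caf\xe9 Bageri", 512))
}
```

Running this step after charset conversion means Postgres only ever receives valid UTF-8, even when detection guesses the wrong encoding.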
---
## Future Improvements
- **WARC parser: retry on fetch errors** — Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
- **WARC parser: batch DB inserts** — Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
- **WARC parser: investigate throughput ceiling** — 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run** — 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles** — Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
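For the retry item above, one possible shape is a small wrapper around the fetch call. This is a sketch under assumptions, not a committed design: `withRetry` and `errTransient` are hypothetical names, the delay values are placeholders, and a real version would probably retry only on errors known to be transient.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a retryable S3 error (hypothetical).
var errTransient = errors.New("transient fetch error")

// withRetry calls fn up to `attempts` times, doubling the delay between
// tries. attempts=2 corresponds to "1 retry with backoff".
func withRetry(attempts int, base time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := withRetry(2, 100*time.Millisecond, func() error {
		calls++
		if calls == 1 {
			return errTransient // first try fails, second succeeds
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Wrapping the last error with `%w` keeps it available to `errors.Is`, so the caller can still count it as a fetch error in the summary.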