updated PLAN.md finished with phase 2
parent f45e4a6034
commit 8b5693b5c6
1 changed file with 52 additions and 87 deletions
139
PLAN.md
@@ -114,102 +114,39 @@ Deduplication via `GROUP BY url_host_name` with `first(... ORDER BY ...)` aggreg

**Done when:** 100K hosts in the database with valid WARC coordinates.

### Step 1.3: Validate WARC Coordinates
### Step 1.3: Validate WARC Coordinates [COMPLETED]

Quick sanity check: before writing the full WARC parser, confirm we can actually fetch WARC records:

```bash
# Pick a random row and read its WARC coordinates into shell variables
read -r WARC_FILENAME OFFSET LENGTH < <(psql -At -F ' ' -c "SELECT warc_filename, warc_record_offset, warc_record_length FROM hosts ORDER BY random() LIMIT 1;")

# Fetch it with a curl byte-range request; each record is its own gzip member, so decompress before viewing
curl -s -r "$OFFSET-$((OFFSET + LENGTH - 1))" "https://data.commoncrawl.org/$WARC_FILENAME" | gunzip -c | head -c 500
```

Should see a WARC record header followed by HTTP response headers and HTML.

**Done when:** We can manually fetch 3-5 WARC records and see valid HTML content.
Manually fetched WARC records with curl byte-range requests to `data.commoncrawl.org`. Confirmed valid WARC headers, HTTP response, and HTML with `<title>` and `<link rel="icon">` tags.
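
The same byte-range arithmetic can be sketched in Go. This is a minimal illustration, not the project's fetch code: `rangeHeader` and `newWARCRequest` are hypothetical helper names, and the WARC path in `main` is a made-up placeholder. It only builds the request, so nothing is sent over the network.

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeHeader turns a WARC record offset and length into an inclusive
// HTTP Range value, the same arithmetic as OFFSET-(OFFSET+LENGTH-1) with curl -r.
func rangeHeader(offset, length int64) string {
	return fmt.Sprintf("bytes=%d-%d", offset, offset+length-1)
}

// newWARCRequest builds (but does not send) a byte-range GET for one record.
func newWARCRequest(warcFilename string, offset, length int64) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, "https://data.commoncrawl.org/"+warcFilename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rangeHeader(offset, length))
	return req, nil
}

func main() {
	// Hypothetical coordinates, for illustration only.
	req, err := newWARCRequest("crawl-data/example/segment.warc.gz", 1000, 500)
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Println(req.URL.String(), req.Header.Get("Range"))
}
```

The end of the range is inclusive, which is why the length is reduced by one. An off-by-one here would pull the first byte of the next record into the response.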

---

## Phase 2: WARC Parsing (Stage 2)
## Phase 2: WARC Parsing (Stage 2) [COMPLETED]

### Step 2.1: Go Project Setup
### Steps 2.1-2.3 [COMPLETED]

Set up the shared Go module and the WARC parser binary:
Binary: `pipeline/02_warc_parse/` (6 files: main.go, warc.go, parser.go, process.go, db.go, log.go)

```
pipeline/02_warc_parse/
├── main.go       # Entry point, CLI flags, orchestration
├── warc.go       # WARC record fetching (S3 byte-range)
├── parser.go     # HTML parsing (title, link rel=icon, iframe headers)
└── db.go         # Postgres batch read/write
```
**Architecture:**
- Fetches WARC records via AWS SDK S3 byte-range GetObject (using EC2 instance profile credentials)
- Parses WARC records with `github.com/nlnwa/gowarc/v3`
- Parses HTML with `golang.org/x/net/html` tokenizer (lenient, stops at `<body>`)
- Detects charset via `golang.org/x/net/html/charset` and converts to UTF-8
- Sanitizes titles with `strings.ToValidUTF8` as final safety net
- Concurrent goroutine pool with configurable concurrency
- Per-host log lines to stdout + optional log file
- Panic recovery per goroutine (logs PANIC, doesn't mark row as parsed)
- DB errors tracked and logged with `DB_ERROR:` prefix
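
The goroutine pool with per-goroutine panic recovery might look roughly like the sketch below. It is an illustration under assumptions, not the contents of `process.go`: the `result` type, `safeProcess`, and `runPool` are hypothetical names, and the per-host work is passed in as a plain function so the example stays standard-library only.

```go
package main

import (
	"fmt"
	"sync"
)

// result records the outcome for one host. parsed stays false when the
// work panicked or failed, so the row can be retried on a later run.
type result struct {
	hostID int
	parsed bool
	err    error
}

// safeProcess runs process for one host and converts a panic into an error.
func safeProcess(hostID int, process func(int) error) (res result) {
	res.hostID = hostID
	defer func() {
		if r := recover(); r != nil {
			res.parsed = false
			res.err = fmt.Errorf("PANIC: host %d: %v", hostID, r)
		}
	}()
	if err := process(hostID); err != nil {
		res.err = err
		return res
	}
	res.parsed = true
	return res
}

// runPool fans hostIDs out to a fixed number of worker goroutines.
func runPool(hostIDs []int, concurrency int, process func(int) error) []result {
	jobs := make(chan int)
	results := make(chan result)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				results <- safeProcess(id, process)
			}
		}()
	}
	go func() {
		for _, id := range hostIDs {
			jobs <- id
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()
	var out []result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	res := runPool([]int{1, 2, 3}, 2, func(id int) error {
		if id == 2 {
			panic("malformed record") // simulated parser crash
		}
		return nil
	})
	for _, r := range res {
		fmt.Println(r.hostID, r.parsed, r.err)
	}
}
```

Recovering inside the per-host call rather than at the top of the worker keeps the worker alive, so one bad record costs one row, not a slot in the pool.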

Dependencies:
- `github.com/nlnwa/gowarc/v3`: WARC record parser (actively maintained, v3.1.0, handles record envelope + HTTP response extraction correctly)
- `github.com/jackc/pgx/v5`: Postgres driver (pool, batch operations)
- `golang.org/x/net/html`: Lenient HTML parser
- `github.com/schollz/progressbar/v3`: Progress bar with ETA, rate, counters
- Standard library `net/http` for S3 byte-range requests
**CLI:** `./warc_parse --db URL [--concurrency N] [--batch-size N] [--limit N] [--dry-run] [--log-file PATH] [--log-errors-only]`

CLI flags:
- `--db` connection string
- `--batch-size` (default 500)
- `--concurrency` (default 1000)
- `--dry-run` (print parsed results, don't write to DB)
- `--limit` (process at most N rows, for testing)

All Go programs display a live progress bar showing: items processed, items/sec, ETA, error count. On completion, print a summary with total duration.

**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.

### Step 2.2: WARC Fetch + Parse Logic

Implement:
1. Byte-range fetch from `https://data.commoncrawl.org/{warc_filename}`
2. Parse WARC record envelope (find the HTTP response within)
3. Extract HTTP response headers:
   - `X-Frame-Options` → if present and not `ALLOWALL`, iframe_allowed = false
   - `Content-Security-Policy` → check for `frame-ancestors` directive
4. Parse HTML body:
   - Extract `<title>` content (first title tag, truncate at 512 chars)
   - Extract all `<link rel="icon">` and `<link rel="shortcut icon">`:
     - href (resolve relative URLs against `{protocol}://{hostname}/`)
     - type attribute (if present)
     - sizes attribute (if present)
   - Ignore data: URIs, ignore links to other domains' icons for now
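
The header rules in step 3 can be sketched as a single function. The `X-Frame-Options` rule follows the text directly. How `frame-ancestors` is evaluated is not specified in the plan, so the sketch assumes that any `frame-ancestors` directive without a bare `*` source blocks framing. `iframeAllowed` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// iframeAllowed reports whether the response headers leave framing open.
func iframeAllowed(h http.Header) bool {
	// X-Frame-Options: present and not ALLOWALL means blocked.
	if xfo := strings.TrimSpace(h.Get("X-Frame-Options")); xfo != "" && !strings.EqualFold(xfo, "ALLOWALL") {
		return false
	}
	// Content-Security-Policy: look for a frame-ancestors directive.
	for _, policy := range h.Values("Content-Security-Policy") {
		for _, directive := range strings.Split(policy, ";") {
			fields := strings.Fields(directive)
			if len(fields) == 0 || !strings.EqualFold(fields[0], "frame-ancestors") {
				continue
			}
			// Assumption: only a bare "*" source keeps framing open to everyone.
			allowsAll := false
			for _, src := range fields[1:] {
				if src == "*" {
					allowsAll = true
				}
			}
			if !allowsAll {
				return false
			}
		}
	}
	return true
}

func main() {
	h := http.Header{}
	h.Set("Content-Security-Policy", "default-src 'self'; frame-ancestors 'none'")
	fmt.Println(iframeAllowed(h))             // blocked by frame-ancestors
	fmt.Println(iframeAllowed(http.Header{})) // no restricting headers
}
```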

**Dry-run test:** Run with `--limit 100 --dry-run` and inspect output. Check:
- Titles look reasonable (not empty, not garbage)
- Icon URLs are well-formed (absolute, correct protocol)
- iframe_allowed is set correctly (spot-check against real sites)
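
For the "well-formed icon URL" check, one way to resolve an `href` against `{protocol}://{hostname}/` is sketched below with the standard `net/url` package. `resolveIconURL` is a hypothetical helper. The same-host comparison is an assumption about what "other domains' icons" means, and it assumes `hostname` carries no port.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveIconURL makes href absolute against protocol://hostname/ and
// reports false for data: URIs, non-HTTP schemes, and other hosts.
func resolveIconURL(protocol, hostname, href string) (string, bool) {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(strings.ToLower(href), "data:") {
		return "", false
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	base := &url.URL{Scheme: protocol, Host: hostname, Path: "/"}
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	// Assumption: "other domain" means the resolved host differs from hostname.
	if !strings.EqualFold(abs.Hostname(), hostname) {
		return "", false
	}
	return abs.String(), true
}

func main() {
	for _, href := range []string{"/favicon.png", "img/icon.png", "//cdn.other.net/i.png", "data:image/png;base64,AAAA"} {
		abs, ok := resolveIconURL("https", "example.com", href)
		fmt.Printf("%q -> %q (kept: %v)\n", href, abs, ok)
	}
}
```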

**Done when:** Can parse 100 WARC records correctly with `--dry-run` showing reasonable output.

### Step 2.3: Batch DB Writes + Full 100K Run

Implement the database write path:
1. For each parsed host: UPDATE hosts SET html_title, iframe_allowed, parsed = TRUE
2. For each host: INSERT INTO icons (host_id, url, source='favicon_ico') for `/favicon.ico`
3. For each discovered link rel=icon: INSERT INTO icons (host_id, url, source='link_rel', rel_type, rel_sizes)
4. Use batch/bulk operations (pgx CopyFrom or batch INSERT)
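
As an illustration of the batch INSERT option in item 4, the sketch below builds one multi-row statement with numbered Postgres placeholders. It uses only the standard library and leaves execution to whatever driver call is used. The `icon` struct and `buildIconInsert` are hypothetical; the column names come from the items above. Empty `rel_type` and `rel_sizes` values are passed as empty strings here, which a real write path might prefer to store as NULL.

```go
package main

import (
	"fmt"
	"strings"
)

// icon mirrors the columns named in the write-path steps above.
type icon struct {
	HostID   int64
	URL      string
	Source   string
	RelType  string
	RelSizes string
}

// buildIconInsert returns a single multi-row INSERT and its flattened arguments.
func buildIconInsert(icons []icon) (string, []interface{}) {
	var sb strings.Builder
	sb.WriteString("INSERT INTO icons (host_id, url, source, rel_type, rel_sizes) VALUES ")
	args := make([]interface{}, 0, len(icons)*5)
	for i, ic := range icons {
		if i > 0 {
			sb.WriteString(", ")
		}
		n := i * 5 // five placeholders per row
		fmt.Fprintf(&sb, "($%d, $%d, $%d, $%d, $%d)", n+1, n+2, n+3, n+4, n+5)
		args = append(args, ic.HostID, ic.URL, ic.Source, ic.RelType, ic.RelSizes)
	}
	return sb.String(), args
}

func main() {
	sql, args := buildIconInsert([]icon{
		{HostID: 1, URL: "https://example.com/favicon.ico", Source: "favicon_ico"},
		{HostID: 1, URL: "https://example.com/icon.png", Source: "link_rel", RelType: "image/png", RelSizes: "32x32"},
	})
	fmt.Println(sql)
	fmt.Println(len(args), "arguments")
}
```

One statement per batch trades many round trips for one, which is the property that matters if Postgres write latency turns out to be the bottleneck.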

Run against the full 100K hosts:
- Monitor throughput (hosts/sec)
- Watch for errors (log to stderr)

**Validation:**
- `SELECT COUNT(*) FROM hosts WHERE parsed = TRUE;` should approach 100,000
- `SELECT COUNT(*) FROM icons;` should be > 100,000 (at minimum one /favicon.ico per host)
- `SELECT COUNT(*) FROM hosts WHERE html_title IS NOT NULL;`: expect 90%+
- `SELECT source, COUNT(*) FROM icons GROUP BY source;`: see the split
- `SELECT COUNT(*) FROM hosts WHERE iframe_allowed = FALSE;`: expect 30-50%
- Spot-check: pick some hosts, verify title matches the actual site

**Stats:** `stats/02_warc_parse.json`

**Done when:** All 100K hosts parsed, icons table populated, stats look reasonable.
**Result (100K hosts, concurrency 500):**
- Duration: 5m31s (~300 hosts/sec)
- Titles found: 93,384 (93%)
- Icons found: 201,780 (~2 per host)
- Iframe blocked: 17,855 (18%)
- Fetch errors: 3
- DB errors: 0
- Panics: 0
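
The throughput figure can be checked with a few lines of arithmetic. The sketch below recomputes hosts/sec from the run above and projects the wall-clock time for the 30M-host full run mentioned elsewhere in this plan, assuming the rate stays flat. The function names are illustrative.

```go
package main

import "fmt"

// hostsPerSecond is the average rate over a run of the given length in seconds.
func hostsPerSecond(hosts int, seconds float64) float64 {
	return float64(hosts) / seconds
}

// projectedHours estimates wall-clock hours for hosts at a constant rate.
func projectedHours(hosts int, ratePerSecond float64) float64 {
	return float64(hosts) / ratePerSecond / 3600
}

func main() {
	runSeconds := 5.0*60 + 31 // 5m31s
	fmt.Printf("%.0f hosts/sec\n", hostsPerSecond(100000, runSeconds))            // prints "302 hosts/sec"
	fmt.Printf("%.1f hours for 30M hosts\n", projectedHours(30000000, 300))       // prints "27.8 hours for 30M hosts"
}
```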

---

@@ -694,3 +631,31 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- CC-Index stores `url_port` as NULL for standard ports (80/443), not as the integer. Always check actual column values before writing filters.
- c5.xlarge (8GB) is tight for this query: it uses 6.4GB + swap. For the full 30M run, use c5.2xlarge (16GB).
- Query takes ~692s (11.5 min) for 100K output rows reading all 300 parquet files. Full run without LIMIT will be similar duration but more memory for the hash table.

### Phase 2: Completed 2026-05-17

**Changes from original plan:**
- Used AWS SDK S3 GetObject for WARC byte-range requests instead of HTTPS to `data.commoncrawl.org`. The HTTPS endpoint rate-limits at ~100 concurrent connections (429s). S3 has no such limit.
- Removed progress bar because it interfered with per-host log lines. Replaced with clean stdout log lines + summary at end. Check DB for mid-run progress.
- Added `process.go` and `log.go` files (plan had 4 files, we have 6, for cleaner separation).
- Added charset detection + UTF-8 conversion (`golang.org/x/net/html/charset` + `golang.org/x/text/transform`) for international titles.
- Added `strings.ToValidUTF8` sanitization as final safety net for titles that still have invalid bytes after charset conversion.
- Panic recovery per goroutine: logs `PANIC:` prefix, doesn't mark row as parsed (retryable on next run).
- DB write errors tracked separately (`DB_ERROR:` prefix, counted in summary + stats JSON).

**Lessons learned:**
- `data.commoncrawl.org` aggressively rate-limits (403/429) at ~100 concurrent connections. Use S3 API directly for high-concurrency access.
- Many Chinese/Japanese sites serve GBK or other non-UTF-8 encodings without declaring it in Content-Type or `<meta>`. `charset.DetermineEncoding` catches most but not all. `strings.ToValidUTF8` as final sanitization prevents Postgres encoding errors.
- gowarc's `HttpHeader()` can return nil for malformed records, so always nil-check library return values defensively.
- Increasing concurrency from 100 to 500 didn't improve throughput (~300 hosts/sec either way). The bottleneck is likely Postgres write latency or S3 per-connection bandwidth, not parallelism. Could investigate batch inserts for the full run.
- Progress bars and per-item log lines don't mix well in terminals. Pick one or write progress to a separate channel (file, stderr).
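
The final sanitization step from the encoding lesson can be sketched as follows. This is an assumption-laden illustration, not the code in `parser.go`: `sanitizeTitle` is a hypothetical name, the replacement string passed to `strings.ToValidUTF8` is a guess (the plan does not say which one is used), and the 512-character limit is taken from the original Step 2.2 text.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeTitle replaces invalid UTF-8 byte runs and truncates to maxChars
// runes, so the result is always valid UTF-8 text for Postgres.
func sanitizeTitle(raw string, maxChars int) string {
	// Assumption: invalid byte runs become U+FFFD; the real code may drop them instead.
	clean := strings.ToValidUTF8(raw, "\uFFFD")
	// Truncate by runes, not bytes, so a multi-byte character is never cut in half.
	if runes := []rune(clean); len(runes) > maxChars {
		clean = string(runes[:maxChars])
	}
	return clean
}

func main() {
	// "\xe9" is a Latin-1 byte that is not valid UTF-8 on its own.
	fmt.Printf("%q\n", sanitizeTitle("caf\xe9 menu", 512))
	fmt.Printf("%q\n", sanitizeTitle("日本語タイトル", 3))
}
```

Truncating after sanitization matters: cutting a byte slice at a fixed length first could itself create an invalid trailing sequence.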

---

## Future Improvements

- **WARC parser: retry on fetch errors.** Currently 3 fetch errors out of 100K (tolerable loss). Could add 1 retry with backoff for transient S3 errors.
- **WARC parser: batch DB inserts.** Currently one INSERT per icon. Using pgx batch or CopyFrom could improve DB write throughput and potentially unblock higher concurrency.
- **WARC parser: investigate throughput ceiling.** 300 hosts/sec at both 100 and 500 concurrency suggests a bottleneck. Profile to determine if it's S3 response latency, Postgres writes, or something else. For the full 30M run this determines wall-clock time (~28 hours at current rate).
- **CC-Index query: c5.2xlarge for full run.** 8GB is tight with 6.4GB usage + swap. 16GB instance for the 30M-host full run.
- **Encoding: investigate remaining garbled titles.** Some titles still show `<60>` in output (e.g., `BERGSTRANDS BAGERI <20>...`). These are pages that lie about their encoding. Could try more aggressive charset detection heuristics.
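
The retry idea from the first item could be sketched like this. It is a generic helper under assumptions, not existing project code: `fetchWithRetry` and `errTransient` are hypothetical names, the fetch is abstracted as a function value, and every error is treated as retryable, which a real implementation would likely narrow to transient S3 failures.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary fetch failure in this sketch.
var errTransient = errors.New("transient fetch error")

// fetchWithRetry calls fetch, retrying up to retries times with exponential
// backoff (baseDelay, 2*baseDelay, ...). It returns the number of attempts
// made and the last error.
func fetchWithRetry(fetch func() error, retries int, baseDelay time.Duration) (attempts int, err error) {
	for attempt := 0; ; attempt++ {
		err = fetch()
		attempts = attempt + 1
		if err == nil || attempt >= retries {
			return attempts, err
		}
		time.Sleep(baseDelay << attempt) // double the wait on each retry
	}
}

func main() {
	calls := 0
	attempts, err := fetchWithRetry(func() error {
		calls++
		if calls == 1 {
			return errTransient // first call fails, the retry succeeds
		}
		return nil
	}, 1, 100*time.Millisecond)
	fmt.Println(attempts, err)
}
```

With `retries` set to 1 this matches the "1 retry with backoff" suggestion: at most two fetches per record.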