added timestamps, warc parser library, log files, progress bars, and testing the frontend with real data to the PLAN.md
parent c50be97fd7
commit 64ae58494b
2 changed files with 58 additions and 13 deletions
56
PLAN.md
|
|
@@ -2,7 +2,7 @@
|
|||
|
||||
This plan builds the system described in ARCHITECTURE.md in incremental steps. We start with 100K hosts to validate the pipeline end-to-end, then scale to the full ~30M.
|
||||
|
||||
Each step has a clear deliverable and validation criteria. Steps within a phase are sequential; some phases can overlap (noted where applicable).
|
||||
Each step has a clear deliverable and validation criteria. Steps are sequential — each phase builds on the previous.
|
||||
|
||||
---
|
||||
|
||||
|
|
@@ -167,7 +167,7 @@ Key considerations:
|
|||
- `SELECT * FROM hosts LIMIT 5;` shows sane data (real hostnames, valid WARC paths)
|
||||
- Spot-check: pick a few hostnames, verify they're real websites
|
||||
|
||||
**Stats to emit:** `stats/01_cc_index.json` with total_domains, https_count, http_count, query_time_seconds.
|
||||
**Stats to emit:** `stats/01_cc_index.json` — includes: started_at, finished_at, duration_seconds, total_domains, https_count, http_count.
|
||||
|
||||
**Done when:** 100K hosts in the database with valid WARC coordinates.
|
||||
|
||||
|
|
@@ -204,8 +204,10 @@ pipeline/02_warc_parse/
|
|||
```
|
||||
|
||||
Dependencies:
|
||||
- `github.com/nlnwa/gowarc/v3` — WARC record parser (actively maintained, v3.1.0, handles record envelope + HTTP response extraction correctly)
|
||||
- `github.com/jackc/pgx/v5` — Postgres driver (pool, batch operations)
|
||||
- `golang.org/x/net/html` — Lenient HTML parser
|
||||
- `github.com/schollz/progressbar/v3` — Progress bar with ETA, rate, counters
|
||||
- Standard library `net/http` for S3 byte-range requests
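A byte-range request with `net/http` might look like the sketch below. It assumes the WARC coordinates are an offset and a length in bytes; the helper name `newRangeRequest` and the example URL are hypothetical. The request is only built here, not sent.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRangeRequest builds a GET request for exactly `length` bytes starting
// at `offset`. HTTP byte ranges are inclusive at both ends, hence the -1.
func newRangeRequest(url string, offset, length int64) (*http.Request, error) {
	if offset < 0 || length <= 0 {
		return nil, fmt.Errorf("invalid range: offset=%d length=%d", offset, length)
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	// Hypothetical coordinates; real ones would come from the hosts table.
	req, err := newRangeRequest("https://example.com/segment.warc.gz", 1024, 512)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range")) // bytes=1024-1535
}
```

When the request is sent, a server that honors the range answers with `206 Partial Content`. A `200` means the range was ignored and the full object is being returned, which is worth treating as an error for multi-gigabyte WARC files.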
|
||||
|
||||
CLI flags:
|
||||
|
|
@@ -215,6 +217,8 @@ CLI flags:
|
|||
- `--dry-run` (print parsed results, don't write to DB)
|
||||
- `--limit` (process at most N rows, for testing)
|
||||
|
||||
All Go programs display a live progress bar showing: items processed, items/sec, ETA, error count. On completion, print a summary with total duration.
|
||||
|
||||
**Done when:** Project compiles, connects to DB, can read a batch of hosts rows.
|
||||
|
||||
### Step 2.2: WARC Fetch + Parse Logic
|
||||
|
|
@@ -292,8 +296,10 @@ CLI flags:
|
|||
Dependencies:
|
||||
- `github.com/jackc/pgx/v5` — Postgres
|
||||
- `github.com/aws/aws-sdk-go-v2` — S3 uploads
|
||||
- Standard library `image` + sub-packages for decoding dimensions
|
||||
- A library for ICO parsing (e.g., `github.com/AvraamMavridis/randomcolor` — actually find a proper ICO decoder, or write a simple one that reads the ICO header for directory entries)
|
||||
- `github.com/schollz/progressbar/v3` — Progress bar
|
||||
- Standard library `image` + `image/png`, `image/gif`, `image/jpeg` for decoding dimensions
|
||||
- `golang.org/x/image/webp` — WebP decoding
|
||||
- ICO parsing: write a minimal decoder (ICO format is simple — 6-byte header + directory entries pointing to BMP/PNG data) or find a maintained library at implementation time
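If the minimal-decoder route is taken, reading the ICO directory could look roughly like this. It is a sketch that only extracts dimensions and data locations from the 6-byte header and the 16-byte directory entries; it does not decode the embedded BMP/PNG payloads. The names `icoEntry` and `parseICODirectory` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// icoEntry holds the parts of one 16-byte directory entry that matter
// for picking an image: its dimensions and where its data lives.
type icoEntry struct {
	Width, Height int
	Size, Offset  uint32
}

// parseICODirectory reads the ICO header (reserved=0, type=1, count)
// and the directory entries that follow it. All fields are little-endian.
func parseICODirectory(data []byte) ([]icoEntry, error) {
	if len(data) < 6 {
		return nil, errors.New("ico: truncated header")
	}
	le := binary.LittleEndian
	if le.Uint16(data[0:2]) != 0 || le.Uint16(data[2:4]) != 1 {
		return nil, errors.New("ico: not an icon file")
	}
	count := int(le.Uint16(data[4:6]))
	if len(data) < 6+16*count {
		return nil, errors.New("ico: truncated directory")
	}
	entries := make([]icoEntry, 0, count)
	for i := 0; i < count; i++ {
		e := data[6+16*i : 6+16*(i+1)]
		w, h := int(e[0]), int(e[1])
		if w == 0 { // a stored 0 means 256 pixels
			w = 256
		}
		if h == 0 {
			h = 256
		}
		entries = append(entries, icoEntry{
			Width:  w,
			Height: h,
			Size:   le.Uint32(e[8:12]),
			Offset: le.Uint32(e[12:16]),
		})
	}
	return entries, nil
}

func main() {
	// Hand-built directory with one 32x32 entry; no image payload attached.
	sample := []byte{
		0, 0, 1, 0, 1, 0, // header: reserved, type=1 (icon), count=1
		32, 32, 0, 0, 1, 0, 32, 0, // width, height, colors, reserved, planes, bpp
		16, 0, 0, 0, // size of image data
		22, 0, 0, 0, // offset of image data
	}
	entries, err := parseICODirectory(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", entries[0])
}
```

A production decoder would also check that each entry's offset and size stay within the file before slicing out the image data.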
|
||||
|
||||
### Step 3.2: Work Claiming + Download Logic
|
||||
|
||||
|
|
@@ -442,15 +448,20 @@ Implement:
|
|||
|
||||
## Phase 5: Frontend (Stage 6)
|
||||
|
||||
This phase can begin in parallel with Phase 3-4 using mock bundle data.
|
||||
Begins after Phase 4 is complete — we use real bundle data from the 100K pipeline run for frontend development.
|
||||
|
||||
### Step 5.1: Mock Data for Frontend Dev
|
||||
### Step 5.1: Local Dev Server
|
||||
|
||||
Generate 2-3 small mock bundle files (`tabs/0.json`, `tabs/1.json`, `tabs/2.json`) with ~20 entries each. Use real favicons (Google, GitHub, Wikipedia, etc.) manually base64-encoded. This lets us develop the frontend without waiting for the pipeline.
|
||||
Sync a few of the generated bundles from S3 and serve them locally for frontend development:
|
||||
|
||||
Serve locally with any static file server (`python -m http.server`).
|
||||
```bash
|
||||
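# Sync a few bundles locally for testing.
# `aws s3 sync` has no --max-items option, so filter by name instead
# (the pattern assumes zero-padded bundle names such as 0042.json).
aws s3 sync s3://everytab-site/tabs/ ./local-tabs/ --exclude "*" --include "000*.json"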
|
||||
# Serve with any static file server
|
||||
python -m http.server 8000
|
||||
```
|
||||
|
||||
**Done when:** Mock bundles exist and can be served locally.
|
||||
**Done when:** Can fetch real bundle JSON from a local dev server.
|
||||
|
||||
### Step 5.2: Basic Tab Rendering
|
||||
|
||||
|
|
@@ -661,11 +672,30 @@ aws ec2 terminate-instances --instance-ids i-xxxxx
|
|||
|
||||
## Development Notes
|
||||
|
||||
### What Can Be Parallelized
|
||||
### Execution Order
|
||||
|
||||
- **Frontend dev (Phase 5.1-5.5)** can happen at any time using mock data
|
||||
- **AWS infra setup (Phase 0.2)** can happen while writing code locally
|
||||
- **Icon downloader (Phase 3)** and **bundle generator (Phase 4)** are independent codebases, can be written in parallel
|
||||
Phases are sequential: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8. Frontend (Phase 5) uses real data from the 100K pipeline run. The only work that can start early is writing Go code locally before EC2 is ready (compile and test locally, then run on EC2).
|
||||
|
||||
### Progress & Observability
|
||||
|
||||
All Go programs write two kinds of terminal output simultaneously (per-item log lines and a progress bar) and can optionally mirror the log lines to a file:
|
||||
|
||||
**Per-item log lines** (stdout, above the progress bar):
|
||||
- WARC parser: `parsed: example.com 200 "Example Domai..." ok` or `parsed: broken.net 200 "" err:no_title`
|
||||
- Icon downloader: `icon: https://example.com/favicon.ico 32x32 png 4.2KB ok` or `icon: https://fail.org/favicon.ico err:timeout`
|
||||
- Bundle generator: `bundle: 0042.json 120 entries 247KB ok`
|
||||
|
||||
Each line is a short, fixed-format summary — hostname/URL, key result, and status. Keeps it scannable when running live.
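The WARC parser line format above could be generated by a helper like the one sketched here. The function names are hypothetical, and the 13-character title cutoff is inferred from the `"Example Domai..."` example rather than stated in the plan.

```go
package main

import "fmt"

// truncate shortens s to at most max runes, appending "..." when cut.
// Working on runes avoids splitting multi-byte UTF-8 characters.
func truncate(s string, max int) string {
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max]) + "..."
}

// formatParsedLine renders one per-item line for the WARC parser.
// An empty errCode means success; otherwise the status is "err:<code>".
func formatParsedLine(host string, status int, title, errCode string) string {
	result := "ok"
	if errCode != "" {
		result = "err:" + errCode
	}
	// %q quotes the title and escapes embedded quotes or newlines,
	// which keeps every entry on a single line.
	return fmt.Sprintf("parsed: %s %d %q %s", host, status, truncate(title, 13), result)
}

func main() {
	fmt.Println(formatParsedLine("example.com", 200, "Example Domain", ""))
	fmt.Println(formatParsedLine("broken.net", 200, "", "no_title"))
}
```

Running it reproduces the two sample lines shown for the WARC parser.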
|
||||
|
||||
**Log file** (`--log-file path/to/out.log`): If provided, mirror all per-item log lines to disk. For full-scale runs, consider the `--log-errors-only` flag, which writes only error lines to the log file (avoids filling the disk with 30M success lines). Without `--log-file`, logs go only to stdout.
|
||||
|
||||
**Progress bar** (bottom of terminal, `schollz/progressbar`):
|
||||
- Items processed / total items
|
||||
- Processing rate (items/sec)
|
||||
- ETA
|
||||
- Error count
|
||||
|
||||
On completion, each program prints a summary line and writes its stats JSON (with started_at, finished_at, duration_seconds, and stage-specific counters).
|
||||
|
||||
### Testing Strategy
|
||||
|
||||
|
|
|
|||