added infra setup with terraform

Joe Lothan 2026-05-17 16:07:50 -04:00
parent 64ae58494b
commit fcf203e1d8
8 changed files with 556 additions and 74 deletions

PLAN.md

@@ -6,95 +6,55 @@ Each step has a clear deliverable and validation criteria. Steps are sequential
---
## Phase 0: Project Setup & AWS Infrastructure
## Phase 0: Project Setup & AWS Infrastructure [COMPLETED]
### Step 0.1: Repository Structure
Create the project layout:
### Step 0.1: Repository Structure [COMPLETED]
```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/ # AWS CLI scripts for setup/teardown
│ ├── setup.sh # Create RDS, S3 buckets, security groups
│ ├── teardown.sh # Delete non-permanent resources
│ └── ec2-userdata.sh # EC2 bootstrap (install Go, DuckDB, Unbound)
├── infra/
│ ├── main.tf # Terraform: all AWS resources
│ ├── terraform.tfvars.example
│ ├── ec2-userdata.sh # EC2 bootstrap (Go, DuckDB, Unbound)
│ └── README.md # Setup steps
├── pipeline/
│ ├── 01_cc_index/ # DuckDB query scripts
│ ├── 02_warc_parse/ # Go program
│ ├── 03_icon_download/# Go program
│ ├── 04_best_icon/ # SQL script
│ ├── 05_bundle_gen/ # Go program
│ └── 06_frontend/ # Build script, templates
│ ├── 01_cc_index/
│ │ └── schema.sql # Postgres table definitions
│ ├── 02_warc_parse/
│ ├── 03_icon_download/
│ ├── 04_best_icon/
│ ├── 05_bundle_gen/
│ └── 06_frontend/
├── frontend/
│ ├── index.html
│ └── site.js
├── stats/ # Stats output from each stage (gitignored)
└── go.mod # Shared Go module for pipeline programs
├── stats/ # gitignored
└── go.mod
```
**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.
### Step 0.2: AWS Infrastructure (Terraform) [COMPLETED]
### Step 0.2: AWS Infrastructure (Manual CLI)
Infrastructure managed via `infra/main.tf`. Single file, uses `var.scanning` bool to switch phases:
- `terraform apply` — creates all scanning resources (EC2, RDS, S3 icons, S3 site, IAM, security groups)
- `terraform apply -var="scanning=false"` — destroys scanning resources, keeps site bucket
- `terraform destroy` — removes everything
Create resources using AWS CLI commands in `infra/setup.sh`:
Resources created:
- S3 `everytab-icons` (private), S3 `everytab-site` (for CloudFront later)
- RDS Postgres 16, db.t3.medium, 20GB gp3
- EC2 c5.xlarge, Amazon Linux 2023, 50GB gp3
- Security groups (SSH from home IP, RDS from EC2 only)
- IAM role + instance profile (S3 access only)
- SSH key (Terraform-managed ed25519)
1. **S3 buckets:**
- `everytab-icons` (private, no public access)
- `everytab-site` (private, accessed via CloudFront OAC)
### Step 0.3: EC2 Environment Setup [COMPLETED]
2. **RDS Postgres:**
- `db.t3.medium`, 20GB storage (expandable), Postgres 16
- In a VPC, security group allows inbound 5432 from EC2 security group
- No public access (EC2 connects within VPC)
- No multi-AZ (dev, not production)
- Set a strong password, store in a local `.env` (gitignored)
3. **EC2 instance:**
- `c5.xlarge` (4 vCPU, 8GB RAM) — enough for Go concurrency + Unbound cache
- Amazon Linux 2023 or Ubuntu 24.04
- Security group: allow SSH (from your IP), allow outbound all
- Same VPC/subnet as RDS
- Key pair for SSH access
4. **CloudFront distribution:**
- Origin: `everytab-site` S3 bucket (OAC)
- Default cache behavior: cache everything, Brotli+Gzip compression
- Can set up now or defer to Phase 2
5. **IAM role for EC2:**
- S3 read/write to both buckets
- Attach as instance profile
**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.
**Done when:** All resources exist, EC2 can reach RDS and S3.
### Step 0.3: EC2 Environment Setup
Bootstrap script (`infra/ec2-userdata.sh` or run manually):
1. Install Go (latest stable, 1.22+)
2. Install DuckDB CLI
3. Install Unbound, configure as recursive resolver:
- `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
- High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
- `cache-min-ttl: 3600`
- `prefetch: yes`
- `num-threads: 4`
4. Set `/etc/resolv.conf` to `nameserver 127.0.0.1`
5. Install `psql` client, `pg_dump`
6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`
**Validation:**
- `go version` works
- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
- `dig example.com @127.0.0.1` resolves (Unbound working)
- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS
**Done when:** EC2 is a working development environment for all pipeline stages.
Bootstrap via `infra/ec2-userdata.sh`:
- Go 1.22+, DuckDB (httpfs + postgres extensions), Unbound (recursive resolver), psql, tmux
- Unbound configured as system resolver (systemd-resolved disabled)
- DATABASE_URL in .bashrc
- Schema applied: hosts + icons tables with indexes
---
@@ -715,3 +675,22 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **Postgres connection limits:** RDS db.t3.medium has max_connections ≈ 80. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- **S3 consistency:** S3 has provided strong read-after-write consistency since December 2020, so a HEAD request right after a successful upload finds the object. Dedup checks should still handle "not found" gracefully (just upload again; this is idempotent since the key is the content hash).
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).
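The idempotent-upload point above relies on the object key being derived from the icon's bytes. A minimal sketch of that idea follows. It assumes SHA-256 and a bare hex key with no prefix or extension; the plan only says "content hash", so the hash choice, the key layout, and the function name `iconKey` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// iconKey derives an object key from the icon's raw bytes.
// Assumption: SHA-256, hex-encoded, no prefix. The plan does not
// specify the hash function or the key layout.
func iconKey(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	first := iconKey([]byte("favicon bytes"))
	again := iconKey([]byte("favicon bytes"))
	other := iconKey([]byte("different bytes"))

	// Re-uploading identical bytes targets the same key, so a repeated
	// PUT overwrites the object with identical content.
	fmt.Println("same content, same key:", first == again)
	fmt.Println("different content, same key:", first == other)
}
```

Because the key depends only on the content, retrying an upload after a failed or missing dedup check cannot create a second copy of the same icon.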
---
## Progress Log
### Phase 0 — Completed 2026-05-17
**Changes from original plan:**
- Replaced shell scripts (`setup.sh`, `teardown.sh`) with Terraform (`infra/main.tf`). Single file, `var.scanning` bool switches between scanning and serving phases.
- SSH key is Terraform-managed (no passphrase, stored in state) rather than manually generated.
- CloudFront distribution deferred — not created in Phase 0, will add to Terraform when frontend is ready.
- Added `infra/README.md` with terse setup steps for future replication.
**Lessons learned:**
- Shell scripts with `2>/dev/null || echo "already exists"` swallow real errors. Terraform's declarative model avoids this entirely — errors are always surfaced.
- RDS requires a DB subnet group (2+ subnets in different AZs). The original shell script didn't create one, causing a silent failure. Terraform handles this dependency automatically.
- Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`. Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
- AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
- When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.