added infra setup with terraform
parent 64ae58494b
commit fcf203e1d8
8 changed files with 556 additions and 74 deletions
127
PLAN.md
@@ -6,95 +6,55 @@ Each step has a clear deliverable and validation criteria. Steps are sequential

---

## Phase 0: Project Setup & AWS Infrastructure
## Phase 0: Project Setup & AWS Infrastructure [COMPLETED]

### Step 0.1: Repository Structure

Create the project layout:
### Step 0.1: Repository Structure [COMPLETED]

```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/  # AWS CLI scripts for setup/teardown
│   ├── setup.sh  # Create RDS, S3 buckets, security groups
│   ├── teardown.sh  # Delete non-permanent resources
│   └── ec2-userdata.sh  # EC2 bootstrap (install Go, DuckDB, Unbound)
├── infra/
│   ├── main.tf  # Terraform: all AWS resources
│   ├── terraform.tfvars.example
│   ├── ec2-userdata.sh  # EC2 bootstrap (Go, DuckDB, Unbound)
│   └── README.md  # Setup steps
├── pipeline/
│   ├── 01_cc_index/  # DuckDB query scripts
│   ├── 02_warc_parse/  # Go program
│   ├── 03_icon_download/  # Go program
│   ├── 04_best_icon/  # SQL script
│   ├── 05_bundle_gen/  # Go program
│   └── 06_frontend/  # Build script, templates
│   ├── 01_cc_index/
│   │   └── schema.sql  # Postgres table definitions
│   ├── 02_warc_parse/
│   ├── 03_icon_download/
│   ├── 04_best_icon/
│   ├── 05_bundle_gen/
│   └── 06_frontend/
├── frontend/
│   ├── index.html
│   └── site.js
├── stats/  # Stats output from each stage (gitignored)
└── go.mod  # Shared Go module for pipeline programs
├── stats/  # gitignored
└── go.mod
```

**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers `stats/` and any local config.
### Step 0.2: AWS Infrastructure (Terraform) [COMPLETED]

### Step 0.2: AWS Infrastructure (Manual CLI)
Infrastructure managed via `infra/main.tf`. Single file, uses `var.scanning` bool to switch phases:
- `terraform apply`: creates all scanning resources (EC2, RDS, S3 icons, S3 site, IAM, security groups)
- `terraform apply -var="scanning=false"`: destroys scanning resources, keeps site bucket
- `terraform destroy`: removes everything

Create resources using AWS CLI commands in `infra/setup.sh`:
Resources created:
- S3 `everytab-icons` (private), S3 `everytab-site` (for CloudFront later)
- RDS Postgres 16, db.t3.medium, 20GB gp3
- EC2 c5.xlarge, Amazon Linux 2023, 50GB gp3
- Security groups (SSH from home IP, RDS from EC2 only)
- IAM role + instance profile (S3 access only)
- SSH key (Terraform-managed ed25519)

1. **S3 buckets:**
   - `everytab-icons` (private, no public access)
   - `everytab-site` (private, accessed via CloudFront OAC)
### Step 0.3: EC2 Environment Setup [COMPLETED]

2. **RDS Postgres:**
   - `db.t3.medium`, 20GB storage (expandable), Postgres 16
   - In a VPC, security group allows inbound 5432 from EC2 security group
   - No public access (EC2 connects within VPC)
   - No multi-AZ (dev, not production)
   - Set a strong password, store in a local `.env` (gitignored)

3. **EC2 instance:**
   - `c5.xlarge` (4 vCPU, 8GB RAM), enough for Go concurrency + Unbound cache
   - Amazon Linux 2023 or Ubuntu 24.04
   - Security group: allow SSH (from your IP), allow outbound all
   - Same VPC/subnet as RDS
   - Key pair for SSH access

4. **CloudFront distribution:**
   - Origin: `everytab-site` S3 bucket (OAC)
   - Default cache behavior: cache everything, Brotli+Gzip compression
   - Can set up now or defer to Phase 2

5. **IAM role for EC2:**
   - S3 read/write to both buckets
   - Attach as instance profile

**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.

**Done when:** All resources exist, EC2 can reach RDS and S3.

### Step 0.3: EC2 Environment Setup

Bootstrap script (`infra/ec2-userdata.sh` or run manually):

1. Install Go (latest stable, 1.22+)
2. Install DuckDB CLI
3. Install Unbound, configure as recursive resolver:
   - `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
   - High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
   - `cache-min-ttl: 3600`
   - `prefetch: yes`
   - `num-threads: 4`
4. Set `/etc/resolv.conf` → `nameserver 127.0.0.1`
5. Install `psql` client, `pg_dump`
6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`

**Validation:**
- `go version` works
- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
- `dig example.com @127.0.0.1` resolves (Unbound working)
- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS

**Done when:** EC2 is a working development environment for all pipeline stages.
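The four validation commands listed above could be bundled into one script that reports every result instead of stopping at the first failure. This is only a sketch: the helper names `check` and `run_checks` are hypothetical and do not exist in the repository, and the commands themselves are copied from the validation list.

```bash
#!/usr/bin/env bash
# Sketch of an environment check runner. `check` and `run_checks` are
# hypothetical helper names, not part of the repository.

# Run a command quietly and print PASS or FAIL with a label.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# Run all four checks from the validation list; return non-zero if any failed.
run_checks() {
  local failed=0
  check "go toolchain" go version || failed=1
  check "duckdb httpfs" duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;" || failed=1
  check "unbound resolver" dig example.com @127.0.0.1 || failed=1
  check "rds connection" psql "$DATABASE_URL" -c "SELECT 1;" || failed=1
  return "$failed"
}
```

Sourcing the file and calling `run_checks` on the instance would print one line per check, so a broken resolver does not hide a broken database connection.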
Bootstrap via `infra/ec2-userdata.sh`:
- Go 1.22+, DuckDB (httpfs + postgres extensions), Unbound (recursive resolver), psql, tmux
- Unbound configured as system resolver (systemd-resolved disabled)
- `DATABASE_URL` in `.bashrc`
- Schema applied: hosts + icons tables with indexes

---

@@ -715,3 +675,22 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **Postgres connection limits:** RDS db.t3.medium has max_connections ≈ 80. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- **S3 eventual consistency:** After uploading an icon, a HEAD request might not find it immediately. For dedup checks, handle "not found" gracefully (just upload again — idempotent since key is content hash).
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).
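The S3 dedup note above depends on the object key being a content hash, so uploading the same bytes twice always targets the same key. Below is a minimal sketch of deriving such a key. It assumes SHA-256 and an `icons/` prefix, neither of which the plan specifies, and `icon_key` is a hypothetical helper name. S3 has provided strong read-after-write consistency since December 2020, so a missed HEAD should be rare, and a repeated upload stays harmless either way. An upload could then look like `aws s3 cp "$f" "s3://everytab-icons/$(icon_key "$f")"`.

```bash
#!/usr/bin/env bash
# Sketch: derive an S3 object key from the file's content hash.
# Assumptions: SHA-256 as the hash, "icons/" as the key prefix, and the
# helper name icon_key. Requires sha256sum (GNU coreutils).
icon_key() {
  local file="$1"
  local hash
  # sha256sum prints "<hex digest>  <filename>"; keep only the digest.
  hash="$(sha256sum "$file" | cut -d' ' -f1)"
  printf 'icons/%s\n' "$hash"
}
```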

---

## Progress Log

### Phase 0: Completed 2026-05-17

**Changes from original plan:**
- Replaced shell scripts (`setup.sh`, `teardown.sh`) with Terraform (`infra/main.tf`). Single file, `var.scanning` bool switches between scanning and serving phases.
- SSH key is Terraform-managed (no passphrase, stored in state) rather than manually generated.
- CloudFront distribution deferred: not created in Phase 0, will add to Terraform when frontend is ready.
- Added `infra/README.md` with terse setup steps for future replication.

**Lessons learned:**
- Shell scripts with `2>/dev/null || echo "already exists"` swallow real errors. Terraform's declarative model avoids this entirely; errors are always surfaced.
- RDS requires a DB subnet group (2+ subnets in different AZs). The original shell script didn't create one, causing a silent failure. Terraform handles this dependency automatically.
- Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`. Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
- AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
- When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.
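
The first lesson can be reproduced in a few lines. In this sketch, `fake_create` is a hypothetical stand-in for an AWS CLI call that fails for a real reason. The `AlreadyExists` pattern is also an assumption; actual AWS error codes differ per service.

```bash
#!/usr/bin/env bash
# fake_create is a hypothetical stand-in for an AWS CLI call that fails
# with a genuine error (not an "already exists" condition).
fake_create() {
  echo "AccessDenied: not authorized" >&2
  return 1
}

# Anti-pattern: stderr is discarded and every failure is reported as benign.
swallowing() {
  fake_create 2>/dev/null || echo "already exists"
}

# Safer sketch: capture stderr and accept only a known benign message.
# The *AlreadyExists* pattern is illustrative, not a real AWS error code.
strict() {
  local err
  if err="$(fake_create 2>&1 >/dev/null)"; then
    echo "created"
  elif [[ "$err" == *AlreadyExists* ]]; then
    echo "already exists"
  else
    echo "error: $err" >&2
    return 1
  fi
}
```

Calling `swallowing` prints `already exists` even though the underlying failure was a permissions error, while `strict` returns non-zero and passes the original message through on stderr.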