added infra setup with terraform

This commit is contained in:
Joe Lothan 2026-05-17 16:07:50 -04:00
parent 64ae58494b
commit fcf203e1d8
8 changed files with 556 additions and 74 deletions

10
.gitignore vendored Normal file
View file

@ -0,0 +1,10 @@
stats/
*.env
.env.*
infra/everytab-key
infra/everytab-key.pub
infra/.terraform/
infra/.terraform.lock.hcl
infra/terraform.tfstate
infra/terraform.tfstate.backup
infra/terraform.tfvars

127
PLAN.md
View file

@ -6,95 +6,55 @@ Each step has a clear deliverable and validation criteria. Steps are sequential
---
## Phase 0: Project Setup & AWS Infrastructure
## Phase 0: Project Setup & AWS Infrastructure [COMPLETED]
### Step 0.1: Repository Structure
Create the project layout:
### Step 0.1: Repository Structure [COMPLETED]
```
everytab/
├── design.md
├── ARCHITECTURE.md
├── PLAN.md
├── infra/ # AWS CLI scripts for setup/teardown
│ ├── setup.sh # Create RDS, S3 buckets, security groups
│ ├── teardown.sh # Delete non-permanent resources
│ └── ec2-userdata.sh # EC2 bootstrap (install Go, DuckDB, Unbound)
├── infra/
│ ├── main.tf # Terraform: all AWS resources
│ ├── terraform.tfvars.example
│ ├── ec2-userdata.sh # EC2 bootstrap (Go, DuckDB, Unbound)
│ └── README.md # Setup steps
├── pipeline/
│ ├── 01_cc_index/ # DuckDB query scripts
│ ├── 02_warc_parse/ # Go program
│ ├── 03_icon_download/# Go program
│ ├── 04_best_icon/ # SQL script
│ ├── 05_bundle_gen/ # Go program
│ └── 06_frontend/ # Build script, templates
│ ├── 01_cc_index/
│ │ └── schema.sql # Postgres table definitions
│ ├── 02_warc_parse/
│ ├── 03_icon_download/
│ ├── 04_best_icon/
│ ├── 05_bundle_gen/
│ └── 06_frontend/
├── frontend/
│ ├── index.html
│ └── site.js
├── stats/ # Stats output from each stage (gitignored)
└── go.mod # Shared Go module for pipeline programs
├── stats/ # gitignored
└── go.mod
```
**Done when:** Repo structure exists, `go.mod` initialized, `.gitignore` covers stats/ and any local config.
### Step 0.2: AWS Infrastructure (Terraform) [COMPLETED]
### Step 0.2: AWS Infrastructure (Manual CLI)
Infrastructure managed via `infra/main.tf`. Single file, uses `var.scanning` bool to switch phases:
- `terraform apply` — creates all scanning resources (EC2, RDS, S3 icons, S3 site, IAM, security groups)
- `terraform apply -var="scanning=false"` — destroys scanning resources, keeps site bucket
- `terraform destroy` — removes everything
Create resources using AWS CLI commands in `infra/setup.sh`:
Resources created:
- S3 `everytab-icons` (private), S3 `everytab-site` (for CloudFront later)
- RDS Postgres 16, db.t3.medium, 20GB gp3
- EC2 c5.xlarge, Amazon Linux 2023, 50GB gp3
- Security groups (SSH from home IP, RDS from EC2 only)
- IAM role + instance profile (S3 access only)
- SSH key (Terraform-managed ed25519)
1. **S3 buckets:**
- `everytab-icons` (private, no public access)
- `everytab-site` (private, accessed via CloudFront OAC)
### Step 0.3: EC2 Environment Setup [COMPLETED]
2. **RDS Postgres:**
- `db.t3.medium`, 20GB storage (expandable), Postgres 16
- In a VPC, security group allows inbound 5432 from EC2 security group
- No public access (EC2 connects within VPC)
- No multi-AZ (dev, not production)
- Set a strong password, store in a local `.env` (gitignored)
3. **EC2 instance:**
- `c5.xlarge` (4 vCPU, 8GB RAM) — enough for Go concurrency + Unbound cache
- Amazon Linux 2023 or Ubuntu 24.04
- Security group: allow SSH (from your IP), allow outbound all
- Same VPC/subnet as RDS
- Key pair for SSH access
4. **CloudFront distribution:**
- Origin: `everytab-site` S3 bucket (OAC)
- Default cache behavior: cache everything, Brotli+Gzip compression
- Can set up now or defer to Phase 2
5. **IAM role for EC2:**
- S3 read/write to both buckets
- Attach as instance profile
**Validation:** SSH into EC2, confirm `psql` can connect to RDS, confirm `aws s3 ls` shows both buckets.
**Done when:** All resources exist, EC2 can reach RDS and S3.
### Step 0.3: EC2 Environment Setup
Bootstrap script (`infra/ec2-userdata.sh` or run manually):
1. Install Go (latest stable, 1.22+)
2. Install DuckDB CLI
3. Install Unbound, configure as recursive resolver:
- `/etc/unbound/unbound.conf`: recursive mode, no forwarding, listen on 127.0.0.1
- High cache: `msg-cache-size: 512m`, `rrset-cache-size: 1g`
- `cache-min-ttl: 3600`
- `prefetch: yes`
- `num-threads: 4`
4. Set `/etc/resolv.conf``nameserver 127.0.0.1`
5. Install `psql` client, `pg_dump`
6. Confirm DuckDB httpfs extension works: `INSTALL httpfs; LOAD httpfs;`
**Validation:**
- `go version` works
- `duckdb -c "INSTALL httpfs; LOAD httpfs; SELECT 1;"` works
- `dig example.com @127.0.0.1` resolves (Unbound working)
- `psql $DATABASE_URL -c "SELECT 1;"` connects to RDS
**Done when:** EC2 is a working development environment for all pipeline stages.
Bootstrap via `infra/ec2-userdata.sh`:
- Go 1.22+, DuckDB (httpfs + postgres extensions), Unbound (recursive resolver), psql, tmux
- Unbound configured as system resolver (systemd-resolved disabled)
- DATABASE_URL in .bashrc
- Schema applied: hosts + icons tables with indexes
---
@ -715,3 +675,22 @@ On completion, each program prints a summary line and writes its stats JSON (wit
- **Postgres connection limits:** RDS db.t3.medium has max_connections ≈ 80. With 1000 goroutines, we need connection pooling (pgx pool handles this). Set pool max to ~40 connections.
- **S3 eventual consistency:** After uploading an icon, a HEAD request might not find it immediately. For dedup checks, handle "not found" gracefully (just upload again — idempotent since key is content hash).
- **CloudFront caching:** After deploying new bundles, invalidate `/*` or set short TTL during development. For production, use long TTLs (bundles are immutable between crawls).
---
## Progress Log
### Phase 0 — Completed 2026-05-17
**Changes from original plan:**
- Replaced shell scripts (`setup.sh`, `teardown.sh`) with Terraform (`infra/main.tf`). Single file, `var.scanning` bool switches between scanning and serving phases.
- SSH key is Terraform-managed (no passphrase, stored in state) rather than manually generated.
- CloudFront distribution deferred — not created in Phase 0, will add to Terraform when frontend is ready.
- Added `infra/README.md` with terse setup steps for future replication.
**Lessons learned:**
- Shell scripts with `2>/dev/null || echo "already exists"` swallow real errors. Terraform's declarative model avoids this entirely — errors are always surfaced.
- RDS requires a DB subnet group (2+ subnets in different AZs). The original shell script didn't create one, causing a silent failure. Terraform handles this dependency automatically.
- Amazon Linux 2023 uses `systemd-resolved` which manages `/etc/resolv.conf`. Must disable it before pointing resolv.conf at Unbound. `chattr +i` doesn't work on the symlink.
- AWS EC2 key pairs created via API don't support passphrases. Use `tls_private_key` in Terraform or generate locally with `ssh-keygen` + import.
- When an AWS key pair name already exists from a previous run, Terraform may not regenerate it. Use `-replace` to force recreation of the key + instance together.

3
go.mod Normal file
View file

@ -0,0 +1,3 @@
module github.com/joe/everytab
go 1.25.9

61
infra/README.md Normal file
View file

@ -0,0 +1,61 @@
# Infrastructure Setup
## 1. Terraform
```bash
cd infra
cp terraform.tfvars.example terraform.tfvars # fill in your values
terraform init
terraform apply
```
## 2. SSH Key
```bash
terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key
terraform output ssh_command # prints the ssh command
```
## 3. Bootstrap EC2
```bash
scp -i everytab-key ec2-userdata.sh ec2-user@<IP>:~
ssh -i everytab-key ec2-user@<IP> 'bash ~/ec2-userdata.sh'
```
## 4. Clone Repo on EC2
```bash
git clone <your-repo-url> ~/everytab
cd ~/everytab
```
## 5. Database Setup
On the EC2 instance:
```bash
# Add to .bashrc (get the URL from: terraform output -raw database_url)
echo "export DATABASE_URL='postgres://everytab:PASS@ENDPOINT:5432/everytab'" >> ~/.bashrc
source ~/.bashrc
# Test connection
psql $DATABASE_URL -c 'SELECT 1;'
# Create schema
psql $DATABASE_URL -f ~/everytab/pipeline/01_cc_index/schema.sql
```
## Teardown (after backup)
Switch to serving-only mode (destroys EC2, RDS, icons bucket):
```bash
terraform apply -var="scanning=false"
```
Full destroy (including the live site):
```bash
terraform destroy
```

124
infra/ec2-userdata.sh Executable file
View file

@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -euo pipefail
# EveryTab EC2 Bootstrap
# Run this on the EC2 instance after first SSH connection.
# Installs: Go, DuckDB, Unbound, psql, pg_dump
echo "=== EveryTab EC2 Bootstrap ==="
# --- System packages ---
echo "--- Installing system packages ---"
sudo dnf update -y
sudo dnf install -y \
gcc \
git \
postgresql16 \
unbound \
jq \
htop \
tmux
# --- Go ---
echo "--- Installing Go ---"
GO_VERSION="1.22.4"
if ! command -v go &>/dev/null; then
curl -fsSL "https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz" | sudo tar -C /usr/local -xz
echo 'export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin' >> ~/.bashrc
export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin
fi
go version
# --- DuckDB ---
echo "--- Installing DuckDB ---"
DUCKDB_VERSION="1.1.0"
if ! command -v duckdb &>/dev/null; then
curl -fsSL "https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip" -o /tmp/duckdb.zip
cd /tmp && unzip -o duckdb.zip && sudo mv duckdb /usr/local/bin/ && cd -
fi
duckdb -c "SELECT 'DuckDB OK';"
# Install DuckDB extensions
duckdb -c "INSTALL httpfs; INSTALL postgres;"
echo "DuckDB extensions installed"
# --- Unbound ---
echo "--- Configuring Unbound ---"
sudo tee /etc/unbound/unbound.conf > /dev/null <<'UNBOUNDCONF'
server:
interface: 127.0.0.1
port: 53
access-control: 127.0.0.0/8 allow
# Performance
num-threads: 4
msg-cache-slabs: 4
rrset-cache-slabs: 4
infra-cache-slabs: 4
key-cache-slabs: 4
# Cache sizing (use available RAM)
msg-cache-size: 512m
rrset-cache-size: 1g
key-cache-size: 256m
# Aggressive caching
cache-min-ttl: 3600
cache-max-ttl: 86400
prefetch: yes
prefetch-key: yes
# Hardening
hide-identity: yes
hide-version: yes
harden-glue: yes
harden-dnssec-stripped: yes
# Logging (minimal)
verbosity: 1
log-queries: no
# Root hints
root-hints: "/etc/unbound/root.hints"
remote-control:
control-enable: yes
control-interface: 127.0.0.1
UNBOUNDCONF
# Download root hints
sudo curl -fsSL https://www.internic.net/domain/named.root -o /etc/unbound/root.hints
# Disable systemd-resolved if present (it manages resolv.conf on AL2023)
if systemctl is-active --quiet systemd-resolved 2>/dev/null; then
sudo systemctl disable --now systemd-resolved
fi
# Set system resolver to use Unbound
sudo rm -f /etc/resolv.conf
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf > /dev/null
# Start and enable Unbound
sudo systemctl enable unbound
sudo systemctl restart unbound
# Generate control keys for unbound-control stats
sudo unbound-control-setup 2>/dev/null || true
echo ""
# --- Validation ---
echo "=== Validation ==="
echo -n "Go: "; go version
echo -n "DuckDB: "; duckdb -c "SELECT version();" -noheader -csv
echo -n "Unbound: "; dig +short example.com @127.0.0.1 | head -1
echo -n "psql: "; psql --version
echo ""
echo "=== Bootstrap Complete ==="
echo ""
echo "Next: set up your database connection string."
echo " export DATABASE_URL='postgres://everytab:PASSWORD@RDS_ENDPOINT:5432/everytab'"
echo ""
echo "Test connection:"
echo " psql \$DATABASE_URL -c 'SELECT 1;'"

253
infra/main.tf Normal file
View file

@ -0,0 +1,253 @@
terraform {
required_version = ">= 1.5"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
tls = {
source = "hashicorp/tls"
version = "~> 4.0"
}
}
}
provider "aws" {
region = var.region
}
# --- Variables ---
variable "region" {
default = "us-east-1"
}
variable "vpc_id" {
description = "VPC ID to deploy into"
type = string
}
variable "subnet_ids" {
description = "At least 2 subnet IDs in different AZs (required for RDS subnet group)"
type = list(string)
}
variable "db_password" {
description = "Postgres master password"
type = string
sensitive = true
}
variable "ssh_cidr" {
description = "CIDR block for SSH access (e.g., 203.0.113.50/32)"
type = string
}
variable "ec2_instance_type" {
default = "c5.xlarge"
}
variable "scanning" {
description = "Set to true during scanning phase, false for serving-only (tears down EC2, RDS, icons bucket)"
type = bool
default = true
}
# --- Data sources ---
data "aws_ami" "al2023" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-2023*-x86_64"]
}
filter {
name = "state"
values = ["available"]
}
}
# --- SSH Key ---
resource "tls_private_key" "ec2" {
count = var.scanning ? 1 : 0
algorithm = "ED25519"
}
resource "aws_key_pair" "ec2" {
count = var.scanning ? 1 : 0
key_name = "everytab-key"
public_key = tls_private_key.ec2[0].public_key_openssh
}
# --- Security Groups ---
resource "aws_security_group" "ec2" {
count = var.scanning ? 1 : 0
name = "everytab-ec2"
description = "EveryTab EC2 instance"
vpc_id = var.vpc_id
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [var.ssh_cidr]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "rds" {
count = var.scanning ? 1 : 0
name = "everytab-rds"
description = "EveryTab RDS instance"
vpc_id = var.vpc_id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.ec2[0].id]
}
}
# --- IAM ---
resource "aws_iam_role" "ec2" {
count = var.scanning ? 1 : 0
name = "everytab-ec2-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "ec2.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy" "s3_access" {
count = var.scanning ? 1 : 0
name = "everytab-s3-access"
role = aws_iam_role.ec2[0].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket", "s3:HeadObject"]
Resource = [
aws_s3_bucket.icons[0].arn,
"${aws_s3_bucket.icons[0].arn}/*",
aws_s3_bucket.site.arn,
"${aws_s3_bucket.site.arn}/*",
]
}]
})
}
resource "aws_iam_instance_profile" "ec2" {
count = var.scanning ? 1 : 0
name = "everytab-ec2-profile"
role = aws_iam_role.ec2[0].name
}
# --- S3 ---
resource "aws_s3_bucket" "icons" {
count = var.scanning ? 1 : 0
bucket = "everytab-icons"
}
resource "aws_s3_bucket_public_access_block" "icons" {
count = var.scanning ? 1 : 0
bucket = aws_s3_bucket.icons[0].id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket" "site" {
bucket = "everytab-site"
}
# --- RDS ---
resource "aws_db_subnet_group" "main" {
count = var.scanning ? 1 : 0
name = "everytab"
subnet_ids = var.subnet_ids
}
resource "aws_db_instance" "main" {
count = var.scanning ? 1 : 0
identifier = "everytab"
engine = "postgres"
engine_version = "16"
instance_class = "db.t3.medium"
allocated_storage = 20
storage_type = "gp3"
db_name = "everytab"
username = "everytab"
password = var.db_password
db_subnet_group_name = aws_db_subnet_group.main[0].name
vpc_security_group_ids = [aws_security_group.rds[0].id]
publicly_accessible = false
multi_az = false
backup_retention_period = 0
skip_final_snapshot = true
}
# --- EC2 ---
resource "aws_instance" "main" {
count = var.scanning ? 1 : 0
ami = data.aws_ami.al2023.id
instance_type = var.ec2_instance_type
key_name = aws_key_pair.ec2[0].key_name
vpc_security_group_ids = [aws_security_group.ec2[0].id]
subnet_id = var.subnet_ids[0]
iam_instance_profile = aws_iam_instance_profile.ec2[0].name
root_block_device {
volume_size = 50
volume_type = "gp3"
}
tags = {
Name = "everytab"
}
}
# --- Outputs ---
output "ec2_public_ip" {
value = var.scanning ? aws_instance.main[0].public_ip : null
}
output "rds_endpoint" {
value = var.scanning ? aws_db_instance.main[0].endpoint : null
}
output "database_url" {
value = var.scanning ? "postgres://everytab:${var.db_password}@${aws_db_instance.main[0].endpoint}/everytab" : null
sensitive = true
}
output "ssh_private_key" {
value = var.scanning ? tls_private_key.ec2[0].private_key_openssh : null
sensitive = true
}
output "ssh_command" {
value = var.scanning ? "ssh -i everytab-key ec2-user@${aws_instance.main[0].public_ip}" : null
}

View file

@ -0,0 +1,10 @@
# Copy to terraform.tfvars and fill in your values
vpc_id = "vpc-0abc123def456"
subnet_ids = ["subnet-0abc123", "subnet-0def456"] # 2+ subnets in different AZs
db_password = "change-me-to-something-secure"
ssh_cidr = "203.0.113.50/32" # Your home IP
# Optional overrides:
# region = "us-east-1"
# ec2_instance_type = "c5.xlarge"
# scanning = true

View file

@ -0,0 +1,42 @@
-- EveryTab database schema
-- Run once after RDS is available:
-- psql $DATABASE_URL -f schema.sql
CREATE TABLE IF NOT EXISTS hosts (
id SERIAL PRIMARY KEY,
hostname TEXT NOT NULL UNIQUE,
protocol TEXT NOT NULL,
crawl_id TEXT NOT NULL,
warc_filename TEXT NOT NULL,
warc_record_offset BIGINT NOT NULL,
warc_record_length INT NOT NULL,
html_title TEXT,
iframe_allowed BOOLEAN,
best_icon_s3_key TEXT,
parsed BOOLEAN DEFAULT FALSE
);
CREATE TABLE IF NOT EXISTS icons (
id SERIAL PRIMARY KEY,
host_id INT NOT NULL REFERENCES hosts(id),
url TEXT NOT NULL,
source TEXT NOT NULL,
rel_type TEXT,
rel_sizes TEXT,
content_type TEXT,
width INT,
height INT,
file_size INT,
s3_key TEXT,
scan_state TEXT DEFAULT 'unscanned',
error TEXT
);
-- Partial index: only unscanned icons (shrinks as work completes)
CREATE INDEX IF NOT EXISTS idx_icons_unscanned ON icons(id) WHERE scan_state = 'unscanned';
-- For best-icon selection join
CREATE INDEX IF NOT EXISTS idx_icons_host_id ON icons(host_id);
-- For WARC parsing cursor
CREATE INDEX IF NOT EXISTS idx_hosts_unparsed ON hosts(id) WHERE parsed = FALSE;