update infra README for cloud init

2026-05-25 21:43:31 -04:00 · 2026-05-25 21:43:31 -04:00 · 4bfe165fac
commit 4bfe165fac
parent a92c838d23
1 changed file with 40 additions and 54 deletions
--- a/infra/README.md
+++ b/infra/README.md
@@ -6,79 +6,65 @@ Two EC2 instances during scanning:
 - **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS
 - **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS)

-Both provisioned by Terraform with `user_data` scripts that run on first boot:
- Compute: `ec2-userdata.sh` (Go, DuckDB, Unbound, swap)
- Database: `db-setup.sh` (NVMe format, Postgres install + config)
+Both provisioned by Terraform with `user_data` scripts that auto-run on first boot:
+- Compute: `ec2-userdata.sh` — installs Go, DuckDB, Unbound, swap; clones repo; builds binaries; applies DB schema
+- Database: `db-setup.sh` — formats NVMe, installs Postgres, creates database + schema

-## 1. Terraform
+## Quick Start
+
+Everything runs from your local machine unless noted.

 ```bash
+# 1. Create infrastructure
 cd infra
-cp terraform.tfvars.example terraform.tfvars  # fill in your values
+cp terraform.tfvars.example terraform.tfvars  # fill in your values (including repo_url)
 terraform init
 terraform apply
-```

-This creates both instances. They auto-provision via user_data (~3 minutes).
-
-## 2. SSH Key
-
-```bash
+# 2. Save SSH key
 terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key
-terraform output ssh_command     # SSH to compute instance
-terraform output ssh_command_db  # SSH to database instance
+
+# 3. Wait ~3-5 minutes for both instances to auto-provision, then verify
+ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \
+  'pg_isready -h $(grep DATABASE_URL ~/.bashrc | cut -d@ -f2 | cut -d: -f1)'
 ```

-## 3. Verify Database is Ready
+If `repo_url` is set in tfvars, the compute instance automatically:
+- Clones the repo
+- Builds all Go binaries
+- Waits for the DB to be ready
+- Applies the schema

-```bash
-# From your local machine or the compute instance
-pg_isready -h $(terraform output -raw db_private_ip)
-```
+## Running the Pipeline

-If not ready yet, SSH to the DB instance and check `cloud-init` logs:
-```bash
-tail -f /var/log/cloud-init-output.log
-```
-
-## 4. Clone Repo + Build on Compute Instance
+SSH to the compute instance — everything is ready:

 ```bash
 ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip)

-git clone <your-repo-url> ~/everytab
-cd ~/everytab
-go build -o ~/warc_parse ./pipeline/02_warc_parse/
-go build -o ~/icon_download ./pipeline/03_icon_download/
-go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
+# DATABASE_URL is already in .bashrc, binaries already built
+# Start the pipeline (see pipeline/README.md for full guide)
+./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0
 ```

-## 5. Connect to Database + Apply Schema
+## Debugging (if auto-provision fails)

+Check cloud-init logs on either instance:
 ```bash
-# Get the connection string
-export DATABASE_URL=$(terraform output -raw database_url)
-echo "export DATABASE_URL='$DATABASE_URL'" >> ~/.bashrc
+# Compute instance
+ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \
+  'tail -30 /var/log/cloud-init-output.log'

-# Test connectivity
-psql $DATABASE_URL -c 'SELECT 1;'
-
-# Apply schema
-psql $DATABASE_URL -f ~/everytab/pipeline/01_cc_index/schema.sql
+# DB instance
+ssh -i everytab-key ec2-user@$(terraform output -raw db_public_ip) \
+  'tail -30 /var/log/cloud-init-output.log'
 ```

-## 6. Run Pipeline
-
-See `pipeline/README.md` for the full stage-by-stage guide.
-
 ## Pinning the EC2 AMI

-The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. If Amazon publishes a new one between applies, Terraform will want to replace your instances.
-
-To prevent this, pin the AMI after initial creation:
+The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. Pin it to prevent instance replacement on unrelated changes:

 ```bash
-# Get the current AMI
 aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
  --query "Reservations[0].Instances[0].ImageId" --output text

@@ -86,27 +72,27 @@ aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
 echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars
 ```

-Remove the `ec2_ami` line from tfvars when you want fresh instances with the latest AMI.
+Remove the line when you want fresh instances with the latest AMI.

-## Teardown (after backup)
+## Teardown
+
+From the compute instance, back up before tearing down:

 ```bash
-# Back up the database (run from compute instance)
+# Back up database
 pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc

 # Back up icons to homelab
 rsync -avP ~/icons/ homelab:/backups/everytab/icons/
 ```

-Switch to serving-only mode (destroys both EC2 instances):
+From your local machine:

 ```bash
+# Destroy scanning infrastructure (keeps CloudFront + site bucket)
 terraform apply -var="scanning=false"
-```

-Full destroy (including the live site):
-
-```bash
+# Or full destroy (including the live site)
 terraform destroy
 ```
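
Review note: step 3 of the new Quick Start relies on a fixed "~3-5 minutes" wait before running the `pg_isready` check over SSH. A retry loop could poll that check instead of guessing the delay. The sketch below is one possible shape for it, not something this commit adds: `wait_for` is a hypothetical helper name, and the attempt count and delay in the usage comment are assumed values.

```shell
# Hypothetical helper (not part of the repo): retry a command until it succeeds.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumed values: 30 attempts, 10 seconds apart), wrapping the
# verification command from the Quick Start:
#   wait_for 30 10 ssh -i everytab-key \
#     ec2-user@"$(terraform output -raw ec2_public_ip)" \
#     'pg_isready -h $(grep DATABASE_URL ~/.bashrc | cut -d@ -f2 | cut -d: -f1)'
```

Because the loop retries on any non-zero exit status, it also covers the period before SSH itself is reachable on the compute instance. If it still fails after the final attempt, the cloud-init logs from the Debugging section are the next place to look.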