automated ec2 setup and build

2026-05-25 18:29:37 -04:00 · 2026-05-25 18:29:37 -04:00 · 1afbc41599
commit 1afbc41599
parent bf8b932cdc
5 changed files with 103 additions and 49 deletions
--- a/infra/README.md
+++ b/infra/README.md
@@ -1,5 +1,15 @@
 # Infrastructure Setup

+## Architecture
+
+Two EC2 instances run during scanning:
+- **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS
+- **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS)
+
+Both provisioned by Terraform with `user_data` scripts that run on first boot:
+- Compute: `ec2-userdata.sh` (Go, DuckDB, Unbound, swap)
+- Database: `db-setup.sh` (NVMe format, Postgres install + config)
+
 ## 1. Terraform

 ```bash
@@ -9,60 +19,66 @@ terraform init
 terraform apply
 ```

+This creates both instances, which then provision themselves via `user_data` (about 3 minutes).
+
 ## 2. SSH Key

 ```bash
 terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key
-terraform output ssh_command  # prints the ssh command
+terraform output ssh_command     # SSH to compute instance
+terraform output ssh_command_db  # SSH to database instance
 ```

-## 3. Bootstrap EC2
+## 3. Verify Database is Ready

 ```bash
-scp -i everytab-key ec2-userdata.sh ec2-user@<IP>:~
-ssh -i everytab-key ec2-user@<IP> 'bash ~/ec2-userdata.sh'
+# From your local machine or the compute instance
+pg_isready -h $(terraform output -raw db_private_ip)
 ```

-## 4. Clone Repo on EC2
+If the database is not ready yet, SSH to the DB instance and check the `cloud-init` logs:
+```bash
+tail -f /var/log/cloud-init-output.log
+```
+
+## 4. Clone Repo + Build on Compute Instance

 ```bash
+ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip)
+
 git clone <your-repo-url> ~/everytab
 cd ~/everytab
+go build -o ~/warc_parse ./pipeline/02_warc_parse/
+go build -o ~/icon_download ./pipeline/03_icon_download/
+go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
 ```

-## 5. Database Instance (i3.large)
-
-Spin up an i3.large in the same AZ as the compute instance. This provides 475GB local NVMe with 100K+ IOPS for Postgres — eliminates the EBS/RDS IOPS bottleneck.
+## 5. Connect to Database + Apply Schema

 ```bash
-# Launch i3.large (same subnet/AZ, same key pair, allow port 5432 from compute SG)
-# Then SSH in and run:
-bash ~/everytab/infra/db-setup.sh
-```
-
-This formats the NVMe, installs Postgres on it with aggressive write settings (`fsync=off`), creates the database, and applies the schema.
-
-On the **compute instance** (c5.2xlarge):
-
-```bash
-# Use the private IP printed by db-setup.sh
-echo "export DATABASE_URL='postgres://everytab@<i3-private-ip>:5432/everytab'" >> ~/.bashrc
-source ~/.bashrc
+# Get the connection string
+export DATABASE_URL=$(terraform output -raw database_url)
+echo "export DATABASE_URL='$DATABASE_URL'" >> ~/.bashrc

 # Test connectivity
 psql $DATABASE_URL -c 'SELECT 1;'
+
+# Apply schema
+psql $DATABASE_URL -f ~/everytab/pipeline/01_cc_index/schema.sql
 ```

-Note: the i3's local NVMe is ephemeral — data is lost on stop/terminate. Always `pg_dump` before teardown.
+## 6. Run Pipeline
+
+See `pipeline/README.md` for the full stage-by-stage guide.

 ## Pinning the EC2 AMI

-The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. If Amazon publishes a new one between applies, Terraform will want to replace your EC2 instance.
+The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. If Amazon publishes a new one between applies, Terraform will want to replace your instances.

 To prevent this, pin the AMI after initial creation:

 ```bash
-# Get the current instance's AMI
+# Get the current AMI
 aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
  --query "Reservations[0].Instances[0].ImageId" --output text

@@ -70,21 +86,19 @@ aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
 echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars
 ```

-Now `terraform apply` won't replace the instance for non-EC2 changes (like adding CloudFront logging).
-
-Remove the `ec2_ami` line from tfvars when you want a fresh instance with the latest AMI (e.g., after teardown).
+Remove the `ec2_ami` line from tfvars when you want fresh instances with the latest AMI.

 ## Teardown (after backup)

 ```bash
-# Back up the database first
-pg_dump -U everytab -Fc everytab > ~/everytab_dump.pgfc
+# Back up the database (run from compute instance)
+pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc

-# Back up icons
+# Back up icons to homelab
 rsync -avP ~/icons/ homelab:/backups/everytab/icons/
 ```

-Switch to serving-only mode (destroys EC2, icons bucket):
+Switch to serving-only mode (destroys both EC2 instances):

 ```bash
 terraform apply -var="scanning=false"
@@ -95,3 +109,5 @@ Full destroy (including the live site):
 ```bash
 terraform destroy
 ```
+
+**IMPORTANT:** The i3's local NVMe is ephemeral: all data is lost on stop/terminate. Always run `pg_dump` before teardown.
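
Note on step 3 of the README in this diff: because both instances provision themselves on first boot (roughly 3 minutes, per the README), a single `pg_isready` call may simply run too early. The following is a minimal sketch of a polling helper, not something contained in this commit. `wait_for` is a hypothetical name, and the commented example assumes `pg_isready` is installed where it runs and that the `db_private_ip` output is reachable from that machine.

```bash
#!/usr/bin/env bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of the repo): re-runs COMMAND until it
# succeeds or ATTEMPTS is exhausted. Returns 0 on success, 1 otherwise.
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (assumption: run where Terraform state and pg_isready are available):
# wait_for 30 10 pg_isready -h "$(terraform output -raw db_private_ip)"
```

With 30 attempts spaced 10 seconds apart, the example would wait up to about five minutes before giving up, at which point checking `/var/log/cloud-init-output.log` on the DB instance, as the README suggests, is the next step.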