update infra README for cloud init

This commit is contained in:
Joe Lothan 2026-05-25 21:43:31 -04:00
parent a92c838d23
commit 4bfe165fac

View file

@ -6,79 +6,65 @@ Two EC2 instances during scanning:
- **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS - **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS
- **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS) - **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS)
Both provisioned by Terraform with `user_data` scripts that run on first boot: Both provisioned by Terraform with `user_data` scripts that auto-run on first boot:
- Compute: `ec2-userdata.sh` (Go, DuckDB, Unbound, swap) - Compute: `ec2-userdata.sh` — installs Go, DuckDB, Unbound, swap; clones repo; builds binaries; applies DB schema
- Database: `db-setup.sh` (NVMe format, Postgres install + config) - Database: `db-setup.sh` — formats NVMe, installs Postgres, creates database + schema
## 1. Terraform ## Quick Start
Everything runs from your local machine unless noted.
```bash ```bash
# 1. Create infrastructure
cd infra cd infra
cp terraform.tfvars.example terraform.tfvars # fill in your values cp terraform.tfvars.example terraform.tfvars # fill in your values (including repo_url)
terraform init terraform init
terraform apply terraform apply
```
This creates both instances. They auto-provision via user_data (~3 minutes). # 2. Save SSH key
## 2. SSH Key
```bash
terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key
terraform output ssh_command # SSH to compute instance
terraform output ssh_command_db # SSH to database instance # 3. Wait ~3-5 minutes for both instances to auto-provision, then verify
ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \
'pg_isready -h $(grep DATABASE_URL ~/.bashrc | cut -d@ -f2 | cut -d: -f1)'
``` ```
## 3. Verify Database is Ready If `repo_url` is set in tfvars, the compute instance automatically:
- Clones the repo
- Builds all Go binaries
- Waits for the DB to be ready
- Applies the schema
```bash ## Running the Pipeline
# From your local machine or the compute instance
pg_isready -h $(terraform output -raw db_private_ip)
```
If not ready yet, SSH to the DB instance and check `cloud-init` logs: SSH to the compute instance — everything is ready:
```bash
tail -f /var/log/cloud-init-output.log
```
## 4. Clone Repo + Build on Compute Instance
```bash ```bash
ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip)
git clone <your-repo-url> ~/everytab # DATABASE_URL is already in .bashrc, binaries already built
cd ~/everytab # Start the pipeline (see pipeline/README.md for full guide)
go build -o ~/warc_parse ./pipeline/02_warc_parse/ ./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0
go build -o ~/icon_download ./pipeline/03_icon_download/
go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
``` ```
## 5. Connect to Database + Apply Schema ## Debugging (if auto-provision fails)
Check cloud-init logs on either instance:
```bash ```bash
# Get the connection string # Compute instance
export DATABASE_URL=$(terraform output -raw database_url) ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \
echo "export DATABASE_URL='$DATABASE_URL'" >> ~/.bashrc 'tail -30 /var/log/cloud-init-output.log'
# Test connectivity # DB instance
psql $DATABASE_URL -c 'SELECT 1;' ssh -i everytab-key ec2-user@$(terraform output -raw db_public_ip) \
'tail -30 /var/log/cloud-init-output.log'
# Apply schema
psql $DATABASE_URL -f ~/everytab/pipeline/01_cc_index/schema.sql
``` ```
## 6. Run Pipeline
See `pipeline/README.md` for the full stage-by-stage guide.
## Pinning the EC2 AMI ## Pinning the EC2 AMI
The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. If Amazon publishes a new one between applies, Terraform will want to replace your instances. The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. Pin it to prevent instance replacement on unrelated changes:
To prevent this, pin the AMI after initial creation:
```bash ```bash
# Get the current AMI
aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \ aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
--query "Reservations[0].Instances[0].ImageId" --output text --query "Reservations[0].Instances[0].ImageId" --output text
@ -86,27 +72,27 @@ aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars
``` ```
Remove the `ec2_ami` line from tfvars when you want fresh instances with the latest AMI. Remove the line when you want fresh instances with the latest AMI.
## Teardown (after backup) ## Teardown
From the compute instance, back up before tearing down:
```bash ```bash
# Back up the database (run from compute instance) # Back up database
pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc
# Back up icons to homelab # Back up icons to homelab
rsync -avP ~/icons/ homelab:/backups/everytab/icons/ rsync -avP ~/icons/ homelab:/backups/everytab/icons/
``` ```
Switch to serving-only mode (destroys both EC2 instances): From your local machine:
```bash ```bash
# Destroy scanning infrastructure (keeps CloudFront + site bucket)
terraform apply -var="scanning=false" terraform apply -var="scanning=false"
```
Full destroy (including the live site): # Or full destroy (including the live site)
```bash
terraform destroy terraform destroy
``` ```