diff --git a/infra/README.md b/infra/README.md index ca759b0..cd4e711 100644 --- a/infra/README.md +++ b/infra/README.md @@ -6,79 +6,65 @@ Two EC2 instances during scanning: - **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS - **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS) -Both provisioned by Terraform with `user_data` scripts that run on first boot: -- Compute: `ec2-userdata.sh` (Go, DuckDB, Unbound, swap) -- Database: `db-setup.sh` (NVMe format, Postgres install + config) +Both provisioned by Terraform with `user_data` scripts that auto-run on first boot: +- Compute: `ec2-userdata.sh` — installs Go, DuckDB, Unbound, swap; clones repo; builds binaries; applies DB schema +- Database: `db-setup.sh` — formats NVMe, installs Postgres, creates database + schema -## 1. Terraform +## Quick Start + +Everything runs from your local machine unless noted. ```bash +# 1. Create infrastructure cd infra -cp terraform.tfvars.example terraform.tfvars # fill in your values +cp terraform.tfvars.example terraform.tfvars # fill in your values (including repo_url) terraform init terraform apply -``` -This creates both instances. They auto-provision via user_data (~3 minutes). - -## 2. SSH Key - -```bash +# 2. Save SSH key terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key -terraform output ssh_command # SSH to compute instance -terraform output ssh_command_db # SSH to database instance + +# 3. Wait ~3-5 minutes for both instances to auto-provision, then verify +ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \ + 'pg_isready -h $(grep DATABASE_URL ~/.bashrc | cut -d@ -f2 | cut -d: -f1)' ``` -## 3. Verify Database is Ready +If `repo_url` is set in tfvars, the compute instance automatically: +- Clones the repo +- Builds all Go binaries +- Waits for the DB to be ready +- Applies the schema -```bash -# From your local machine or the compute instance -pg_isready -h $(terraform output -raw db_private_ip) -``` +## Running the Pipeline -If not ready yet, SSH to the DB instance and check `cloud-init` logs: -```bash -tail -f /var/log/cloud-init-output.log -``` - -## 4. Clone Repo + Build on Compute Instance +SSH to the compute instance — everything is ready: ```bash ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) -git clone ~/everytab -cd ~/everytab -go build -o ~/warc_parse ./pipeline/02_warc_parse/ -go build -o ~/icon_download ./pipeline/03_icon_download/ -go build -o ~/bundle_gen ./pipeline/05_bundle_gen/ +# DATABASE_URL is already in .bashrc, binaries already built +# Start the pipeline (see pipeline/README.md for full guide) +./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0 ``` -## 5. Connect to Database + Apply Schema +## Debugging (if auto-provision fails) +Check cloud-init logs on either instance: ```bash -# Get the connection string -export DATABASE_URL=$(terraform output -raw database_url) -echo "export DATABASE_URL='$DATABASE_URL'" >> ~/.bashrc +# Compute instance +ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \ + 'tail -30 /var/log/cloud-init-output.log' -# Test connectivity -psql $DATABASE_URL -c 'SELECT 1;' - -# Apply schema -psql $DATABASE_URL -f ~/everytab/pipeline/01_cc_index/schema.sql +# DB instance +ssh -i everytab-key ec2-user@$(terraform output -raw db_public_ip) \ + 'tail -30 /var/log/cloud-init-output.log' ``` -## 6. Run Pipeline - -See `pipeline/README.md` for the full stage-by-stage guide. - ## Pinning the EC2 AMI -The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. If Amazon publishes a new one between applies, Terraform will want to replace your instances. - -To prevent this, pin the AMI after initial creation: +The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. Pin it to prevent instance replacement on unrelated changes: ```bash -# Get the current AMI aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \ --query "Reservations[0].Instances[0].ImageId" --output text @@ -86,27 +72,27 @@ aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \ echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars ``` -Remove the `ec2_ami` line from tfvars when you want fresh instances with the latest AMI. +Remove the line when you want fresh instances with the latest AMI. -## Teardown (after backup) +## Teardown + +From the compute instance, back up before tearing down: ```bash -# Back up the database (run from compute instance) +# Back up database pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc # Back up icons to homelab rsync -avP ~/icons/ homelab:/backups/everytab/icons/ ``` -Switch to serving-only mode (destroys both EC2 instances): +From your local machine: ```bash +# Destroy scanning infrastructure (keeps CloudFront + site bucket) terraform apply -var="scanning=false" -``` -Full destroy (including the live site): - -```bash +# Or full destroy (including the live site) terraform destroy ``` diff --git a/infra/db-setup.sh b/infra/db-setup.sh index 5341024..954c6cf 100755 --- a/infra/db-setup.sh +++ b/infra/db-setup.sh @@ -15,14 +15,13 @@ NVME_DEV="/dev/nvme1n1" NVME_MOUNT="/data" if [ ! -d "$NVME_MOUNT" ]; then - # Find the NVMe instance store (not the root EBS) - # i3.large has one 475GB NVMe at /dev/nvme1n1 or similar + # Find the NVMe instance store — look for unmounted nvme devices if [ ! -b "$NVME_DEV" ]; then - # Try finding it - NVME_DEV=$(lsblk -dpno NAME,SIZE | grep -v "$(lsblk -dpno NAME /)" | head -1 | awk '{print $1}') + NVME_DEV=$(lsblk -dpno NAME | grep nvme | head -1) if [ -z "$NVME_DEV" ]; then echo "ERROR: Could not find NVMe instance store device" - echo "Run 'lsblk' and set NVME_DEV manually" + echo "Available devices:" + lsblk exit 1 fi fi diff --git a/infra/ec2-userdata.sh b/infra/ec2-userdata.sh index 56bce25..078c29b 100755 --- a/infra/ec2-userdata.sh +++ b/infra/ec2-userdata.sh @@ -2,9 +2,11 @@ set -euo pipefail # EveryTab EC2 Bootstrap -# Run this on the EC2 instance after first SSH connection. +# Runs automatically via cloud-init user_data on first boot. # Installs: Go, DuckDB, Unbound, psql, pg_dump +export HOME=/root + echo "=== EveryTab EC2 Bootstrap ===" # --- File descriptor limits --- @@ -44,7 +46,7 @@ sudo dnf install -y \ echo "--- Installing Go ---" GO_VERSION="1.22.4" if ! command -v go &>/dev/null; then - curl -fsSL "https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz" | sudo tar -C /usr/local -xz + curl -fsSL "https://go.dev/dl/go$${GO_VERSION}.linux-amd64.tar.gz" | sudo tar -C /usr/local -xz echo 'export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin' >> ~/.bashrc export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin fi @@ -54,7 +56,7 @@ go version echo "--- Installing DuckDB ---" DUCKDB_VERSION="1.5.2" if ! command -v duckdb &>/dev/null; then - curl -fsSL "https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip" -o /tmp/duckdb.zip + curl -fsSL "https://github.com/duckdb/duckdb/releases/download/v$${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip" -o /tmp/duckdb.zip cd /tmp && unzip -o duckdb.zip && sudo mv duckdb /usr/local/bin/ && cd - fi duckdb -c "SELECT 'DuckDB OK';" @@ -151,7 +153,7 @@ if [ -n "$REPO_URL" ]; then cd /home/ec2-user/everytab echo "--- Building Go binaries ---" - sudo -u ec2-user bash -c 'export PATH=$PATH:/usr/local/go/bin && cd ~/everytab && go build -o ~/warc_parse ./pipeline/02_warc_parse/ && go build -o ~/icon_download ./pipeline/03_icon_download/ && go build -o ~/bundle_gen ./pipeline/05_bundle_gen/' + sudo -u ec2-user bash -c 'export PATH=/usr/local/go/bin:$PATH && cd /home/ec2-user/everytab && go build -o /home/ec2-user/warc_parse ./pipeline/02_warc_parse/ && go build -o /home/ec2-user/icon_download ./pipeline/03_icon_download/ && go build -o /home/ec2-user/bundle_gen ./pipeline/05_bundle_gen/' # Wait for DB to be ready, then apply schema echo "--- Waiting for database ---" diff --git a/infra/main.tf b/infra/main.tf index bb5a110..47ea942 100644 --- a/infra/main.tf +++ b/infra/main.tf @@ -218,6 +218,8 @@ resource "aws_s3_bucket_lifecycle_configuration" "logs" { id = "expire-old-logs" status = "Enabled" + filter {} + expiration { days = 365 }