Compare commits
3 commits
33bd0a221e
...
4bfe165fac
| Author | SHA1 | Date |
|---|---|---|
| | 4bfe165fac | |
| | a92c838d23 | |
| | 7c5573c24d | |
4 changed files with 52 additions and 63 deletions
|
|
@ -6,79 +6,65 @@ Two EC2 instances during scanning:
|
||||||
- **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS
|
- **c5.2xlarge** (`everytab`) — compute: runs pipeline, stores icons on 1TB EBS
|
||||||
- **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS)
|
- **i3.large** (`everytab-db`) — database: runs Postgres on 475GB local NVMe (100K+ IOPS)
|
||||||
|
|
||||||
Both provisioned by Terraform with `user_data` scripts that run on first boot:
|
Both provisioned by Terraform with `user_data` scripts that auto-run on first boot:
|
||||||
- Compute: `ec2-userdata.sh` (Go, DuckDB, Unbound, swap)
|
- Compute: `ec2-userdata.sh` — installs Go, DuckDB, Unbound, swap; clones repo; builds binaries; applies DB schema
|
||||||
- Database: `db-setup.sh` (NVMe format, Postgres install + config)
|
- Database: `db-setup.sh` — formats NVMe, installs Postgres, creates database + schema
|
||||||
|
|
||||||
## 1. Terraform
|
## Quick Start
|
||||||
|
|
||||||
|
Everything runs from your local machine unless noted.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
# 1. Create infrastructure
|
||||||
cd infra
|
cd infra
|
||||||
cp terraform.tfvars.example terraform.tfvars # fill in your values
|
cp terraform.tfvars.example terraform.tfvars # fill in your values (including repo_url)
|
||||||
terraform init
|
terraform init
|
||||||
terraform apply
|
terraform apply
|
||||||
```
|
|
||||||
|
|
||||||
This creates both instances. They auto-provision via user_data (~3 minutes).
|
# 2. Save SSH key
|
||||||
|
|
||||||
## 2. SSH Key
|
|
||||||
|
|
||||||
```bash
|
|
||||||
terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key
|
terraform output -raw ssh_private_key > everytab-key && chmod 600 everytab-key
|
||||||
terraform output ssh_command # SSH to compute instance
|
|
||||||
terraform output ssh_command_db # SSH to database instance
|
# 3. Wait ~3-5 minutes for both instances to auto-provision, then verify
|
||||||
|
ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \
|
||||||
|
'pg_isready -h $(grep DATABASE_URL ~/.bashrc | cut -d@ -f2 | cut -d: -f1)'
|
||||||
```
|
```
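The verify step extracts the database host from `DATABASE_URL` with `grep` and `cut`, which mis-parses a password that contains `@`. A minimal sketch of a sturdier extraction using parameter expansion follows. The function name `db_host_from_url` and the URL shape `scheme://user:password@host:port/dbname` are assumptions for illustration, not something this change defines, and bracketed IPv6 hosts are not handled.

```bash
# Hypothetical helper: print the host part of a connection URL.
# Assumes the form scheme://user:password@host:port/dbname.
db_host_from_url() {
  local rest=${1#*://}          # drop the scheme
  rest=${rest%%/*}              # drop the path (database name)
  rest=${rest##*@}              # drop credentials, up to the last @
  printf '%s\n' "${rest%%:*}"   # drop the port
}

# Possible use on the compute instance:
#   pg_isready -h "$(db_host_from_url "$DATABASE_URL")"
```

Splitting on the last `@` after removing the path keeps the host intact even when the password itself contains `@` or `:`.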
|
||||||
|
|
||||||
## 3. Verify Database is Ready
|
If `repo_url` is set in tfvars, the compute instance automatically:
|
||||||
|
- Clones the repo
|
||||||
|
- Builds all Go binaries
|
||||||
|
- Waits for the DB to be ready
|
||||||
|
- Applies the schema
|
||||||
|
|
||||||
```bash
|
## Running the Pipeline
|
||||||
# From your local machine or the compute instance
|
|
||||||
pg_isready -h $(terraform output -raw db_private_ip)
|
|
||||||
```
|
|
||||||
|
|
||||||
If not ready yet, SSH to the DB instance and check `cloud-init` logs:
|
SSH to the compute instance — everything is ready:
|
||||||
```bash
|
|
||||||
tail -f /var/log/cloud-init-output.log
|
|
||||||
```
|
|
||||||
|
|
||||||
## 4. Clone Repo + Build on Compute Instance
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip)
|
ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip)
|
||||||
|
|
||||||
git clone <your-repo-url> ~/everytab
|
# DATABASE_URL is already in .bashrc, binaries already built
|
||||||
cd ~/everytab
|
# Start the pipeline (see pipeline/README.md for full guide)
|
||||||
go build -o ~/warc_parse ./pipeline/02_warc_parse/
|
./pipeline/01_cc_index/query.sh --db-url "$DATABASE_URL" --limit 0
|
||||||
go build -o ~/icon_download ./pipeline/03_icon_download/
|
|
||||||
go build -o ~/bundle_gen ./pipeline/05_bundle_gen/
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## 5. Connect to Database + Apply Schema
|
## Debugging (if auto-provision fails)
|
||||||
|
|
||||||
|
Check cloud-init logs on either instance:
|
||||||
```bash
|
```bash
|
||||||
# Get the connection string
|
# Compute instance
|
||||||
export DATABASE_URL=$(terraform output -raw database_url)
|
ssh -i everytab-key ec2-user@$(terraform output -raw ec2_public_ip) \
|
||||||
echo "export DATABASE_URL='$DATABASE_URL'" >> ~/.bashrc
|
'tail -30 /var/log/cloud-init-output.log'
|
||||||
|
|
||||||
# Test connectivity
|
# DB instance
|
||||||
psql $DATABASE_URL -c 'SELECT 1;'
|
ssh -i everytab-key ec2-user@$(terraform output -raw db_public_ip) \
|
||||||
|
'tail -30 /var/log/cloud-init-output.log'
|
||||||
# Apply schema
|
|
||||||
psql $DATABASE_URL -f ~/everytab/pipeline/01_cc_index/schema.sql
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## 6. Run Pipeline
|
|
||||||
|
|
||||||
See `pipeline/README.md` for the full stage-by-stage guide.
|
|
||||||
|
|
||||||
## Pinning the EC2 AMI
|
## Pinning the EC2 AMI
|
||||||
|
|
||||||
The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. If Amazon publishes a new one between applies, Terraform will want to replace your instances.
|
The `data.aws_ami` lookup fetches the latest Amazon Linux 2023 AMI. Pin it to prevent instance replacement on unrelated changes:
|
||||||
|
|
||||||
To prevent this, pin the AMI after initial creation:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Get the current AMI
|
|
||||||
aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
|
aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
|
||||||
--query "Reservations[0].Instances[0].ImageId" --output text
|
--query "Reservations[0].Instances[0].ImageId" --output text
|
||||||
|
|
||||||
|
|
@ -86,27 +72,27 @@ aws ec2 describe-instances --filters "Name=tag:Name,Values=everytab" \
|
||||||
echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars
|
echo 'ec2_ami = "ami-XXXXXXXXXXXX"' >> terraform.tfvars
|
||||||
```
|
```
|
||||||
|
|
||||||
Remove the `ec2_ami` line from tfvars when you want fresh instances with the latest AMI.
|
Remove the line when you want fresh instances with the latest AMI.
|
||||||
|
|
||||||
## Teardown (after backup)
|
## Teardown
|
||||||
|
|
||||||
|
From the compute instance, back up before tearing down:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Back up the database (run from compute instance)
|
# Back up database
|
||||||
pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc
|
pg_dump $DATABASE_URL -Fc > ~/everytab_dump.pgfc
|
||||||
|
|
||||||
# Back up icons to homelab
|
# Back up icons to homelab
|
||||||
rsync -avP ~/icons/ homelab:/backups/everytab/icons/
|
rsync -avP ~/icons/ homelab:/backups/everytab/icons/
|
||||||
```
|
```
|
||||||
|
|
||||||
Switch to serving-only mode (destroys both EC2 instances):
|
From your local machine:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
# Destroy scanning infrastructure (keeps CloudFront + site bucket)
|
||||||
terraform apply -var="scanning=false"
|
terraform apply -var="scanning=false"
|
||||||
```
|
|
||||||
|
|
||||||
Full destroy (including the live site):
|
# Or full destroy (including the live site)
|
||||||
|
|
||||||
```bash
|
|
||||||
terraform destroy
|
terraform destroy
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -15,14 +15,13 @@ NVME_DEV="/dev/nvme1n1"
|
||||||
NVME_MOUNT="/data"
|
NVME_MOUNT="/data"
|
||||||
|
|
||||||
if [ ! -d "$NVME_MOUNT" ]; then
|
if [ ! -d "$NVME_MOUNT" ]; then
|
||||||
# Find the NVMe instance store (not the root EBS)
|
# Find the NVMe instance store: if the default device is missing, fall back to the first nvme device lsblk reports
|
||||||
# i3.large has one 475GB NVMe at /dev/nvme1n1 or similar
|
|
||||||
if [ ! -b "$NVME_DEV" ]; then
|
if [ ! -b "$NVME_DEV" ]; then
|
||||||
# Try finding it
|
NVME_DEV=$(lsblk -dpno NAME | grep nvme | head -1)
|
||||||
NVME_DEV=$(lsblk -dpno NAME,SIZE | grep -v "$(lsblk -dpno NAME /)" | head -1 | awk '{print $1}')
|
|
||||||
if [ -z "$NVME_DEV" ]; then
|
if [ -z "$NVME_DEV" ]; then
|
||||||
echo "ERROR: Could not find NVMe instance store device"
|
echo "ERROR: Could not find NVMe instance store device"
|
||||||
echo "Run 'lsblk' and set NVME_DEV manually"
|
echo "Available devices:"
|
||||||
|
lsblk
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
fi
|
fi
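The fallback above takes the first `nvme` device that `lsblk` lists, whether or not it is mounted. On instance types where the root EBS volume is also exposed as NVMe, that could select the root disk. A sketch of a filter that skips any disk with a mounted partition is shown below. The function name `pick_unmounted_nvme` is hypothetical, and the input is assumed to be the output of `lsblk -pnro NAME,TYPE,MOUNTPOINT`.

```bash
# Hypothetical helper: read "NAME TYPE [MOUNTPOINT]" lines on stdin and
# print the first nvme disk where neither the disk nor a partition is mounted.
pick_unmounted_nvme() {
  awk '
    $2 == "disk" && $1 ~ /nvme/ { disks[++n] = $1 }
    $3 != ""                    { mounted[++m] = $1 }
    END {
      for (i = 1; i <= n; i++) {
        busy = 0
        for (j = 1; j <= m; j++)
          if (mounted[j] == disks[i] || index(mounted[j], disks[i] "p") == 1)
            busy = 1
        if (!busy) { print disks[i]; exit }
      }
    }'
}

# Possible use in the setup script:
#   NVME_DEV=$(lsblk -pnro NAME,TYPE,MOUNTPOINT | pick_unmounted_nvme)
```

If nothing qualifies, the function prints nothing, so the existing `[ -z "$NVME_DEV" ]` check still reports the error and lists the devices.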
|
||||||
|
|
|
||||||
|
|
@ -2,9 +2,11 @@
|
||||||
set -euo pipefail
|
set -euo pipefail
|
||||||
|
|
||||||
# EveryTab EC2 Bootstrap
|
# EveryTab EC2 Bootstrap
|
||||||
# Run this on the EC2 instance after first SSH connection.
|
# Runs automatically via cloud-init user_data on first boot.
|
||||||
# Installs: Go, DuckDB, Unbound, psql, pg_dump
|
# Installs: Go, DuckDB, Unbound, psql, pg_dump
|
||||||
|
|
||||||
|
export HOME=/root
|
||||||
|
|
||||||
echo "=== EveryTab EC2 Bootstrap ==="
|
echo "=== EveryTab EC2 Bootstrap ==="
|
||||||
|
|
||||||
# --- File descriptor limits ---
|
# --- File descriptor limits ---
|
||||||
|
|
@ -44,7 +46,7 @@ sudo dnf install -y \
|
||||||
echo "--- Installing Go ---"
|
echo "--- Installing Go ---"
|
||||||
GO_VERSION="1.22.4"
|
GO_VERSION="1.22.4"
|
||||||
if ! command -v go &>/dev/null; then
|
if ! command -v go &>/dev/null; then
|
||||||
curl -fsSL "https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz" | sudo tar -C /usr/local -xz
|
curl -fsSL "https://go.dev/dl/go$${GO_VERSION}.linux-amd64.tar.gz" | sudo tar -C /usr/local -xz
|
||||||
echo 'export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin' >> ~/.bashrc
|
echo 'export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin' >> ~/.bashrc
|
||||||
export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin
|
export PATH=$PATH:/usr/local/go/bin:$HOME/go/bin
|
||||||
fi
|
fi
|
||||||
|
|
@ -54,7 +56,7 @@ go version
|
||||||
echo "--- Installing DuckDB ---"
|
echo "--- Installing DuckDB ---"
|
||||||
DUCKDB_VERSION="1.5.2"
|
DUCKDB_VERSION="1.5.2"
|
||||||
if ! command -v duckdb &>/dev/null; then
|
if ! command -v duckdb &>/dev/null; then
|
||||||
curl -fsSL "https://github.com/duckdb/duckdb/releases/download/v${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip" -o /tmp/duckdb.zip
|
curl -fsSL "https://github.com/duckdb/duckdb/releases/download/v$${DUCKDB_VERSION}/duckdb_cli-linux-amd64.zip" -o /tmp/duckdb.zip
|
||||||
cd /tmp && unzip -o duckdb.zip && sudo mv duckdb /usr/local/bin/ && cd -
|
cd /tmp && unzip -o duckdb.zip && sudo mv duckdb /usr/local/bin/ && cd -
|
||||||
fi
|
fi
|
||||||
duckdb -c "SELECT 'DuckDB OK';"
|
duckdb -c "SELECT 'DuckDB OK';"
|
||||||
|
|
@ -151,7 +153,7 @@ if [ -n "$REPO_URL" ]; then
|
||||||
cd /home/ec2-user/everytab
|
cd /home/ec2-user/everytab
|
||||||
|
|
||||||
echo "--- Building Go binaries ---"
|
echo "--- Building Go binaries ---"
|
||||||
sudo -u ec2-user bash -c 'export PATH=$PATH:/usr/local/go/bin && cd ~/everytab && go build -o ~/warc_parse ./pipeline/02_warc_parse/ && go build -o ~/icon_download ./pipeline/03_icon_download/ && go build -o ~/bundle_gen ./pipeline/05_bundle_gen/'
|
sudo -u ec2-user bash -c 'export PATH=/usr/local/go/bin:$PATH && cd /home/ec2-user/everytab && go build -o /home/ec2-user/warc_parse ./pipeline/02_warc_parse/ && go build -o /home/ec2-user/icon_download ./pipeline/03_icon_download/ && go build -o /home/ec2-user/bundle_gen ./pipeline/05_bundle_gen/'
|
||||||
|
|
||||||
# Wait for DB to be ready, then apply schema
|
# Wait for DB to be ready, then apply schema
|
||||||
echo "--- Waiting for database ---"
|
echo "--- Waiting for database ---"
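The wait logic itself falls outside this hunk. As an illustration only, a bounded retry loop is one common way to implement this kind of wait. The name `wait_until`, the attempt count, the delay, and the `DB_HOST` variable are all assumptions, not taken from the script.

```bash
# Hypothetical helper: retry a readiness check until it succeeds
# or the attempts run out.
# usage: wait_until <max_attempts> <sleep_seconds> <command...>
wait_until() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # Running the check inside "if" keeps "set -e" from aborting on failure.
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Possible use (DB_HOST is a placeholder):
#   wait_until 60 5 pg_isready -h "$DB_HOST" || { echo "DB not ready"; exit 1; }
```

Capping the attempts means a database that never comes up produces a clear failure in the cloud-init log instead of a boot script that hangs.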
|
||||||
|
|
|
||||||
|
|
@ -218,6 +218,8 @@ resource "aws_s3_bucket_lifecycle_configuration" "logs" {
|
||||||
id = "expire-old-logs"
|
id = "expire-old-logs"
|
||||||
status = "Enabled"
|
status = "Enabled"
|
||||||
|
|
||||||
|
filter {}
|
||||||
|
|
||||||
expiration {
|
expiration {
|
||||||
days = 365
|
days = 365
|
||||||
}
|
}
|
||||||
|
|
|
||||||