
Build Your Own Cloud AI Dev Box

A Guide to Self-Hosting on AWS EC2

Ditch the SaaS fees and run your own powerful AI development environment on AWS. Get the convenience of cloud AI tools like Replit Ghostwriter or Cursor AI, but with full control, local models, and optimized costs.

This guide details how to set up a personal, cost-effective AI development server using AWS EC2, Ollama for running models, Code-Server for a browser-based VS Code experience, and Nginx for secure access.

Why Self-Host Your AI Assistant?

  • Control & Customization: Run any model compatible with Ollama or LocalGPT. Tweak settings, manage your data privately, and choose your hardware.
  • Cost Savings: Avoid recurring monthly SaaS fees. Pay only for the AWS resources you consume, significantly reduced by using Spot Instances and automated shutdowns during idle times.
  • Performance: Leverage powerful AWS GPU instances (like the G5 family) for much faster AI model inference compared to typical local machines.
  • Learning Opportunity: Gain valuable hands-on experience with cloud infrastructure (EC2, IAM, Nginx, Docker) and AI model deployment practices.

Choosing Your AWS Hardware: EC2 Instance Guide

The Importance of GPUs (Especially VRAM)

Running Large Language Models (LLMs) effectively is often limited by the amount of Video RAM (VRAM) available on the GPU. Model size (billions of parameters) directly impacts VRAM usage. Techniques like quantization (reducing the numerical precision of model weights) can significantly lower VRAM needs, but overly aggressive quantization might reduce model quality.

For developer tasks (code generation, complex instructions), maintaining model quality is key. Aim for enough VRAM to run your desired models at 4-bit or 5-bit quantization (e.g., a Q4_K_M GGUF), which offers a good balance of quality and memory footprint.

Estimated VRAM Needs (Quantized Models):

  • ~7-13B models (e.g., Llama 3 8B, Mistral 7B, DeepSeek Coder 6.7B, Phi-3-mini): ~4-12 GB (Q4)
  • Mixtral 8x7B: ~27-30 GB (Q4), ~24GB (3.5bpw EXL2)
  • Llama 3 70B: ~40-50 GB (Q4/Q5), ~75GB (Q8), ~140GB+ (FP16)

Note: Running large models like Llama 3 70B on GPUs with insufficient VRAM (e.g., 24GB) requires offloading layers to system RAM, which drastically slows down inference speed.
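These ranges follow from simple arithmetic: weight memory scales linearly with parameter count and bits per weight. The sketch below is a back-of-envelope estimate only (the ~20% overhead factor for KV cache and runtime buffers is an assumption that varies with context length), not a substitute for real benchmarks:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights = params * bits / 8 bytes, plus an
    assumed ~20% overhead for KV cache and runtime buffers."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

# Llama 3 8B at ~4.5 bits/weight (Q4_K_M-ish): roughly 5-6 GB
print(f"8B Q4:  {estimate_vram_gb(8, 4.5):.1f} GB")
# Llama 3 70B at ~4.5 bits/weight: consistent with the ~40-50 GB range above
print(f"70B Q4: {estimate_vram_gb(70, 4.5):.1f} GB")
```

This is why a single 24GB A10G comfortably fits quantized ~7-13B models but not a 70B model without offloading.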

G5 Instances Recommended

AWS EC2 G5 instances, featuring NVIDIA A10G GPUs (24GB VRAM each), are generally the best choice for this workload. They offer significantly better ML inference performance and price/performance compared to the older G4dn instances (NVIDIA T4, 16GB VRAM). The extra VRAM per GPU is also a major advantage.

Recommendation: Prioritize G5 instances. Use G4dn only if budget is extremely tight and only small models are needed.

Recommended Instance Types (G5 Family)

  • g5.xlarge (1 x A10G, 24GB VRAM; 4 vCPUs, 16 GiB RAM): Good start for models up to ~13B, potentially Mixtral (low quant/offload). Best cost/capability balance for smaller models.
  • g5.2xlarge (1 x A10G, 24GB VRAM; 8 vCPUs, 32 GiB RAM): Same VRAM, more CPU/RAM for smoother multitasking or minor offloading.
  • g5.4xlarge (1 x A10G, 24GB VRAM; 16 vCPUs, 64 GiB RAM): More system RAM allows more significant layer offloading for larger models (e.g., Mixtral Q4, Llama 70B Q2/Q3), but expect a performance hit.
  • g5.12xlarge (4 x A10G, 96GB VRAM total; 48 vCPUs, 192 GiB RAM): Recommended for running Llama 3 70B (Q4/Q5) efficiently without significant offloading. Also handles Mixtral easily.
  • g5.24xlarge (4 x A10G, 96GB VRAM total; 96 vCPUs, 384 GiB RAM): More CPU/RAM for demanding workloads or multiple users/models.
  • g5.48xlarge (8 x A10G, 192GB VRAM total; 192 vCPUs, 768 GiB RAM): High-end option for multiple large models, less quantization, or very large context windows.

Spot vs. On-Demand Instances

Spot Instances offer huge savings (often 70-90% off On-Demand) by using spare AWS capacity. The catch is AWS can reclaim the instance with a 2-minute warning. For a personal dev server with auto-shutdown, Spot is highly recommended for cost savings.

On-Demand Instances provide guaranteed availability at a fixed hourly rate, suitable if interruptions are unacceptable.

Step-by-Step Setup Guide

1. Prerequisites

  • Active AWS Account with necessary permissions (EC2, IAM, etc.).
  • (Optional but Recommended) Registered domain name for easy access and SSL.
  • Launched EC2 Instance (choose from recommendations above).

2. Foundation: OS and Drivers

Highly Recommended: Start with an AWS Deep Learning AMI (DLAMI) based on Ubuntu (e.g., 22.04 or 24.04). These AMIs come pre-installed with compatible NVIDIA drivers, CUDA toolkit, cuDNN, and often Docker, saving significant setup effort.

If not using a DLAMI, you'll need to manually install NVIDIA drivers and the correct CUDA toolkit version compatible with Ollama.

3. Core Tools Installation

Ensure these are installed (DLAMIs often include them):

  • Docker & Docker Compose: For containerizing Ollama and Code-Server. Follow official Docker installation guides if needed.
  • NVIDIA Container Toolkit: Allows Docker containers to access the GPU. Installation involves adding NVIDIA's repository and installing the `nvidia-container-toolkit` package, then restarting Docker.
# Example NVIDIA Container Toolkit Install (Verify official docs)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access in Docker
sudo docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi

4. Deployment with Docker Compose

Create project directories (e.g., `~/ai-server/ollama_data`, `~/ai-server/code-server_data`) and a `docker-compose.yml` file within `~/ai-server`:

# ~/ai-server/docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "127.0.0.1:11434:11434" # IMPORTANT: Bind only to localhost for security via Nginx
    volumes:
      - ./ollama_data:/root/.ollama # Persist models and data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # Use all available GPUs
              capabilities: [gpu]
    restart: unless-stopped # Auto-start on reboot/crash

  code-server:
    image: codercom/code-server:latest
    container_name: code-server
    user: "1000:1000" # Your host UID:GID (run `id -u`/`id -g`); Compose does not expand $(...)
    ports:
      - "127.0.0.1:8080:8080" # IMPORTANT: Bind only to localhost
    volumes:
      - ./code-server_data:/home/coder/.local/share/code-server # Persist settings/extensions
      - $HOME:/home/coder/project # Mount home directory into Code-Server
    environment:
      - PASSWORD=YourStrongPasswordHere # CHANGE THIS!
    restart: unless-stopped # Auto-start on reboot/crash


Security Alert: Change YourStrongPasswordHere to a very strong, unique password in the docker-compose.yml file.

Start the services:

cd ~/ai-server && sudo docker compose up -d

5. Pull Ollama Models

Download the models you want to use:

sudo docker exec ollama ollama pull llama3:8b
sudo docker exec ollama ollama pull mistral
sudo docker exec ollama ollama pull deepseek-coder:6.7b
# Add other models as needed (e.g., phi-3, starcoder, etc.)
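Once a model is pulled, anything on the box can talk to Ollama over its local HTTP API on port 11434. The sketch below uses only the standard library against Ollama's documented /api/generate endpoint; the model name and prompt are just examples, and the request only succeeds while the Ollama container is running:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(prompt: str, model: str = "llama3:8b",
             base_url: str = "http://127.0.0.1:11434") -> str:
    """One-shot completion request; requires the Ollama container to be up."""
    payload = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (with the server running):
#   print(generate("Write a one-line docstring for a retry decorator."))
```

Because the compose file binds Ollama to 127.0.0.1 only, this works from the instance itself (or through the Nginx proxy set up later), not from the open internet.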

Secure Access with Nginx & SSL

Why Nginx?

Nginx acts as a reverse proxy, providing a single, secure entry point:

  • Handles HTTPS (SSL/TLS encryption).
  • Hides the direct ports of Ollama and Code-Server.
  • Allows adding security headers centrally.
  • Exposes services via your domain name (or IP).

Nginx Installation & Configuration

Install Nginx:

sudo apt update && sudo apt install nginx

Create an Nginx site configuration file (e.g., /etc/nginx/sites-available/ai-server.conf). Replace your_domain.com with your actual domain or the EC2 instance's public IP address if not using a domain.

# /etc/nginx/sites-available/ai-server.conf

# Redirect HTTP to HTTPS (Certbot often handles this better)
server {
    listen 80;
    listen [::]:80;
    server_name your_domain.com; # CHANGE THIS

    location /.well-known/acme-challenge/ { # For Certbot validation
        root /var/www/html;
    }

    location / {
        return 301 https://$host$request_uri;
    }
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name your_domain.com; # CHANGE THIS

    # SSL Configuration - Managed by Certbot (paths added automatically)
    ssl_certificate /etc/letsencrypt/live/your_domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your_domain.com/privkey.pem;

    # Basic Security Headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" always;
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;
    add_header X-XSS-Protection "0" always; # Rely on CSP if implemented

    # Hide Nginx version
    server_tokens off;

    # Proxy pass to Code-Server (at root path '/')
    location / {
        proxy_pass http://127.0.0.1:8080/; # Forward to Code-Server
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support - CRITICAL for Code-Server
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    # Proxy pass to Ollama API (at /ollama/ path)
    location /ollama/ {
        proxy_pass http://127.0.0.1:11434/; # Forward to Ollama
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_buffering off;     # Stream tokens as they are generated
        proxy_read_timeout 300s; # Allow long-running generations
    }

    # Optional: Deny access to hidden files
    location ~ /\. {
        deny all;
    }
}

Enable the site, test config, and reload Nginx:

sudo ln -sfn /etc/nginx/sites-available/ai-server.conf /etc/nginx/sites-enabled/ai-server.conf
sudo nginx -t
sudo systemctl reload nginx

SSL Certificates with Let's Encrypt (Certbot)

If using a domain name, get a free SSL certificate:

  1. Install Certbot (snap method recommended):
    sudo snap install core; sudo snap refresh core
    sudo snap install --classic certbot
    sudo ln -s /snap/bin/certbot /usr/bin/certbot
  2. Ensure your domain's DNS A record points to the EC2 instance's public IP.
  3. Allow HTTPS traffic (port 443) in the EC2 Security Group.
  4. Run Certbot:
    sudo certbot --nginx -d your_domain.com
    (Replace `your_domain.com`). Follow the prompts. Certbot will obtain the certificate and automatically update your Nginx configuration.
  5. Verify auto-renewal: sudo certbot renew --dry-run.

Ensuring Reliability & Managing Costs

Auto-Start Services

The restart: unless-stopped policy in the `docker-compose.yml` file ensures that both Ollama and Code-Server containers will automatically restart if they crash or after the EC2 instance reboots. No further action is needed if using the provided Docker Compose setup.

(If installed manually without Docker, you would need to create and enable `systemd` service files for Ollama and Code-Server).

Automated Shutdown (Cost Saving)

Recommended Method: AWS Lambda + EventBridge Scheduler

This is the most secure and reliable way to automatically stop your EC2 instance during off-peak hours (e.g., 2 AM daily) to save costs.

  1. Create IAM Policy & Role: Create an IAM policy granting `ec2:StopInstances` permission specifically for your EC2 instance ARN, plus basic CloudWatch Logs permissions. Create an IAM role for Lambda execution and attach this policy.
  2. Create Lambda Function:
    • Go to the AWS Lambda console, create a new function (Author from scratch, Python runtime).
    • Assign the IAM role created above.
    • Use the following Python code (replace placeholders):
    # lambda_function.py
    import boto3
    
    # Define your instance ID and region
    REGION = 'us-east-1' # CHANGE if needed
    INSTANCE_IDS = ['i-xxxxxxxxxxxxxxxxx'] # *** REPLACE with your actual EC2 instance ID ***
    
    ec2 = boto3.client('ec2', region_name=REGION)
    
    def lambda_handler(event, context):
        if not INSTANCE_IDS:
            print("Error: No instance IDs specified.")
            return {'statusCode': 400, 'body': 'Instance ID not set'}
    
        try:
            print(f"Attempting to stop instances: {INSTANCE_IDS} in region {REGION}")
            ec2.stop_instances(InstanceIds=INSTANCE_IDS)
            print(f"Successfully initiated stop for instances: {INSTANCE_IDS}")
            return {'statusCode': 200, 'body': f'Stop initiated for {INSTANCE_IDS}'}
        except Exception as e:
            print(f"Error stopping instances: {e}")
            return {'statusCode': 500, 'body': f'Error: {str(e)}'}
                    
    • Deploy the function.
  3. Create EventBridge Schedule:
    • Go to the Amazon EventBridge console, select "Schedules", "Create schedule".
    • Define a schedule using a cron expression (remember it's in UTC). Example for 2:00 AM UTC daily: cron(0 2 ? * * *). Adjust for your desired local time shutdown.
    • Set the target to the Lambda function created above.
    • Create the schedule.
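The policy from step 1 might look like the following sketch; the region, account ID, and instance ID are placeholders you must replace with your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:StopInstances",
      "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/i-xxxxxxxxxxxxxxxxx"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```

Scoping the `Resource` to a single instance ARN means the Lambda role cannot stop anything else in the account.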

Alternative (Less Recommended): A cron job on the EC2 instance itself calling the AWS CLI. This works, but it requires attaching an IAM instance profile with stop permissions to the instance, which widens its attack surface.
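Since EventBridge evaluates cron expressions in UTC, you need to shift your intended local shutdown hour by your timezone's offset. A small illustrative helper (remember that daylight saving time changes the offset twice a year):

```python
def shutdown_cron(local_hour: int, utc_offset_hours: int) -> str:
    """Build an EventBridge cron() expression for a daily shutdown at
    local_hour, given the timezone's UTC offset (e.g., -5 for US Eastern
    standard time)."""
    utc_hour = (local_hour - utc_offset_hours) % 24
    return f"cron(0 {utc_hour} ? * * *)"

# 2 AM US Eastern (UTC-5) is 7 AM UTC
print(shutdown_cron(2, -5))  # cron(0 7 ? * * *)
```

EventBridge's six-field format requires `?` in either the day-of-month or day-of-week position, which is why the expression differs slightly from classic Unix cron.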

Understanding the Costs

Main Cost Components

  • EC2 Instance Runtime: Largest cost, depends on instance type (G5), pricing model (Spot recommended), and hours running. Auto-shutdown drastically reduces this.
  • EBS Storage: Cost for the disk volume (recommend `gp3`). Charged per GB-month (e.g., ~$0.08/GB-month for gp3 in us-east-1). A 250GB volume costs ~$20/month.
  • Data Transfer OUT: First 100GB/month OUT to the internet is free. Additional data costs ~$0.09/GB. Usually negligible for personal use. Data IN is free.

Monthly Cost Estimate Sheet (Example: us-east-1, Linux)

Assumptions: 250GB gp3 EBS volume (~$20/month), <100GB data transfer ($0). Spot prices estimated ~70% off On-Demand (actual prices vary).

  • Budget Small Models: g4dn.xlarge, Spot, 220 hrs/month (auto-shutdown). EC2 ~$35 + EBS $20 = ~$55/month
  • Recommended Start: g5.xlarge, Spot, 220 hrs/month (auto-shutdown). EC2 ~$66 + EBS $20 = ~$86/month
  • Recommended Start (On-Demand): g5.xlarge, On-Demand, 220 hrs/month (auto-shutdown). EC2 ~$221 + EBS $20 = ~$241/month
  • Recommended 70B Models: g5.12xlarge, Spot, 220 hrs/month (auto-shutdown). EC2 ~$374 + EBS $20 = ~$394/month
  • Baseline (No Shutdown): g5.xlarge, Spot, 720 hrs/month (24/7). EC2 ~$216 + EBS $20 = ~$236/month
  • Baseline (No Shutdown, On-Demand): g5.xlarge, On-Demand, 720 hrs/month (24/7). EC2 ~$724 + EBS $20 = ~$744/month

Key Takeaway: Combining Spot Instances with Automated Shutdown provides massive cost savings compared to running On-Demand 24/7.
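The estimates above are simple arithmetic: hours times the hourly rate, plus storage. The sketch below reproduces them; the On-Demand rates are illustrative us-east-1 figures (check current AWS pricing before relying on them), and Spot is modeled at a flat 70% discount, which in reality fluctuates:

```python
ON_DEMAND_HOURLY = {"g5.xlarge": 1.006, "g5.12xlarge": 5.672}  # assumed us-east-1 rates
EBS_GB_MONTH = 0.08  # gp3 in us-east-1, per GB-month

def monthly_cost(instance: str, hours: float, spot: bool = True,
                 ebs_gb: int = 250) -> float:
    """EC2 runtime + EBS storage; ignores data transfer (<100GB/month is free)."""
    rate = ON_DEMAND_HOURLY[instance]
    if spot:
        rate *= 0.30  # Spot modeled at ~70% off On-Demand (assumption)
    return hours * rate + ebs_gb * EBS_GB_MONTH

print(f"g5.xlarge Spot, 220h:      ${monthly_cost('g5.xlarge', 220):.0f}")
print(f"g5.xlarge On-Demand, 220h: ${monthly_cost('g5.xlarge', 220, spot=False):.0f}")
```

Plugging in different hours makes the takeaway concrete: cutting runtime from 720 to 220 hours per month cuts the EC2 line item by roughly two thirds before the Spot discount even applies.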

Conclusion & Next Steps

By following this guide, you can create a powerful, customizable, and cost-effective AI development environment on AWS EC2. You gain control over your tools and data, leverage potent cloud hardware, and avoid recurring SaaS fees.

Next Steps to Explore:

  • Experiment with different Ollama models and quantization levels.
  • Set up monitoring for your instance (CloudWatch).
  • Explore advanced Nginx features (e.g., basic auth for the Ollama endpoint).
  • Integrate LocalGPT alongside Ollama if you need to interact with local documents.
  • Automate the entire setup using Infrastructure as Code tools like Terraform or AWS CloudFormation.
