
Run Any LLM with Ollama in Secure Sandboxes

Tutorials · Amin Al Ali Al Darwish · 13 min read

Want to run LLMs like Llama 3.3, Mixtral, or CodeLlama without sending data to third-party APIs? This guide shows you how to deploy Ollama in isolated HopX sandboxes—giving you the privacy of self-hosting with the simplicity of a managed service.

What you'll learn:

  • Deploy any Ollama model with ~100ms cold starts
  • Save up to 78% compared to pay-per-token APIs
  • Keep sensitive data in hardware-isolated environments
  • Scale from 1 to 1,000 sandboxes with the same code

Why Traditional LLM Deployment Costs You More Than Money

Running AI models in production introduces three critical problems:

Problem 1: Security Risks You Can't Afford

Container-based deployments share a host kernel. One escape path compromises your entire infrastructure. If your application handles sensitive data—medical records, financial transactions, or proprietary code—this shared-kernel architecture creates unacceptable risk.

Problem 2: Cold Starts Kill User Experience

Traditional containers take 10+ seconds to start. Every request waits while resources spin up. Your users see loading screens. Your AI agents sit idle. Productivity drops while infrastructure catches up.

Problem 3: Unpredictable Costs Drain Budgets

Cloud API pricing varies with token count. One complex query can cost 10x more than expected. Monthly bills fluctuate wildly. You can't forecast spending or optimize costs when pricing depends on factors outside your control.


The Solution: Ollama on Isolated Micro-VMs

Combining Ollama with HopX sandboxes solves these problems through a different architecture. Each model runs in its own micro-VM with dedicated kernel, file system, and network stack.

What changes:

  • Security: Each sandbox has its own kernel. No shared resources.
  • Speed: Sandboxes start in ~100ms from pre-built snapshots.
  • Cost: Pay per second of actual compute usage. Pause when idle.
  • Privacy: Your data never leaves your infrastructure.

Real Performance Numbers

| Metric | Traditional Containers | HopX Micro-VMs |
| --- | --- | --- |
| Cold start time | 10-15 seconds | ~100 milliseconds |
| Kernel isolation | Shared kernel | Dedicated kernel per VM |
| Runtime limits | 15 minutes (typical) | Hours to days |
| Startup cost | Fixed per invocation | $0.000014/vCPU-second |
| Data residency | Provider-dependent | Your choice of region |

For high-volume workloads, self-managed micro-VMs can reduce costs by up to 78% compared to pay-per-token APIs.


Prerequisites

Before starting, you need:

  • A HopX account (sign up here for $200 in free credits)
  • Python 3.11+
  • Your HOPX_API_KEY from the dashboard

Set up your environment:

  1. Sign up at console.hopx.ai
  2. Get your API key from the dashboard
  3. Set the environment variable:
bash
export HOPX_API_KEY="your-api-key-here"

Step 1: Install Dependencies

bash
pip install hopx-ai python-dotenv  # asyncio ships with Python, no need to install it

Step 2: Configure Environment

python
import os
import time
import asyncio
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API key is set
api_key = os.getenv("HOPX_API_KEY")
if not api_key:
    print("⚠️  HOPX_API_KEY not found in environment")
    print("Please set it: export HOPX_API_KEY=your-key-here")
else:
    print("✓ API key configured")

Step 3: Create Ollama Template

Templates define your sandbox environment. This template:

  • Starts with Python 3.13 base image
  • Installs Ollama
  • Pre-downloads your chosen model
  • Configures the environment for production use
python
from hopx_ai import Template
from hopx_ai.template.types import BuildOptions, BuildResult

# Configuration
OLLAMA_MODEL = "llama3.3"  # Change this to your preferred model
TEMPLATE_NAME = f"ollama-production-{int(time.time())}"

def create_ollama_template() -> Template:
    """Create a production-ready Ollama template."""
    return (
        Template()
        .from_python_image("3.13")
        .run_cmd("mkdir -p /workspace")
        .set_env("LANG", "en_US.UTF-8")
        .set_env("PYTHONUNBUFFERED", "1")
        .set_env("HOME", "/workspace")
        .run_cmd("curl -fsSL https://ollama.com/install.sh | sh")
        .run_cmd(f"/usr/local/bin/ollama pull {OLLAMA_MODEL}")
        .set_workdir("/workspace")
    )

def create_build_options(api_key: str) -> BuildOptions:
    """Configure build options for the template."""
    return BuildOptions(
        name=TEMPLATE_NAME,
        api_key=api_key,
        cpu=2,
        memory=2048,  # MB
        disk_gb=20,
        on_log=lambda log: print(f"[{log.get('level')}] {log.get('message')}"),
        on_progress=lambda p: print(f"Build progress: {p}%"),
    )

async def build_template() -> BuildResult:
    """Build the Ollama template."""
    template = create_ollama_template()
    options = create_build_options(os.getenv("HOPX_API_KEY"))
    print(f"Building template: {TEMPLATE_NAME}")
    return await Template.build(template, options)

print("✓ Template configuration ready")

Step 4: Build and Deploy Your First Sandbox

This step builds the template and creates a sandbox. Note: Building takes ~2 minutes the first time.

python
from hopx_ai import Sandbox

async def deploy_ollama_sandbox():
    """Deploy an Ollama sandbox."""
    # Build the template (do this once)
    print("Building template... (this takes ~2 minutes)")
    result = await build_template()
    print(f"✓ Template ready: {result.template_id}")

    # Create sandbox from template
    print("Creating sandbox...")
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY")
    )
    print(f"✓ Sandbox created: {sandbox.sandbox_id}")

    # Test with a simple prompt
    print("\nTesting model...")
    response = sandbox.commands.run(
        f"/usr/local/bin/ollama run {OLLAMA_MODEL} 'Explain quantum computing in one sentence'",
        timeout=240
    )

    print(f"\nModel response:\n{response.stdout}")

    return sandbox

# Run the deployment (in a notebook you can simply `await deploy_ollama_sandbox()`)
sandbox = asyncio.run(deploy_ollama_sandbox())

Step 5: Persist and Reconnect to Sandboxes

Creating new sandboxes every time wastes resources. Save the sandbox ID and reconnect:

python
async def get_or_create_sandbox() -> Sandbox:
    """Get existing sandbox or create new one."""
    sandbox_file = ".hopx_sandbox_id"

    if os.path.exists(sandbox_file):
        with open(sandbox_file, "r") as f:
            sandbox_id = f.read().strip()

        try:
            sandbox = Sandbox.connect(
                sandbox_id,
                api_key=os.getenv("HOPX_API_KEY")
            )
            print(f"✓ Reconnected to sandbox: {sandbox_id}")
            return sandbox
        except Exception as e:
            print(f"Could not reconnect: {e}")
            print("Creating new sandbox...")

    # Build and create new sandbox
    template_result = await build_template()
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY")
    )

    with open(sandbox_file, "w") as f:
        f.write(sandbox.sandbox_id)
    print(f"✓ Created new sandbox: {sandbox.sandbox_id}")

    return sandbox

Choose the Right Ollama Model

For Speed and Efficiency

  • smollm (135M-1.7B): Minimal resources, great for testing
  • phi-3 (3.8B): Fast inference, good for classification
  • qwen2 (7B): Strong multilingual support

For Quality and Reasoning

  • llama3.3 (70B): Advanced reasoning and coding
  • mixtral (47B): Mixture-of-experts for specialized tasks
  • deepseek-r1 (70B): Advanced reasoning and problem-solving

For Code Generation

  • codellama (7B-34B): Optimized for programming
  • codegemma (7B): Google's code-focused model

Resource Requirements

  • 2 vCPU, 2GB RAM: Models up to 3B parameters
  • 4 vCPU, 8GB RAM: Models up to 13B parameters
  • 8 vCPU, 48GB+ RAM: Models up to 70B parameters (4-bit quantized); the sketch below maps these tiers to build options
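
As a rough guide, these tiers can be encoded in a small helper that picks build resources by model size. This is a sketch, not part of the HopX SDK: the thresholds come from the list above, and BuildOptions is assumed to take the same cpu/memory/disk_gb fields used in Step 3.

python
import os
from hopx_ai.template.types import BuildOptions

def options_for_model(param_billions: float, api_key: str, name: str) -> BuildOptions:
    """Pick sandbox resources from the tiers above (adjust to taste)."""
    if param_billions <= 3:
        cpu, memory_mb, disk = 2, 2048, 20
    elif param_billions <= 13:
        cpu, memory_mb, disk = 4, 8192, 30
    else:
        # 70B-class models are large; 4-bit quantized weights alone are tens of GB
        cpu, memory_mb, disk = 8, 49152, 80
    return BuildOptions(name=name, api_key=api_key, cpu=cpu, memory=memory_mb, disk_gb=disk)

# Example: a 7B model gets the 4 vCPU / 8GB tier
opts = options_for_model(7, api_key=os.getenv("HOPX_API_KEY"), name="ollama-7b")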

Cost Calculator: What You Actually Pay

HopX charges per second:

  • Compute: $0.000014 per vCPU-second
  • Memory: $0.0000045 per GiB-second
  • Storage: $0.00000003 per GiB-second
python
def calculate_cost(vcpu: int, memory_gb: int, storage_gb: int, hours: float) -> dict:
    """Calculate HopX sandbox costs. `hours` is active usage per day, so 'daily' equals 'total'."""
    seconds = hours * 3600

    compute_cost = vcpu * seconds * 0.000014
    memory_cost = memory_gb * seconds * 0.0000045
    storage_cost = storage_gb * seconds * 0.00000003

    total = compute_cost + memory_cost + storage_cost

    return {
        "compute": round(compute_cost, 4),
        "memory": round(memory_cost, 4),
        "storage": round(storage_cost, 4),
        "total": round(total, 4),
        "daily": round(total, 4),
        "monthly": round(total * 30, 2)
    }

# Example 1: Development Testing
print("Example 1: Development Testing (7B model, 30 min/day)")
dev_cost = calculate_cost(vcpu=2, memory_gb=4, storage_gb=20, hours=0.5)
print(f"  Daily cost: ${dev_cost['daily']}")
print(f"  Monthly cost: ${dev_cost['monthly']}")

# Example 2: Production AI Agent
print("Example 2: Production AI Agent (13B model, 8 hours/day)")
prod_cost = calculate_cost(vcpu=4, memory_gb=8, storage_gb=30, hours=8)
print(f"  Daily cost: ${prod_cost['daily']}")
print(f"  Monthly cost: ${prod_cost['monthly']}")

# Example 3: 24/7 Service
print("Example 3: High-Volume API (10 sandboxes, 24/7)")
service_cost = calculate_cost(vcpu=2, memory_gb=4, storage_gb=20, hours=24)
print(f"  Per sandbox daily: ${service_cost['daily']}")
print(f"  10 sandboxes monthly: ${service_cost['monthly'] * 10}")

Cost Optimization Patterns

python
# Pattern 1: Pause When Idle
# Paused sandboxes cost 10x less

def pause_sandbox_when_idle(sandbox: Sandbox):
    """Pause sandbox to reduce costs."""
    sandbox.pause()  # Preserves state, reduces costs
    print("Sandbox paused. Resume with sandbox.resume()")

# Pattern 2: Delete Completed Work

async def run_and_cleanup(sandbox: Sandbox, task: str):
    """Run task and clean up."""
    try:
        result = sandbox.commands.run(task)
        return result
    finally:
        sandbox.delete()  # Stop all charges

# Pattern 3: Choose Model by Complexity

def choose_model(complexity_score: float) -> str:
    """Choose model based on task complexity."""
    if complexity_score < 0.5:
        return "phi3"  # Fast, cheap (Ollama library tag)
    elif complexity_score < 0.8:
        return "llama3.3"  # Balanced
    else:
        return "mixtral"  # Heavy reasoning

# Pattern 4: Batch Requests

async def batch_process(sandbox: Sandbox, prompts: list[str], model: str):
    """Process multiple prompts in one session."""
    results = []
    for prompt in prompts:
        result = sandbox.commands.run(f"ollama run {model} '{prompt}'")
        results.append(result.stdout)
    # Delete sandbox after batch completes
    sandbox.delete()
    return results

Security Best Practices

Why Isolation Matters

Each HopX sandbox has:

  • Dedicated kernel: No shared kernel vulnerabilities
  • Isolated file system: No cross-sandbox file access
  • Separate network stack: Network policies per sandbox
  • Process tree isolation: Processes can't see other sandboxes

This matters for:

  • Healthcare: HIPAA-compliant patient data
  • Finance: PCI DSS requirements
  • Legal: Privileged document analysis
  • Enterprise: Proprietary code and trade secrets
python
# Handle Secrets Securely

def create_secure_sandbox():
    """Create sandbox with secure environment variables."""
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY"),
        env_vars={
            "DATABASE_URL": os.getenv("DATABASE_URL"),
            "API_SECRET": os.getenv("API_SECRET")
        }
    )
    return sandbox

# Choose Data Region

def create_regional_sandbox(region: str = "us-east"):
    """Create sandbox in specific region."""
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY"),
        region=region  # "us-east" or "eu-west"
    )
    return sandbox

Production Pattern: Long-Running AI Agent

python
async def run_ai_agent():
    """Run a long-running AI agent."""
    sandbox = await get_or_create_sandbox()

    # Agent runs continuously
    while True:
        # Get next task (implement your task queue here)
        task = get_next_task()  # Your implementation

        result = sandbox.commands.run(
            f"ollama run llama3.3 '{task.prompt}'",
            timeout=300
        )

        process_result(result.stdout)  # Your implementation

        # Check if we should continue
        if should_stop():  # Your implementation
            break

    # Pause instead of delete to preserve state
    sandbox.pause()

Production Pattern: Multi-Tenant Application

python
tenant_sandboxes = {}

def get_tenant_sandbox(tenant_id: str) -> Sandbox:
    """Get or create isolated sandbox for tenant."""
    if tenant_id not in tenant_sandboxes:
        sandbox = Sandbox.create(
            template=TEMPLATE_NAME,
            api_key=os.getenv("HOPX_API_KEY")
        )
        tenant_sandboxes[tenant_id] = sandbox

    return tenant_sandboxes[tenant_id]

# Example usage
tenant_a_sandbox = get_tenant_sandbox("tenant-a")
tenant_b_sandbox = get_tenant_sandbox("tenant-b")

Use Case: Private Document Analysis

python
async def analyze_documents(documents: list[str]) -> list[dict]:
    """Analyze sensitive documents privately."""
    sandbox = await get_or_create_sandbox()

    results = []
    for doc in documents:
        # Upload document to sandbox
        sandbox.files.write("/workspace/document.txt", doc)

        # Analyze with Ollama. `ollama run` does not read file paths itself,
        # so inject the file contents into the prompt (assumes commands run through a shell).
        response = sandbox.commands.run(
            'ollama run llama3.3 "Summarize this document: $(cat /workspace/document.txt)"',
            timeout=180
        )

        results.append({
            "summary": response.stdout,
            "document": doc[:100]  # First 100 chars for reference
        })

    return results

Use Case: Code Generation and Testing

python
async def generate_and_test_code(specification: str):
    """Generate and test code in isolated environment."""
    sandbox = await get_or_create_sandbox()

    # Generate code
    code_response = sandbox.commands.run(
        f"ollama run codellama 'Write Python function: {specification}'",
        timeout=120
    )

    generated_code = code_response.stdout

    # Write to file
    sandbox.files.write("/workspace/generated.py", generated_code)

    # Test the code
    test_result = sandbox.commands.run(
        "python /workspace/generated.py",
        timeout=30
    )

    return {
        "code": generated_code,
        "test_output": test_result.stdout,
        "success": test_result.exit_code == 0
    }

Monitoring and Debugging

python
def monitor_sandbox(sandbox: Sandbox):
    """Monitor sandbox resource usage."""
    info = sandbox.get_info()

    print(f"Status: {info.status}")
    print(f"CPU cores: {info.cpu}")
    print(f"Memory: {info.memory}MB")
    print(f"Disk: {info.disk_gb}GB")
    print(f"Region: {info.region}")
    print(f"Created: {info.created_at}")

# Error handling with retry

async def run_with_retry(
    sandbox: Sandbox,
    command: str,
    max_retries: int = 3
) -> str:
    """Run command with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            result = sandbox.commands.run(command, timeout=120)
            return result.stdout
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            print(f"Timeout on attempt {attempt + 1}, retrying...")
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            print(f"Error on attempt {attempt + 1}: {e}, retrying...")
            await asyncio.sleep(2 ** attempt)

Troubleshooting Guide

Issue 1: Model Not Found

Problem: Error: model 'model-name' not found

Solution: Pull the model in your template:

python
.run_cmd("/usr/local/bin/ollama pull your-model-name")

Issue 2: Out of Memory

Problem: Sandbox crashes with memory errors

Solution: Increase memory in BuildOptions:

python
BuildOptions(memory=8192)  # Instead of 2048

Issue 3: Slow Response Times

Problem: Models take too long to respond

Solution: Use smaller models or increase CPU:

python
OLLAMA_MODEL = "phi3"  # Faster model (Ollama library tag)
BuildOptions(cpu=4)  # More CPU

Issue 4: Connection Timeouts

Problem: SDK times out connecting

Solution: Increase timeout:

python
sandbox.commands.run(command, timeout=300)

Quick Start Checklist

Step 1: Sign Up

  1. Visit console.hopx.ai
  2. Create account (no credit card required)
  3. Claim $200 in free credits
  4. Copy your API key from dashboard

Step 2: Setup

bash
pip install hopx-ai
export HOPX_API_KEY=your-key-here

Step 3: Deploy

  1. Build template (~2 minutes, one time)
  2. Create sandbox (~100ms)
  3. Run any Ollama model
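
Condensed into one script, the whole flow looks roughly like this (reusing build_template, TEMPLATE_NAME, and OLLAMA_MODEL from Steps 3-4; treat it as a sketch rather than a drop-in file):

python
import os
import asyncio
from hopx_ai import Sandbox

async def quick_start():
    await build_template()  # ~2 minutes, one time
    sandbox = Sandbox.create(  # ~100ms from the pre-built snapshot
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY"),
    )
    result = sandbox.commands.run(
        f"/usr/local/bin/ollama run {OLLAMA_MODEL} 'Say hello in five words'",
        timeout=240,
    )
    print(result.stdout)
    sandbox.pause()  # stop paying full compute price while idle

asyncio.run(quick_start())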

What you get:

| Metric | Value |
| --- | --- |
| Build time | ~2 minutes (once) |
| Cold start | ~100ms |
| Runtime limit | None |
| Cost | ~$0.10/hour for 7B model |

Comparing Your Options

| Approach | Cold Start | Isolation | Cost Model | Best For |
| --- | --- | --- | --- | --- |
| Cloud APIs (OpenAI, Anthropic) | Instant | Provider-managed | Per-token | Low volume, varied tasks |
| Self-hosted VMs | Minutes | Strong | Fixed monthly | Predictable high volume |
| Containers (Docker) | 10+ seconds | Shared kernel | Fixed or per-second | Development only |
| HopX + Ollama | ~100ms | Hardware-level | Per-second usage | Variable volume, privacy needs |

Choose HopX + Ollama when you need:

  • Fast cold starts for user-facing applications
  • Strong isolation for sensitive data
  • Cost control through per-second billing
  • Freedom to switch models without vendor lock-in
  • Data privacy and regulatory compliance

Next Steps

You now have everything needed to run production LLMs in secure sandboxes.

What you learned:

  • Privacy: Data stays in environments you control
  • Speed: 100ms startup beats any container solution
  • Cost: Pay only for seconds of actual usage
  • Security: Hardware-level isolation protects sensitive workloads
  • Flexibility: Run any model, any size, any configuration

Get started:

  1. Use the $200 free credits
  2. Build your first template
  3. Test a few models
  4. See how the economics work for your use case

When ready to scale, the same code works for 10 sandboxes or 1,000.


Frequently Asked Questions

Can I use any Ollama model?

Yes. Any model in the Ollama library works—Llama 3.3, Mixtral, CodeLlama, Phi-3, DeepSeek, and more. Just change the OLLAMA_MODEL variable in your template and rebuild.
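
For example, switching the template to Mixtral is a one-line change followed by a rebuild (a sketch that assumes you are in the same session as Step 3, so build_template picks up the new values):

python
OLLAMA_MODEL = "mixtral"  # any tag from the Ollama library
TEMPLATE_NAME = f"ollama-production-{int(time.time())}"

# Rebuild so the new model is baked into the snapshot
result = asyncio.run(build_template())
print(f"New template: {result.template_id}")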

How much does it cost to run a model 24/7?

For a 7B model (2 vCPU, 4GB RAM, 20GB disk), expect around $2-3 per day. Larger models like 70B need more resources and cost proportionally more. Use sandbox.pause() when idle to reduce costs by 90%.
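
To get an exact figure for your configuration, run the calculate_cost helper from the cost calculator section with hours=24 (this mirrors Example 3 there):

python
# 7B-class model around the clock: 2 vCPU, 4GB RAM, 20GB disk
always_on = calculate_cost(vcpu=2, memory_gb=4, storage_gb=20, hours=24)
print(f"Daily: ${always_on['daily']}  Monthly: ${always_on['monthly']}")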

Is my data really private?

Yes. Each sandbox runs in its own micro-VM with dedicated kernel, filesystem, and network. Your prompts and outputs never leave the sandbox. You can also choose specific regions (US, EU) for data residency compliance.

How long can a sandbox run?

As long as you need—hours, days, or weeks. There are no 15-minute timeouts like AWS Lambda. You pay per second of runtime and can pause/resume to save costs.
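
Pausing preserves the sandbox state, so a long-lived deployment can cycle between active and idle:

python
sandbox.pause()   # billing drops while paused; state is preserved
# ...hours or days later...
sandbox.resume()  # pick up exactly where you left off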

Can I run multiple models in one sandbox?

Yes. Pull multiple models in your template, then switch between them at runtime with ollama run model-name. This is useful for routing simple queries to smaller models and complex ones to larger models.
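
A template that bakes in two models plus a tiny router might look like this (a sketch; the length-based routing rule is just an illustration):

python
# In the Step 3 template chain, pull both a small and a large model:
#   .run_cmd("/usr/local/bin/ollama pull phi3")
#   .run_cmd("/usr/local/bin/ollama pull llama3.3")

def run_prompt(sandbox, prompt: str, timeout: int = 240) -> str:
    """Route short prompts to the small model, long ones to the large model."""
    model = "phi3" if len(prompt) < 200 else "llama3.3"
    result = sandbox.commands.run(f"ollama run {model} '{prompt}'", timeout=timeout)
    return result.stdout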

What if my model is too slow?

Three options: (1) Use a smaller, faster model like Phi-3, (2) Increase vCPU count in BuildOptions, (3) Use quantized versions of models (q4 instead of full precision).
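
Option (3) is just a different tag in the template; quantized variants are published alongside the full-precision weights in the Ollama library (check ollama.com/library for the exact tag names):

python
# A 4-bit quantized variant trades a little quality for speed and memory
OLLAMA_MODEL = "llama3.3:70b-instruct-q4_K_M"  # example tag; verify in the Ollama library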


Resources

Getting Started

About HopX


Ready to run your own LLMs? Sign up for HopX and get $200 in free credits to start.