Run Any LLM with Ollama in Secure Sandboxes
Want to run LLMs like Llama 3.3, Mixtral, or CodeLlama without sending data to third-party APIs? This guide shows you how to deploy Ollama in isolated HopX sandboxes—giving you the privacy of self-hosting with the simplicity of a managed service.
What you'll learn:
- Deploy any Ollama model with ~100ms cold starts
- Save up to 78% compared to pay-per-token APIs
- Keep sensitive data in hardware-isolated environments
- Scale from 1 to 1,000 sandboxes with the same code
Why Traditional LLM Deployment Costs You More Than Money
Running AI models in production introduces three critical problems:
Problem 1: Security Risks You Can't Afford
Container-based deployments share a host kernel. One escape path compromises your entire infrastructure. If your application handles sensitive data—medical records, financial transactions, or proprietary code—this shared-kernel architecture creates unacceptable risk.
Problem 2: Cold Starts Kill User Experience
Traditional containers take 10+ seconds to start. Every request waits while resources spin up. Your users see loading screens. Your AI agents sit idle. Productivity drops while infrastructure catches up.
Problem 3: Unpredictable Costs Drain Budgets
Cloud API pricing varies with token count. One complex query can cost 10x more than expected. Monthly bills fluctuate wildly. You can't forecast spending or optimize costs when pricing depends on factors outside your control.
The Solution: Ollama on Isolated Micro-VMs
Combining Ollama with HopX sandboxes solves these problems through a different architecture. Each model runs in its own micro-VM with dedicated kernel, file system, and network stack.
What changes:
- Security: Each sandbox has its own kernel. No shared resources.
- Speed: Sandboxes start in ~100ms from pre-built snapshots.
- Cost: Pay per second of actual compute usage. Pause when idle.
- Privacy: Your data never leaves your infrastructure.
Real Performance Numbers
| Metric | Traditional Containers | HopX Micro-VMs |
|---|---|---|
| Cold Start Time | 10-15 seconds | ~100 milliseconds |
| Kernel Isolation | Shared kernel | Dedicated kernel per VM |
| Runtime Limits | 15 minutes (typical) | Hours to days |
| Startup Cost | Fixed per invocation | $0.000014/vCPU-second |
| Data Residency | Provider-dependent | Your choice of region |
For high-volume workloads, self-managed micro-VMs can reduce costs by up to 78% compared to pay-per-token APIs.
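Actual savings depend on traffic volume, model choice, and the API you're comparing against. As a rough illustration under assumed numbers (a $0.002 per 1K-token blended API price and 250M tokens per month, both hypothetical), one always-on 2 vCPU sandbox at the per-second rates listed in the cost section below comes out roughly 75% cheaper:

```python
# Back-of-the-envelope: pay-per-token API vs. one 24/7 HopX sandbox.
# The API price and monthly token volume are illustrative assumptions, not vendor quotes.
API_PRICE_PER_1K_TOKENS = 0.002       # assumed blended input/output price (USD)
TOKENS_PER_MONTH = 250_000_000        # assumed monthly volume

api_monthly = TOKENS_PER_MONTH / 1000 * API_PRICE_PER_1K_TOKENS

# One 2 vCPU / 4 GiB / 20 GiB sandbox running 24/7 at HopX's listed per-second rates
seconds_per_month = 30 * 24 * 3600
vm_monthly = (2 * 0.000014 + 4 * 0.0000045 + 20 * 0.00000003) * seconds_per_month

savings = 1 - vm_monthly / api_monthly
print(f"API: ${api_monthly:,.0f}/mo  sandbox: ${vm_monthly:,.0f}/mo  savings: {savings:.0%}")
```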
Prerequisites
Before starting, you need:
- A HopX account (sign up at console.hopx.ai for $200 in free credits)
- Python 3.11+
- Your HOPX_API_KEY from the dashboard
Set up your environment:
- Sign up at console.hopx.ai
- Get your API key from the dashboard
- Set the environment variable:
```bash
export HOPX_API_KEY="your-api-key-here"
```
Step 1: Install Dependencies
```bash
pip install hopx-ai python-dotenv
```
Step 2: Configure Environment
```python
import os
import time
import asyncio
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify API key is set
api_key = os.getenv("HOPX_API_KEY")
if not api_key:
    print("⚠️ HOPX_API_KEY not found in environment")
    print("Please set it: export HOPX_API_KEY=your-key-here")
else:
    print("✓ API key configured")
```
Step 3: Create Ollama Template
Templates define your sandbox environment. This template:
- Starts with Python 3.13 base image
- Installs Ollama
- Pre-downloads your chosen model
- Configures the environment for production use
```python
from hopx_ai import Template
from hopx_ai.template.types import BuildOptions, BuildResult

# Configuration
OLLAMA_MODEL = "llama3.3"  # Change to your preferred model; size CPU/memory/disk per "Resource Requirements" below
TEMPLATE_NAME = f"ollama-production-{int(time.time())}"

def create_ollama_template() -> Template:
    """Create a production-ready Ollama template."""
    return (
        Template()
        .from_python_image("3.13")
        .run_cmd("mkdir -p /workspace")
        .set_env("LANG", "en_US.UTF-8")
        .set_env("PYTHONUNBUFFERED", "1")
        .set_env("HOME", "/workspace")
        .run_cmd("curl -fsSL https://ollama.com/install.sh | sh")
        .run_cmd(f"/usr/local/bin/ollama pull {OLLAMA_MODEL}")
        .set_workdir("/workspace")
    )

def create_build_options(api_key: str) -> BuildOptions:
    """Configure build options for the template."""
    return BuildOptions(
        name=TEMPLATE_NAME,
        api_key=api_key,
        cpu=2,
        memory=2048,  # MB
        disk_gb=20,
        on_log=lambda log: print(f"[{log.get('level')}] {log.get('message')}"),
        on_progress=lambda p: print(f"Build progress: {p}%"),
    )

async def build_template() -> BuildResult:
    """Build the Ollama template."""
    template = create_ollama_template()
    options = create_build_options(os.getenv("HOPX_API_KEY"))
    print(f"Building template: {TEMPLATE_NAME}")
    return await Template.build(template, options)

print("✓ Template configuration ready")
```
Step 4: Build and Deploy Your First Sandbox
This step builds the template and creates a sandbox. Note: Building takes ~2 minutes the first time.
```python
from hopx_ai import Sandbox

async def deploy_ollama_sandbox():
    """Deploy an Ollama sandbox."""
    # Build the template (do this once)
    print("Building template... (this takes ~2 minutes)")
    result = await build_template()
    print(f"✓ Template ready: {result.template_id}")

    # Create sandbox from template
    print("Creating sandbox...")
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY")
    )
    print(f"✓ Sandbox created: {sandbox.sandbox_id}")

    # Test with a simple prompt
    print("\nTesting model...")
    response = sandbox.commands.run(
        f"/usr/local/bin/ollama run {OLLAMA_MODEL} 'Explain quantum computing in one sentence'",
        timeout=240
    )

    print(f"\nModel response:\n{response.stdout}")

    return sandbox

# Run the deployment (top-level await works in notebooks; use asyncio.run(...) in a plain script)
sandbox = await deploy_ollama_sandbox()
```
Step 5: Persist and Reconnect to Sandboxes
Creating new sandboxes every time wastes resources. Save the sandbox ID and reconnect:
```python
async def get_or_create_sandbox() -> Sandbox:
    """Get existing sandbox or create new one."""
    sandbox_file = ".hopx_sandbox_id"

    if os.path.exists(sandbox_file):
        with open(sandbox_file, "r") as f:
            sandbox_id = f.read().strip()

        try:
            sandbox = Sandbox.connect(
                sandbox_id,
                api_key=os.getenv("HOPX_API_KEY")
            )
            print(f"✓ Reconnected to sandbox: {sandbox_id}")
            return sandbox
        except Exception as e:
            print(f"Could not reconnect: {e}")
            print("Creating new sandbox...")

    # Build and create new sandbox
    template_result = await build_template()
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY")
    )

    with open(sandbox_file, "w") as f:
        f.write(sandbox.sandbox_id)
    print(f"✓ Created new sandbox: {sandbox.sandbox_id}")

    return sandbox
```
Choose the Right Ollama Model
For Speed and Efficiency
- smollm (135M-1.7B): Minimal resources, great for testing
- phi-3 (3.8B): Fast inference, good for classification
- qwen2 (7B): Strong multilingual support
For Quality and Reasoning
- llama3.3 (70B): Advanced reasoning and coding
- mixtral (47B): Mixture-of-experts for specialized tasks
- deepseek-r1 (70B): Advanced reasoning and problem-solving
For Code Generation
- codellama (7B-34B): Optimized for programming
- codegemma (7B): Google's code-focused model
Resource Requirements
- 2 vCPU, 2GB RAM: Models up to 3B parameters
- 4 vCPU, 8GB RAM: Models up to 13B parameters
- 8 vCPU, 16GB RAM: Models up to 70B parameters
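If you want to pick build resources programmatically from these tiers, here is a minimal sketch. The thresholds simply mirror the list above, and the BuildOptions fields follow the ones used in Step 3; tune the numbers (especially disk) for the models you actually pull.

```python
import os
from hopx_ai.template.types import BuildOptions

def sizing_for_model(param_billions: float, name: str, api_key: str) -> BuildOptions:
    """Map an approximate parameter count onto the resource tiers above."""
    if param_billions <= 3:
        cpu, memory = 2, 2048       # up to ~3B parameters
    elif param_billions <= 13:
        cpu, memory = 4, 8192       # up to ~13B parameters
    else:
        cpu, memory = 8, 16384      # larger models
    # disk_gb left at 20 as in Step 3; large models may need more room for weights
    return BuildOptions(name=name, api_key=api_key, cpu=cpu, memory=memory, disk_gb=20)

# Example: build options sized for a 7B model
options = sizing_for_model(7, name="ollama-7b", api_key=os.getenv("HOPX_API_KEY"))
```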
Cost Calculator: What You Actually Pay
HopX charges per second:
- Compute: $0.000014 per vCPU-second
- Memory: $0.0000045 per GiB-second
- Storage: $0.00000003 per GiB-second
```python
def calculate_cost(vcpu: int, memory_gb: int, storage_gb: int, hours: float) -> dict:
    """Calculate HopX sandbox costs for `hours` of runtime per day."""
    seconds = hours * 3600

    compute_cost = vcpu * seconds * 0.000014
    memory_cost = memory_gb * seconds * 0.0000045
    storage_cost = storage_gb * seconds * 0.00000003

    total = compute_cost + memory_cost + storage_cost

    return {
        "compute": round(compute_cost, 4),
        "memory": round(memory_cost, 4),
        "storage": round(storage_cost, 4),
        "daily": round(total, 4),
        "monthly": round(total * 30, 2)
    }

# Example 1: Development Testing
print("Example 1: Development Testing (7B model, 30 min/day)")
dev_cost = calculate_cost(vcpu=2, memory_gb=4, storage_gb=20, hours=0.5)
print(f"  Daily cost: ${dev_cost['daily']}")
print(f"  Monthly cost: ${dev_cost['monthly']}")

# Example 2: Production AI Agent
print("Example 2: Production AI Agent (13B model, 8 hours/day)")
prod_cost = calculate_cost(vcpu=4, memory_gb=8, storage_gb=30, hours=8)
print(f"  Daily cost: ${prod_cost['daily']}")
print(f"  Monthly cost: ${prod_cost['monthly']}")

# Example 3: 24/7 Service
print("Example 3: High-Volume API (10 sandboxes, 24/7)")
service_cost = calculate_cost(vcpu=2, memory_gb=4, storage_gb=20, hours=24)
print(f"  Per sandbox daily: ${service_cost['daily']}")
print(f"  10 sandboxes monthly: ${service_cost['monthly'] * 10:.2f}")
```
Cost Optimization Patterns
```python
import shlex

# Pattern 1: Pause When Idle
# Paused sandboxes cost 10x less

def pause_sandbox_when_idle(sandbox: Sandbox):
    """Pause sandbox to reduce costs."""
    sandbox.pause()  # Preserves state, reduces costs
    print("Sandbox paused. Resume with sandbox.resume()")

# Pattern 2: Delete Completed Work

async def run_and_cleanup(sandbox: Sandbox, task: str):
    """Run task and clean up."""
    try:
        result = sandbox.commands.run(task)
        return result
    finally:
        sandbox.delete()  # Stop all charges

# Pattern 3: Choose Model by Complexity

def choose_model(complexity_score: float) -> str:
    """Choose model based on task complexity."""
    if complexity_score < 0.5:
        return "phi-3"      # Fast, cheap
    elif complexity_score < 0.8:
        return "llama3.3"   # Balanced
    else:
        return "mixtral"    # Heavy reasoning

# Pattern 4: Batch Requests

async def batch_process(sandbox: Sandbox, prompts: list[str], model: str):
    """Process multiple prompts in one session."""
    results = []
    for prompt in prompts:
        # Quote the prompt so apostrophes and shell metacharacters don't break the command
        result = sandbox.commands.run(f"ollama run {model} {shlex.quote(prompt)}")
        results.append(result.stdout)
    # Delete sandbox after batch completes
    sandbox.delete()
    return results
```
Security Best Practices
Why Isolation Matters
Each HopX sandbox has:
- Dedicated kernel: No shared kernel vulnerabilities
- Isolated file system: No cross-sandbox file access
- Separate network stack: Network policies per sandbox
- Process tree isolation: Processes can't see other sandboxes
This matters for:
- Healthcare: HIPAA-compliant patient data
- Finance: PCI DSS requirements
- Legal: Privileged document analysis
- Enterprise: Proprietary code and trade secrets
```python
# Handle Secrets Securely

def create_secure_sandbox():
    """Create sandbox with secure environment variables."""
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY"),
        env_vars={
            "DATABASE_URL": os.getenv("DATABASE_URL"),
            "API_SECRET": os.getenv("API_SECRET")
        }
    )
    return sandbox

# Choose Data Region

def create_regional_sandbox(region: str = "us-east"):
    """Create sandbox in specific region."""
    sandbox = Sandbox.create(
        template=TEMPLATE_NAME,
        api_key=os.getenv("HOPX_API_KEY"),
        region=region  # "us-east" or "eu-west"
    )
    return sandbox
```
Production Pattern: Long-Running AI Agent
```python
async def run_ai_agent():
    """Run a long-running AI agent."""
    sandbox = await get_or_create_sandbox()

    # Agent runs continuously
    while True:
        # Get next task (implement your task queue here)
        task = get_next_task()  # Your implementation

        result = sandbox.commands.run(
            f"ollama run llama3.3 '{task.prompt}'",
            timeout=300
        )

        process_result(result.stdout)  # Your implementation

        # Check if we should continue
        if should_stop():  # Your implementation
            break

    # Pause instead of delete to preserve state
    sandbox.pause()
```
Production Pattern: Multi-Tenant Application
```python
tenant_sandboxes = {}

def get_tenant_sandbox(tenant_id: str) -> Sandbox:
    """Get or create isolated sandbox for tenant."""
    if tenant_id not in tenant_sandboxes:
        sandbox = Sandbox.create(
            template=TEMPLATE_NAME,
            api_key=os.getenv("HOPX_API_KEY")
        )
        tenant_sandboxes[tenant_id] = sandbox

    return tenant_sandboxes[tenant_id]

# Example usage
tenant_a_sandbox = get_tenant_sandbox("tenant-a")
tenant_b_sandbox = get_tenant_sandbox("tenant-b")
```
Use Case: Private Document Analysis
```python
async def analyze_documents(documents: list[str]) -> list[dict]:
    """Analyze sensitive documents privately."""
    sandbox = await get_or_create_sandbox()

    results = []
    for doc in documents:
        # Upload document to sandbox
        sandbox.files.write("/workspace/document.txt", doc)

        # Analyze with Ollama. The model doesn't read file paths from prompts,
        # so inline the file contents (assumes the command runs through a shell).
        response = sandbox.commands.run(
            'ollama run llama3.3 "Summarize this document: $(cat /workspace/document.txt)"',
            timeout=180
        )

        results.append({
            "summary": response.stdout,
            "document": doc[:100]  # First 100 chars for reference
        })

    return results
```
Use Case: Code Generation and Testing
```python
async def generate_and_test_code(specification: str):
    """Generate and test code in isolated environment."""
    sandbox = await get_or_create_sandbox()

    # Generate code (pull codellama in your template first, e.g. .run_cmd("ollama pull codellama"))
    code_response = sandbox.commands.run(
        f"ollama run codellama 'Write Python function: {specification}'",
        timeout=120
    )

    generated_code = code_response.stdout

    # Write to file (in production, extract the code block from any surrounding explanation first)
    sandbox.files.write("/workspace/generated.py", generated_code)

    # Test the code
    test_result = sandbox.commands.run(
        "python /workspace/generated.py",
        timeout=30
    )

    return {
        "code": generated_code,
        "test_output": test_result.stdout,
        "success": test_result.exit_code == 0
    }
```
Monitoring and Debugging
```python
def monitor_sandbox(sandbox: Sandbox):
    """Monitor sandbox resource usage."""
    info = sandbox.get_info()

    print(f"Status: {info.status}")
    print(f"CPU cores: {info.cpu}")
    print(f"Memory: {info.memory}MB")
    print(f"Disk: {info.disk_gb}GB")
    print(f"Region: {info.region}")
    print(f"Created: {info.created_at}")

# Error handling with retry

async def run_with_retry(
    sandbox: Sandbox,
    command: str,
    max_retries: int = 3
) -> str:
    """Run command with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            result = sandbox.commands.run(command, timeout=120)
            return result.stdout
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            print(f"Timeout on attempt {attempt + 1}, retrying...")
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            print(f"Error on attempt {attempt + 1}: {e}, retrying...")
            await asyncio.sleep(2 ** attempt)
```
Troubleshooting Guide
Issue 1: Model Not Found
Problem: Error: model 'model-name' not found
Solution: Pull the model in your template:
```python
.run_cmd("/usr/local/bin/ollama pull your-model-name")
```
Issue 2: Out of Memory
Problem: Sandbox crashes with memory errors
Solution: Increase memory in BuildOptions:
```python
BuildOptions(memory=8192)  # Instead of 2048
```
Issue 3: Slow Response Times
Problem: Models take too long to respond
Solution: Use smaller models or increase CPU:
```python
OLLAMA_MODEL = "phi-3"   # Faster model
BuildOptions(cpu=4)      # More CPU
```
Issue 4: Connection Timeouts
Problem: SDK times out connecting
Solution: Increase timeout:
```python
sandbox.commands.run(command, timeout=300)
```
Quick Start Checklist
Step 1: Sign Up
- Visit console.hopx.ai
- Create account (no credit card required)
- Claim $200 in free credits
- Copy your API key from dashboard
Step 2: Setup
```bash
pip install hopx-ai
export HOPX_API_KEY=your-key-here
```
Step 3: Deploy
- Build template (~2 minutes, one time)
- Create sandbox (~100ms)
- Run any Ollama model
What you get:
| Metric | Value |
|---|---|
| Build time | ~2 minutes (once) |
| Cold start | ~100ms |
| Runtime limit | None |
| Cost | ~$0.17/hour for a 7B model (2 vCPU, 4 GB RAM, 20 GB disk) |
Comparing Your Options
| Approach | Cold Start | Isolation | Cost Model | Best For |
|---|---|---|---|---|
| Cloud APIs (OpenAI, Anthropic) | Instant | Provider-managed | Per-token | Low volume, varied tasks |
| Self-Hosted VMs | Minutes | Strong | Fixed monthly | Predictable high volume |
| Containers (Docker) | 10+ seconds | Shared kernel | Fixed or per-second | Development only |
| HopX + Ollama | ~100ms | Hardware-level | Per-second usage | Variable volume, privacy needs |
Choose HopX + Ollama when you need:
- Fast cold starts for user-facing applications
- Strong isolation for sensitive data
- Cost control through per-second billing
- Freedom to switch models without vendor lock-in
- Data privacy and regulatory compliance
Next Steps
You now have everything needed to run production LLMs in secure sandboxes.
What you learned:
- Privacy: Data stays in environments you control
- Speed: 100ms startup beats any container solution
- Cost: Pay only for seconds of actual usage
- Security: Hardware-level isolation protects sensitive workloads
- Flexibility: Run any model, any size, any configuration
Get started:
- Use the $200 free credits
- Build your first template
- Test a few models
- See how the economics work for your use case
When ready to scale, the same code works for 10 sandboxes or 1,000.
Frequently Asked Questions
Can I use any Ollama model?
Yes. Any model in the Ollama library works—Llama 3.3, Mixtral, CodeLlama, Phi-3, DeepSeek, and more. Just change the OLLAMA_MODEL variable in your template and rebuild.
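For example, reusing the Step 3 code, a quick sketch of the swap:

```python
# Swap the model, then rebuild so the new weights are baked into the template snapshot.
OLLAMA_MODEL = "mixtral"          # any tag from the Ollama library
result = await build_template()   # build_template() from Step 3 picks up the new model
```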
How much does it cost to run a model 24/7?
For a 7B model (2 vCPU, 4GB RAM, 20GB disk), expect roughly $4 per day (about $120 per month) at the per-second rates above. Larger models like 70B need more resources and cost proportionally more. Use sandbox.pause() when idle to cut costs by roughly 90%.
Is my data really private?
Yes. Each sandbox runs in its own micro-VM with dedicated kernel, filesystem, and network. Your prompts and outputs never leave the sandbox. You can also choose specific regions (US, EU) for data residency compliance.
How long can a sandbox run?
As long as you need—hours, days, or weeks. There are no 15-minute timeouts like AWS Lambda. You pay per second of runtime and can pause/resume to save costs.
Can I run multiple models in one sandbox?
Yes. Pull multiple models in your template, then switch between them at runtime with ollama run model-name. This is useful for routing simple queries to smaller models and complex ones to larger models.
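A minimal sketch of that pattern, built on the Step 3 template (the two model names are just examples; swap in whatever pair fits your routing):

```python
from hopx_ai import Template

SMALL_MODEL = "phi-3"      # fast, cheap responses
LARGE_MODEL = "llama3.3"   # heavier reasoning

# Pull both models at build time so either is ready instantly at runtime
multi_model_template = (
    Template()
    .from_python_image("3.13")
    .run_cmd("mkdir -p /workspace")
    .run_cmd("curl -fsSL https://ollama.com/install.sh | sh")
    .run_cmd(f"/usr/local/bin/ollama pull {SMALL_MODEL}")
    .run_cmd(f"/usr/local/bin/ollama pull {LARGE_MODEL}")
    .set_workdir("/workspace")
)

def ask(sandbox, prompt: str, complex_task: bool = False) -> str:
    """Route a prompt to the small or large model."""
    model = LARGE_MODEL if complex_task else SMALL_MODEL
    result = sandbox.commands.run(f"ollama run {model} '{prompt}'", timeout=180)
    return result.stdout
```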
What if my model is too slow?
Three options: (1) Use a smaller, faster model like Phi-3, (2) Increase vCPU count in BuildOptions, (3) Use quantized versions of models (q4 instead of full precision).
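For option (3), quantized builds are published as separate tags in the Ollama library. In your template, pull the quantized tag instead of the default weights; the tag below is only an example, so verify it on the model's library page:

```python
# Example quantized tag; exact tag names vary by model, check the Ollama library page.
.run_cmd("/usr/local/bin/ollama pull llama3.3:70b-instruct-q4_K_M")
```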
Ready to run your own LLMs? Sign up for HopX and get $200 in free credits to start.