Docker Compose Examples for Local AI Stack

This document provides production-ready docker-compose.yml examples for the self-hosted AI services required by the Teto AI Companion bot. Include these services in the same docker-compose.yml file as the teto_ai bot service so that all containers share a network and can reach one another by service name.

Important

These examples require a host machine with an NVIDIA GPU, properly installed drivers, and the NVIDIA Container Toolkit. They use CDI (Container Device Interface) for GPU reservations, the modern standard for exposing GPUs to Docker containers.
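
If you have not yet generated a CDI specification for your GPUs, the NVIDIA Container Toolkit can create one. A minimal sketch, assuming nvidia-ctk is installed (the output path may differ by distribution):

# Generate the CDI specification for all NVIDIA devices on the host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names Docker can reserve (should include nvidia.com/gpu=all).
nvidia-ctk cdi list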

🤖 vLLM Service (Language & Vision Model)

This service uses vLLM to serve a powerful language model with an OpenAI-compatible API endpoint. This allows Teto to perform natural language understanding and generation locally. If you use a multi-modal model, this service will also provide vision capabilities.

services:
  vllm-openai:
    # This section reserves GPU resources for the container.
    # It ensures vLLM has exclusive access to the NVIDIA GPUs.
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
    # Mount local directories for model weights and cache.
    # This prevents re-downloading models on every container restart.
    volumes:
      - /path/to/your/llm_models/hf_cache:/root/.cache/huggingface
      - /path/to/your/llm_models:/root/LLM_models
    # Map the container's port 8000 to a host port (e.g., 11434).
    # Your .env file should point to this host port.
    ports:
      - "11434:8000"
    environment:
      # (Optional) Add your Hugging Face token if needed for private models.
      - HUGGING_FACE_HUB_TOKEN=your_hf_token_here
      # Tunes PyTorch's CUDA memory allocator and can improve performance.
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8
    # Necessary for multi-GPU communication and performance.
    ipc: host
    image: vllm/vllm-openai:latest
    # --- vLLM Command Line Arguments ---
    # These arguments configure how vLLM serves the model.
    # Adjust them based on your model and hardware.
    # Note: YAML does not treat "#" inside a block scalar as a comment
    # (it would be passed to vLLM as a literal argument), so the flags
    # are annotated here rather than inline:
    #   --tensor-parallel-size    Number of GPUs to use.
    #   --max-model-len           Maximum context length.
    #   --limit-mm-per-prompt     Max images per prompt (multi-modal models).
    #   --enable-auto-tool-choice For models that support tool use.
    #   --gpu-memory-utilization  Fraction of GPU VRAM to use (0.75 = 75%).
    #   --max-num-seqs            Max concurrent sequences.
    #   --served-model-name       Model name exposed via the API.
    command: >
      --model jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym
      --tensor-parallel-size 2
      --max-model-len 32256
      --limit-mm-per-prompt image=4
      --enable-auto-tool-choice
      --tool-call-parser mistral
      --enable-chunked-prefill
      --disable-log-stats
      --gpu-memory-utilization 0.75
      --enable-prefix-caching
      --max-num-seqs 4
      --served-model-name Mistral-Small-3.2

vLLM Configuration Notes

  • --model: Specify the Hugging Face model identifier you want to serve.
  • --tensor-parallel-size: Set this to the number of GPUs you want to use for a single model. For a single GPU, this should be 1.
  • --gpu-memory-utilization: Adjust this value based on your VRAM. 0.75 (75%) is a safe starting point.
  • Check the official vLLM documentation for the latest command-line arguments and supported models.
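
Once the container is up, you can smoke-test the OpenAI-compatible endpoint from the host. A minimal check, assuming the host port mapping (11434) and the served model name from the example above:

# List the models vLLM is serving; the response should include "Mistral-Small-3.2".
curl http://localhost:11434/v1/models

# Request a short chat completion.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mistral-Small-3.2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'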

🎤 Wyoming Voice Services (Piper TTS & Whisper STT)

These services provide Text-to-Speech (Piper) and Speech-to-Text (Whisper) capabilities over the Wyoming protocol. They run as separate containers but are managed within the same Docker Compose file.

services:
  # --- Whisper STT Service ---
  # Converts speech from the voice channel into text for Teto to understand.
  wyoming-whisper:
    image: slackr31337/wyoming-whisper-gpu:latest
    container_name: wyoming-whisper
    environment:
      # Configure the Whisper model size and language.
      # Smaller models are faster but less accurate.
      - MODEL=base-int8
      - LANGUAGE=en
      - COMPUTE_TYPE=int8
      - BEAM_SIZE=5
    ports:
      # Exposes the Wyoming protocol port for Whisper.
      - "10300:10300"
    volumes:
      # Mount a volume to persist Whisper model data.
      - /path/to/your/whisper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']

  # --- Piper TTS Service ---
  # Converts Teto's text responses into speech.
  wyoming-piper:
    image: slackr31337/wyoming-piper-gpu:latest
    container_name: wyoming-piper
    environment:
      # Specify which Piper voice model to use.
      - PIPER_VOICE=en_US-amy-medium
    ports:
      # Exposes the Wyoming protocol port for Piper.
      - "10200:10200"
    volumes:
      # Mount a volume to persist Piper voice models.
      - /path/to/your/piper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']

Wyoming Configuration Notes

  • Multiple Ports: Note that Whisper and Piper listen on different ports (10300 and 10200 in this example). Your bot's configuration will need to point to the correct service and port.
  • Voice Models: You can download different Piper voice models and place them in your persistent data directory to change Teto's voice.
  • GPU Usage: These images use GPU-accelerated voice processing. If your GPU is dedicated to vLLM, consider CPU-based Wyoming images instead to conserve VRAM.
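
After docker compose up -d, a quick way to confirm both Wyoming services are reachable is to probe their ports from the host. A minimal check, assuming the port mappings above:

# Both commands should report the connection as succeeded.
nc -zv localhost 10300   # wyoming-whisper
nc -zv localhost 10200   # wyoming-piper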

🌐 Networking

For the services to communicate with each other, they must share a Docker network. Declaring the network as external keeps its lifecycle independent of the compose stack, which is good practice for larger applications.

# Add this to the bottom of your docker-compose.yml file
networks:
  backend:
    external: true

Before starting your stack, create the network manually:

docker network create backend
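
To confirm the network exists (and, once the stack is running, see which containers have joined it):

docker network inspect backend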

Then, ensure each service in your docker-compose.yml (including the teto_ai bot) is attached to this network:

services:
  teto_ai:
    # ... your bot's configuration
    networks:
      - backend

  vllm-openai:
    # ... vllm configuration
    networks:
      - backend

  wyoming-whisper:
    # ... whisper configuration
    networks:
      - backend

  wyoming-piper:
    # ... piper configuration
    networks:
      - backend

This allows the Teto bot to communicate with vllm-openai, wyoming-whisper, and wyoming-piper using their service names as hostnames.
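
For example, the bot can address each service by name rather than by host IP. The variable names below are hypothetical placeholders (check teto_ai's actual .env keys); the hostnames and ports match the services defined above:

# Hypothetical .env entries; substitute the bot's real variable names.
# Note: inside the Docker network the bot uses the container port (8000),
# not the 11434 host mapping.
LLM_API_BASE=http://vllm-openai:8000/v1
WHISPER_HOST=wyoming-whisper
WHISPER_PORT=10300
PIPER_HOST=wyoming-piper
PIPER_PORT=10200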