# Docker Compose Examples for Local AI Stack

This document provides production-ready `docker-compose.yml` examples for setting up the self-hosted AI services required by the Teto AI Companion bot. These services should be included in the same `docker-compose.yml` file as the `teto_ai` bot service itself to ensure proper network communication.

> [!IMPORTANT]
> These examples require a host machine with an NVIDIA GPU and properly installed drivers. They use CDI (Container Device Interface) for GPU reservations, which is the modern standard for Docker.
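
If CDI has not yet been set up on the host, the NVIDIA Container Toolkit can generate the spec for you. A minimal sketch, assuming the toolkit is already installed and the conventional spec path (adjust for your distribution):

```bash
# Generate the CDI specification for the host's NVIDIA GPUs (one-time setup).
# /etc/cdi/nvidia.yaml is the conventional location; your distro may differ.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Verify that the device name used in the examples below (nvidia.com/gpu=all) is listed.
nvidia-ctk cdi list
```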

## 🤖 vLLM Service (Language & Vision Model)

This service uses `vLLM` to serve a powerful language model with an OpenAI-compatible API endpoint. This allows Teto to perform natural language understanding and generation locally. If you use a multi-modal model, this service will also provide vision capabilities.

```yaml
services:
  vllm-openai:
    # This section reserves GPU resources for the container.
    # It ensures vLLM has exclusive access to the NVIDIA GPUs.
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
    # Mount local directories for model weights and cache.
    # This prevents re-downloading models on every container restart.
    volumes:
      - /path/to/your/llm_models/hf_cache:/root/.cache/huggingface
      - /path/to/your/llm_models:/root/LLM_models
    # Map the container's port 8000 to a host port (e.g., 11434).
    # Your .env file should point to this host port.
    ports:
      - "11434:8000"
    environment:
      # (Optional) Add your Hugging Face token if needed for private models.
      - HUGGING_FACE_HUB_TOKEN=your_hf_token_here
      # Optimizes PyTorch memory allocation, which can improve performance.
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8
    # Necessary for multi-GPU communication and performance.
    ipc: host
    image: vllm/vllm-openai:latest
    # --- vLLM Command Line Arguments ---
    # These arguments configure how vLLM serves the model; adjust them
    # based on your model and hardware. Note that YAML does not treat "#"
    # as a comment inside a folded block scalar (it would be passed to
    # vLLM as literal text), so the per-flag notes live here instead:
    #   --tensor-parallel-size 2       Number of GPUs to use.
    #   --max-model-len 32256          Maximum context length.
    #   --limit-mm-per-prompt image=4  For multi-modal models.
    #   --enable-auto-tool-choice      For models that support tool use.
    #   --gpu-memory-utilization 0.75  Use 75% of GPU VRAM.
    #   --max-num-seqs 4               Max concurrent sequences.
    command: >
      --model jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym
      --tensor-parallel-size 2
      --max-model-len 32256
      --limit-mm-per-prompt image=4
      --enable-auto-tool-choice
      --tool-call-parser mistral
      --enable-chunked-prefill
      --disable-log-stats
      --gpu-memory-utilization 0.75
      --enable-prefix-caching
      --max-num-seqs 4
      --served-model-name Mistral-Small-3.2
```

### vLLM Configuration Notes

- **`--model`**: Specify the Hugging Face model identifier you want to serve.
- **`--tensor-parallel-size`**: Set this to the number of GPUs you want to use for a single model. For a single GPU, this should be `1`.
- **`--gpu-memory-utilization`**: Adjust this value based on your VRAM. `0.75` (75%) is a safe starting point.
- Check the [official vLLM documentation](https://docs.vllm.ai/en/latest/) for the latest command-line arguments and supported models.
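
Once the container reports the model as loaded, you can sanity-check the OpenAI-compatible endpoint from the host. A minimal sketch, using the host port mapping (`11434`) and the `--served-model-name` from the example above:

```bash
# Request a short completion from the served model via the OpenAI-compatible API.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mistral-Small-3.2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```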

## 🎤 Wyoming Voice Services (Piper TTS & Whisper STT)

These services provide Text-to-Speech (`Piper`) and Speech-to-Text (`Whisper`) capabilities over the `Wyoming` protocol. They run as separate containers but are managed within the same Docker Compose file.

```yaml
services:
  # --- Whisper STT Service ---
  # Converts speech from the voice channel into text for Teto to understand.
  wyoming-whisper:
    image: slackr31337/wyoming-whisper-gpu:latest
    container_name: wyoming-whisper
    environment:
      # Configure the Whisper model size and language.
      # Smaller models are faster but less accurate.
      - MODEL=base-int8
      - LANGUAGE=en
      - COMPUTE_TYPE=int8
      - BEAM_SIZE=5
    ports:
      # Exposes the Wyoming protocol port for Whisper.
      - "10300:10300"
    volumes:
      # Mount a volume to persist Whisper model data.
      - /path/to/your/whisper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']

  # --- Piper TTS Service ---
  # Converts Teto's text responses into speech.
  wyoming-piper:
    image: slackr31337/wyoming-piper-gpu:latest
    container_name: wyoming-piper
    environment:
      # Specify which Piper voice model to use.
      - PIPER_VOICE=en_US-amy-medium
    ports:
      # Exposes the Wyoming protocol port for Piper.
      - "10200:10200"
    volumes:
      # Mount a volume to persist Piper voice models.
      - /path/to/your/piper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
```
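
Once both containers are up, a quick TCP check confirms each Wyoming service is listening on its mapped host port (ports as in the example above):

```bash
# Check that the Wyoming ports from the compose example accept connections.
nc -z localhost 10300 && echo "whisper: listening"
nc -z localhost 10200 && echo "piper: listening"
```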

### Wyoming Configuration Notes

- **Multiple Ports**: Note that `Whisper` and `Piper` listen on different ports (`10300` and `10200` in this example). Your bot's configuration will need to point to the correct service and port.
- **Voice Models**: You can download different `Piper` voice models and place them in your persistent data directory to change Teto's voice.
- **GPU Usage**: These images are for GPU-accelerated voice processing. If your GPU is dedicated to `vLLM`, consider CPU-based images for Wyoming to conserve VRAM; a sketch of that variant follows this list.
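
A minimal CPU-only sketch, assuming the upstream `rhasspy/wyoming-whisper` and `rhasspy/wyoming-piper` images (verify image names, tags, and arguments against the rhasspy project before use):

```yaml
services:
  # CPU-only Whisper STT -- assumed upstream image; no GPU reservation needed.
  wyoming-whisper:
    image: rhasspy/wyoming-whisper
    command: --model base-int8 --language en
    ports:
      - "10300:10300"
    volumes:
      - /path/to/your/whisper_data:/data
    restart: unless-stopped

  # CPU-only Piper TTS -- assumed upstream image.
  wyoming-piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-amy-medium
    ports:
      - "10200:10200"
    volumes:
      - /path/to/your/piper_data:/data
    restart: unless-stopped
```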

## 🌐 Networking

For the services to communicate with each other, they must share a Docker network. Using an external network is a good practice for managing complex applications.

```yaml
# Add this to the bottom of your docker-compose.yml file
networks:
  backend:
    external: true
```

Before starting your stack, create the network manually:

```bash
docker network create backend
```
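
You can confirm the network exists, and later see which containers have joined it, with:

```bash
# Lists the network's configuration and its currently attached containers.
docker network inspect backend
```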

Then, ensure each service in your `docker-compose.yml` (including the `teto_ai` bot) is attached to this network:

```yaml
services:
  teto_ai:
    # ... your bot's configuration
    networks:
      - backend

  vllm-openai:
    # ... vllm configuration
    networks:
      - backend

  wyoming-whisper:
    # ... whisper configuration
    networks:
      - backend

  wyoming-piper:
    # ... piper configuration
    networks:
      - backend
```

This allows the Teto bot to communicate with `vllm-openai`, `wyoming-whisper`, and `wyoming-piper` using their service names as hostnames.
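
For example, from inside the `teto_ai` container the other services resolve by name, and you connect to their *container* ports (e.g., vLLM's `8000`, not the host-mapped `11434`). A hypothetical `.env` sketch; the key names are illustrative, not the bot's actual configuration schema:

```bash
# Hypothetical .env entries -- key names are illustrative.
# Service names resolve via Docker's embedded DNS on the shared network;
# use container ports here, not host-mapped ports.
LLM_API_BASE=http://vllm-openai:8000/v1
WHISPER_URI=tcp://wyoming-whisper:10300
PIPER_URI=tcp://wyoming-piper:10200
```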