# Docker Compose Examples for Local AI Stack

This document provides production-ready `docker-compose.yml` examples for setting up the self-hosted AI services required by the Teto AI Companion bot. These services should be included in the same `docker-compose.yml` file as the `teto_ai` bot service itself to ensure proper network communication.

> [!IMPORTANT]
> These examples require a host machine with an NVIDIA GPU and properly installed drivers. They use CDI (Container Device Interface) for GPU reservations, which is the modern standard for Docker.

## 🤖 vLLM Service (Language & Vision Model)

This service uses `vLLM` to serve a powerful language model with an OpenAI-compatible API endpoint. This allows Teto to perform natural language understanding and generation locally. If you use a multi-modal model, this service also provides vision capabilities.

```yaml
services:
  vllm-openai:
    # This section reserves GPU resources for the container.
    # It ensures vLLM has exclusive access to the NVIDIA GPUs.
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
    # Mount local directories for model weights and cache.
    # This prevents re-downloading models on every container restart.
    volumes:
      - /path/to/your/llm_models/hf_cache:/root/.cache/huggingface
      - /path/to/your/llm_models:/root/LLM_models
    # Map the container's port 8000 to a host port (e.g., 11434).
    # Your .env file should point to this host port.
    ports:
      - "11434:8000"
    environment:
      # (Optional) Add your Hugging Face token if needed for private models.
      - HUGGING_FACE_HUB_TOKEN=your_hf_token_here
      # Optimizes PyTorch memory allocation, which can improve performance.
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8
    # Necessary for multi-GPU communication and performance.
    ipc: host
    image: vllm/vllm-openai:latest
    # --- vLLM Command Line Arguments ---
    # These arguments configure how vLLM serves the model.
    # Adjust them based on your model and hardware; see the notes below.
    # (YAML treats '#' inside a block scalar as literal text, so keep
    # explanatory comments out of the command string itself.)
    command: >
      --model jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym
      --tensor-parallel-size 2
      --max-model-len 32256
      --limit-mm-per-prompt image=4
      --enable-auto-tool-choice
      --tool-call-parser mistral
      --enable-chunked-prefill
      --disable-log-stats
      --gpu-memory-utilization 0.75
      --enable-prefix-caching
      --max-num-seqs 4
      --served-model-name Mistral-Small-3.2
```

### vLLM Configuration Notes

- **`--model`**: Specify the Hugging Face model identifier you want to serve.
- **`--tensor-parallel-size`**: Set this to the number of GPUs you want to use for a single model. For a single GPU, this should be `1`.
- **`--max-model-len`**: Maximum context length, in tokens.
- **`--gpu-memory-utilization`**: Adjust this value based on your VRAM. `0.75` (75%) is a safe starting point.
- **`--max-num-seqs`**: Maximum number of concurrent sequences.
- **`--limit-mm-per-prompt`**: Limits attachments (e.g., images) per prompt for multi-modal models.
- **`--enable-auto-tool-choice` / `--tool-call-parser`**: Enable tool use for models that support it; the parser must match the model family (`mistral` here).
- Check the [official vLLM documentation](https://docs.vllm.ai/en/latest/) for the latest command-line arguments and supported models.
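Once the container has finished loading the model, you can sanity-check the endpoint from the host. This is a minimal sketch assuming the `11434:8000` host port mapping and the `--served-model-name Mistral-Small-3.2` value from the example above; vLLM exposes the standard OpenAI-compatible `/v1/models` and `/v1/chat/completions` routes.

```bash
# List the served models; expect to see "Mistral-Small-3.2"
# (the value passed to --served-model-name).
curl -s http://localhost:11434/v1/models

# Send a minimal chat completion request to confirm generation works.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mistral-Small-3.2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```

If the first request hangs or is refused, check `docker compose logs` for the vLLM service; model downloads and weight loading can take several minutes on first start.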
## 🎤 Wyoming Voice Services (Piper TTS & Whisper STT)

These services provide Text-to-Speech (`Piper`) and Speech-to-Text (`Whisper`) capabilities over the `Wyoming` protocol. They run as separate containers but are managed within the same Docker Compose file.

```yaml
services:
  # --- Whisper STT Service ---
  # Converts speech from the voice channel into text for Teto to understand.
  wyoming-whisper:
    image: slackr31337/wyoming-whisper-gpu:latest
    container_name: wyoming-whisper
    environment:
      # Configure the Whisper model size and language.
      # Smaller models are faster but less accurate.
      - MODEL=base-int8
      - LANGUAGE=en
      - COMPUTE_TYPE=int8
      - BEAM_SIZE=5
    ports:
      # Exposes the Wyoming protocol port for Whisper.
      - "10300:10300"
    volumes:
      # Mount a volume to persist Whisper model data.
      - /path/to/your/whisper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']

  # --- Piper TTS Service ---
  # Converts Teto's text responses into speech.
  wyoming-piper:
    image: slackr31337/wyoming-piper-gpu:latest
    container_name: wyoming-piper
    environment:
      # Specify which Piper voice model to use.
      - PIPER_VOICE=en_US-amy-medium
    ports:
      # Exposes the Wyoming protocol port for Piper.
      - "10200:10200"
    volumes:
      # Mount a volume to persist Piper voice models.
      - /path/to/your/piper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
```

### Wyoming Configuration Notes

- **Multiple Ports**: `Whisper` and `Piper` listen on different ports (`10300` and `10200` in this example). Your bot's configuration must point to the correct service and port for each.
- **Voice Models**: You can download different `Piper` voice models and place them in your persistent data directory to change Teto's voice.
- **GPU Usage**: These images use GPU-accelerated voice processing. If your GPU is dedicated to `vLLM`, consider using CPU-based Wyoming images instead to conserve VRAM.

## 🌐 Networking

For the services to communicate with each other, they must share a Docker network. Using an external network is good practice for managing complex applications.

```yaml
# Add this to the bottom of your docker-compose.yml file
networks:
  backend:
    external: true
```

Before starting your stack, create the network manually:

```bash
docker network create backend
```

Then, ensure each service in your `docker-compose.yml` (including the `teto_ai` bot) is attached to this network:

```yaml
services:
  teto_ai:
    # ... your bot's configuration
    networks:
      - backend

  vllm-openai:
    # ... vLLM configuration
    networks:
      - backend

  wyoming-whisper:
    # ... Whisper configuration
    networks:
      - backend

  wyoming-piper:
    # ... Piper configuration
    networks:
      - backend
```

This allows the Teto bot to reach `vllm-openai`, `wyoming-whisper`, and `wyoming-piper` using their service names as hostnames.
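As a final sanity check after bringing the stack up, you can verify name resolution and reachability from inside the bot's container. This is a rough sketch assuming the service names and ports from the examples above, and that the `teto_ai` image ships `curl` and `bash` (swap in whatever tools your image actually includes). Note that inside the Docker network the bot talks to vLLM on its container port (`8000`), not the host-mapped `11434`.

```bash
# Start (or restart) the full stack in the background.
docker compose up -d

# From inside the bot's container, confirm the vLLM API resolves by service name.
docker compose exec teto_ai curl -s http://vllm-openai:8000/v1/models

# Confirm the Wyoming ports are reachable by service name (plain TCP connect test
# using bash's /dev/tcp feature).
docker compose exec teto_ai bash -c 'exec 3<>/dev/tcp/wyoming-whisper/10300 && echo "whisper: ok"'
docker compose exec teto_ai bash -c 'exec 3<>/dev/tcp/wyoming-piper/10200 && echo "piper: ok"'
```

If any of these fail, confirm that every service lists `backend` under `networks:` and that the external network was created before the stack was started.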