# Docker Compose Examples for Local AI Stack

This document provides production-ready `docker-compose.yml` examples for setting up the self-hosted AI services required by the Teto AI Companion bot. These services should be included in the same `docker-compose.yml` file as the `teto_ai` bot service itself to ensure proper network communication.

> [!IMPORTANT]
> These examples require a host machine with an NVIDIA GPU and properly installed drivers. They use CDI (Container Device Interface) for GPU reservations, which is the modern standard for Docker.
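
If CDI has not yet been set up on the host, the NVIDIA Container Toolkit can generate the spec for you. A minimal sketch, assuming the toolkit is already installed and the conventional spec path (adjust for your distribution):

```bash
# Generate the CDI specification for the host's NVIDIA GPUs (one-time setup).
# /etc/cdi/nvidia.yaml is the conventional location; your distro may differ.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Verify that the device name used in the examples below (nvidia.com/gpu=all) is listed.
nvidia-ctk cdi list
```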

## 🤖 vLLM Service (Language & Vision Model)

This service uses `vLLM` to serve a powerful language model with an OpenAI-compatible API endpoint. This allows Teto to perform natural language understanding and generation locally. If you use a multi-modal model, this service will also provide vision capabilities.

```yaml
services:
  vllm-openai:
    # This section reserves GPU resources for the container.
    # It ensures vLLM has exclusive access to the NVIDIA GPUs.
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
    # Mount local directories for model weights and cache.
    # This prevents re-downloading models on every container restart.
    volumes:
      - /path/to/your/llm_models/hf_cache:/root/.cache/huggingface
      - /path/to/your/llm_models:/root/LLM_models
    # Map the container's port 8000 to a host port (e.g., 11434).
    # Your .env file should point to this host port.
    ports:
      - "11434:8000"
    environment:
      # (Optional) Add your Hugging Face token if needed for private models.
      - HUGGING_FACE_HUB_TOKEN=your_hf_token_here
      # Optimizes PyTorch memory allocation, which can improve performance.
      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.8
    # Necessary for multi-GPU communication and performance.
    ipc: host
    image: vllm/vllm-openai:latest
    # --- vLLM Command Line Arguments ---
    # These arguments configure how vLLM serves the model; adjust them
    # based on your model and hardware. Note that YAML does not treat "#"
    # as a comment inside a folded block scalar (it would be passed to
    # vLLM as literal text), so the per-flag notes live here instead:
    #   --tensor-parallel-size 2       Number of GPUs to use.
    #   --max-model-len 32256          Maximum context length.
    #   --limit-mm-per-prompt image=4  For multi-modal models.
    #   --enable-auto-tool-choice      For models that support tool use.
    #   --gpu-memory-utilization 0.75  Use 75% of GPU VRAM.
    #   --max-num-seqs 4               Max concurrent sequences.
    command: >
      --model jeffcookio/Mistral-Small-3.2-24B-Instruct-2506-awq-sym
      --tensor-parallel-size 2
      --max-model-len 32256
      --limit-mm-per-prompt image=4
      --enable-auto-tool-choice
      --tool-call-parser mistral
      --enable-chunked-prefill
      --disable-log-stats
      --gpu-memory-utilization 0.75
      --enable-prefix-caching
      --max-num-seqs 4
      --served-model-name Mistral-Small-3.2
```

### vLLM Configuration Notes

- **`--model`**: Specify the Hugging Face model identifier you want to serve.
- **`--tensor-parallel-size`**: Set this to the number of GPUs you want to use for a single model. For a single GPU, this should be `1`.
- **`--gpu-memory-utilization`**: Adjust this value based on your VRAM. `0.75` (75%) is a safe starting point.
- Check the [official vLLM documentation](https://docs.vllm.ai/en/latest/) for the latest command-line arguments and supported models.
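
Once the container reports the model as loaded, you can sanity-check the OpenAI-compatible endpoint from the host. A minimal sketch, using the host port mapping (`11434`) and the `--served-model-name` from the example above:

```bash
# Request a short completion from the served model via the OpenAI-compatible API.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mistral-Small-3.2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```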

## 🎤 Wyoming Voice Services (Piper TTS & Whisper STT)

These services provide Text-to-Speech (`Piper`) and Speech-to-Text (`Whisper`) capabilities over the `Wyoming` protocol. They run as separate containers but are managed within the same Docker Compose file.

```yaml
services:
  # --- Whisper STT Service ---
  # Converts speech from the voice channel into text for Teto to understand.
  wyoming-whisper:
    image: slackr31337/wyoming-whisper-gpu:latest
    container_name: wyoming-whisper
    environment:
      # Configure the Whisper model size and language.
      # Smaller models are faster but less accurate.
      - MODEL=base-int8
      - LANGUAGE=en
      - COMPUTE_TYPE=int8
      - BEAM_SIZE=5
    ports:
      # Exposes the Wyoming protocol port for Whisper.
      - "10300:10300"
    volumes:
      # Mount a volume to persist Whisper model data.
      - /path/to/your/whisper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']

  # --- Piper TTS Service ---
  # Converts Teto's text responses into speech.
  wyoming-piper:
    image: slackr31337/wyoming-piper-gpu:latest
    container_name: wyoming-piper
    environment:
      # Specify which Piper voice model to use.
      - PIPER_VOICE=en_US-amy-medium
    ports:
      # Exposes the Wyoming protocol port for Piper.
      - "10200:10200"
    volumes:
      # Mount a volume to persist Piper voice models.
      - /path/to/your/piper_data:/data
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids: ['nvidia.com/gpu=all']
              capabilities: ['gpu']
```
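
Once both containers are up, a quick TCP check confirms each Wyoming service is listening on its mapped host port (ports as in the example above):

```bash
# Check that the Wyoming ports from the compose example accept connections.
nc -z localhost 10300 && echo "whisper: listening"
nc -z localhost 10200 && echo "piper: listening"
```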

### Wyoming Configuration Notes

- **Multiple Ports**: Note that `Whisper` and `Piper` listen on different ports (`10300` and `10200` in this example). Your bot's configuration will need to point to the correct service and port.
- **Voice Models**: You can download different `Piper` voice models and place them in your persistent data directory to change Teto's voice.
- **GPU Usage**: These images are for GPU-accelerated voice processing. If your GPU is dedicated to `vLLM`, consider CPU-based images for Wyoming to conserve VRAM; a sketch of that variant follows this list.
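
A minimal CPU-only sketch, assuming the upstream `rhasspy/wyoming-whisper` and `rhasspy/wyoming-piper` images (verify image names, tags, and arguments against the rhasspy project before use):

```yaml
services:
  # CPU-only Whisper STT -- assumed upstream image; no GPU reservation needed.
  wyoming-whisper:
    image: rhasspy/wyoming-whisper
    command: --model base-int8 --language en
    ports:
      - "10300:10300"
    volumes:
      - /path/to/your/whisper_data:/data
    restart: unless-stopped

  # CPU-only Piper TTS -- assumed upstream image.
  wyoming-piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-amy-medium
    ports:
      - "10200:10200"
    volumes:
      - /path/to/your/piper_data:/data
    restart: unless-stopped
```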

## 🌐 Networking

For the services to communicate with each other, they must share a Docker network. Using an external network is a good practice for managing complex applications.

```yaml
# Add this to the bottom of your docker-compose.yml file
networks:
  backend:
    external: true
```

Before starting your stack, create the network manually:

```bash
docker network create backend
```
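
You can confirm the network exists, and later see which containers have joined it, with:

```bash
# Lists the network's configuration and its currently attached containers.
docker network inspect backend
```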

Then, ensure each service in your `docker-compose.yml` (including the `teto_ai` bot) is attached to this network:

```yaml
services:
  teto_ai:
    # ... your bot's configuration
    networks:
      - backend

  vllm-openai:
    # ... vllm configuration
    networks:
      - backend

  wyoming-whisper:
    # ... whisper configuration
    networks:
      - backend

  wyoming-piper:
    # ... piper configuration
    networks:
      - backend
```

This allows the Teto bot to communicate with `vllm-openai`, `wyoming-whisper`, and `wyoming-piper` using their service names as hostnames.
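
For example, from inside the `teto_ai` container the other services resolve by name, and you connect to their *container* ports (e.g., vLLM's `8000`, not the host-mapped `11434`). A hypothetical `.env` sketch; the key names are illustrative, not the bot's actual configuration schema:

```bash
# Hypothetical .env entries -- key names are illustrative.
# Service names resolve via Docker's embedded DNS on the shared network;
# use container ports here, not host-mapped ports.
LLM_API_BASE=http://vllm-openai:8000/v1
WHISPER_URI=tcp://wyoming-whisper:10300
PIPER_URI=tcp://wyoming-piper:10200
```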