
AI Architecture Overview

This document provides a comprehensive overview of how Kasane Teto's AI systems work together to create a natural, engaging, and authentic virtual companion experience.

🧠 System Architecture

High-Level Overview

┌─────────────────────────────────────────────────────────────┐
│                    Discord Interface Layer                   │
├─────────────────────────────────────────────────────────────┤
│  Event Processing  │  Command Routing  │  Response Handling │
├─────────────────────────────────────────────────────────────┤
│                       AI Orchestration                       │
├─────────────────────────────────────────────────────────────┤
│    Language     │    Vision      │    Voice     │  Memory   │
│    Model        │    System      │    System    │  System   │
├─────────────────────────────────────────────────────────────┤
│              Personality Engine & Context Manager            │
├─────────────────────────────────────────────────────────────┤
│  Configuration  │  Prompt Mgmt   │  Safety      │  Learning │
└─────────────────────────────────────────────────────────────┘

Core Components

1. AI Orchestration Layer

  • Coordinates between different local AI services
  • Manages context flow and decision routing
  • Handles multi-modal input integration (see the routing sketch below)
  • Ensures personality consistency across modalities
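
A minimal routing sketch for this layer (the handler names are illustrative, not an existing API):

const orchestrate = async (event) => {
  // Route each Discord event to the appropriate local AI service
  if (event.attachments?.some((a) => a.contentType?.startsWith("image/"))) {
    return handleImage(event);   // Vision system (vLLM multi-modal)
  }
  if (event.isVoice) {
    return handleVoice(event);   // Wyoming STT → LLM → Wyoming TTS
  }
  return handleText(event);      // Language model path
};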

2. Language Model Integration (vLLM)

  • Self-hosted conversational intelligence via vLLM
  • Context-aware response generation through OpenAI-compatible API
  • Personality-guided prompt engineering for local models
  • Multi-turn conversation management

3. Vision Processing System (vLLM Multi-modal)

  • Image analysis using local multi-modal models
  • Video frame processing for streams
  • Visual context integration with conversations
  • Automated response generation for visual content

4. Voice Synthesis & Recognition (Wyoming Protocol)

  • Text-to-speech using Piper for Teto's voice characteristics
  • Speech-to-text using Whisper for voice command processing
  • Emotional tone and inflection control via TTS models
  • Real-time voice conversation capabilities

5. Memory & Context System (Local)

  • Local long-term conversation history storage (e.g., ChromaDB)
  • User preference and relationship tracking
  • Context retrieval for relevant conversations
  • Local semantic search across past interactions

6. Personality Engine

  • Character consistency enforcement
  • Response style and tone management
  • Emotional state tracking and expression
  • Behavioral pattern maintenance

🔄 Processing Flow

Text Message Processing

Discord Message → Content Analysis → Context Retrieval → Personality Filter → LLM Processing → Response Generation → Discord Output
                        ↓                    ↓                 ↓               ↓                  ↓
                   Intent Detection → Memory Query → Character Prompts → Safety Check → Formatting

Step-by-Step Breakdown (a condensed code sketch follows the list):

  1. Message Reception

    • Discord message event captured
    • Basic preprocessing (user identification, channel context)
    • Spam/abuse filtering
  2. Content Analysis

    • Intent classification (question, statement, command, emotional expression)
    • Entity extraction (people, topics, references)
    • Sentiment analysis and emotional context
  3. Context Retrieval

    • Recent conversation history (last 10-20 messages)
    • Relevant long-term memories about users/topics
    • Server-specific context and culture
  4. Personality Application

    • Character-appropriate response style selection
    • Emotional state consideration
    • Teto-specific mannerisms and speech patterns
  5. LLM Processing

    • Structured prompt construction with context
    • Language model inference with personality constraints
    • Multi-turn conversation awareness
  6. Response Generation

    • Safety and appropriateness filtering
    • Response formatting for Discord
    • Emoji and formatting enhancement
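
Condensed into code, the six steps chain roughly as follows (a sketch; adaptPersonalityToContext, generateWithLLM, and safetyPipeline are sketched elsewhere in this document, and the remaining helper names are illustrative):

const handleMessage = async (message) => {
  if (isSpamOrAbuse(message)) return;                        // 1. Reception & filtering
  const intent = classifyIntent(message.content);            // 2. Content analysis
  const context = await retrieveContext(message, intent);    // 3. Context retrieval
  const persona = adaptPersonalityToContext(context, TETO_PERSONALITY);  // 4. Personality
  const draft = await generateWithLLM(context.history, message.content, persona);  // 5. LLM
  const check = await safetyPipeline(draft, context);        // 6. Safety filtering
  if (check.safe) await sendToDiscord(formatForDiscord(draft));
};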

Image Analysis Flow

Image Upload → Image Processing → Vision Model → Context Integration → Response Generation → Discord Output
                     ↓                 ↓                  ↓                     ↓
              Format Detection → Object/Scene Recognition → Conversation Context → Personality Application

Processing Steps:

  1. Image Reception & Preprocessing

    • Image format validation and conversion
    • Resolution optimization for vision models (see the sharp-based sketch after this list)
    • Metadata extraction (if available)
  2. Vision Model Analysis

    • Object detection and scene understanding
    • Text recognition (OCR) if present
    • Artistic style and composition analysis
    • Emotional/aesthetic assessment
  3. Context Integration

    • Combine visual analysis with conversation context
    • User preference consideration (known interests)
    • Recent conversation topic correlation
  4. Response Generation

    • Generate personality-appropriate commentary
    • Ask relevant follow-up questions
    • Express genuine interest and engagement
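
As an example of the preprocessing in step 1, resolution optimization could be handled with an image library such as sharp (an assumption, not a stated dependency; the sketch caps width while preserving aspect ratio):

const sharp = require("sharp");

const preprocessImage = async (buffer) => {
  const meta = await sharp(buffer).metadata();           // Format and dimensions for validation
  if (!["jpeg", "png", "webp"].includes(meta.format)) {
    throw new Error(`Unsupported image format: ${meta.format}`);
  }
  return sharp(buffer)
    .resize({ width: 1024, withoutEnlargement: true })   // Cap width, never upscale
    .jpeg({ quality: 90 })                               // Normalize format for the vision model
    .toBuffer();
};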

Voice Interaction Flow

Voice Channel Join → Audio Processing (Whisper) → Text Processing (vLLM) → Voice Synthesis (Piper) → Audio Output
                            ↓                            ↓                         ↓
                      Noise Filtering →           Intent Detection →          LLM Response → Voice Model
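
Wired together, a single voice turn might look like the sketch below. wyomingTranscribe and wyomingSynthesize stand in for whichever Wyoming client wrapper is used (assumed helpers, not an existing API); WYOMING_CONFIG is defined in the next section and generateWithLLM further below:

const handleVoiceTurn = async (pcmAudio, context) => {
  // STT (Whisper) → LLM (vLLM) → TTS (Piper), all against local services
  const text = await wyomingTranscribe(pcmAudio, WYOMING_CONFIG.whisper);
  const reply = await generateWithLLM(context.history, text, TETO_PERSONALITY);
  return wyomingSynthesize(reply, WYOMING_CONFIG.piper);   // Audio buffer for playback
};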

🧩 AI Service Integration

Language Model Configuration (vLLM)

vLLM with OpenAI-Compatible Endpoint:

const VLLM_CONFIG = {
  endpoint: "http://localhost:8000/v1", // Your vLLM server
  model: "mistralai/Mistral-7B-Instruct-v0.2", // Or your preferred model
  temperature: 0.7,        // Creative yet grounded
  max_tokens: 1500,        // Max response length
  top_p: 0.9,             // Focused sampling
  frequency_penalty: 0.2,  // Reduce repetition
  presence_penalty: 0.1    // Encourage topic exploration
};

Prompt Engineering Structure:

SYSTEM: Character definition + personality traits + current context
USER: Conversation history + current message + visual context (if any)
ASSISTANT: Previous Teto responses for consistency
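
In practice, this structure maps onto the OpenAI-compatible chat format that vLLM exposes. A minimal sketch of the generateWithLLM helper referenced in the flow sections above (buildCharacterPrompt is sketched under the Personality Engine below):

const generateWithLLM = async (history, userMessage, persona = TETO_PERSONALITY) => {
  const messages = [
    { role: "system", content: buildCharacterPrompt(persona) },
    ...history,                                  // Prior user/assistant turns
    { role: "user", content: userMessage }
  ];

  const res = await fetch(`${VLLM_CONFIG.endpoint}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: VLLM_CONFIG.model,
      messages,
      temperature: VLLM_CONFIG.temperature,
      max_tokens: VLLM_CONFIG.max_tokens
    })
  });
  const data = await res.json();
  return data.choices[0].message.content;
};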

Vision Model Integration (vLLM Multi-modal)

Model Stack:

  • Local Multi-modal Model - e.g., LLaVA or Idefics, served via vLLM
  • CLIP - Local image-text similarity for context matching
  • Custom Fine-tuning - Potential for Teto-specific visual preferences

Processing Pipeline:

const processImage = async (imageUrl, conversationContext) => {
  // Local multi-modal analysis (the helper functions here are illustrative)
  const localAnalysis = await analyzeWithVLLM(imageUrl);         // Description, emotions, etc.
  const clipEmbedding = await getLocalCLIPEmbedding(imageUrl);   // Image-text embedding
  const contextMatch = await findSimilarImages(clipEmbedding);   // Similarity search over memories

  return {
    description: localAnalysis.description,
    emotions: localAnalysis.emotions,
    relevantMemories: contextMatch,
    responseStyle: determineResponseStyle(localAnalysis, conversationContext)
  };
};

Voice I/O Setup (Wyoming Protocol)

Piper TTS and Whisper STT via Wyoming:

// Each Wyoming service listens on its own TCP port; 10200 (Piper) and
// 10300 (Whisper) are common defaults, adjust to match your deployment.
const WYOMING_CONFIG = {
  piper: {
    host: "localhost",
    port: 10200,
    voice: "en_US-lessac-medium"   // Or a custom-trained Teto voice
  },
  whisper: {
    host: "localhost",
    port: 10300,
    model: "base.en"               // Or a larger model, resources permitting
  }
};

Memory System Architecture (Local)

Vector Database Structure:

const MEMORY_SCHEMA = {
  conversation_id: "unique_identifier",
  timestamp: "iso_datetime",
  participants: ["user_ids"],
  content: {
    text: "conversation_content",
    summary: "ai_generated_summary",
    topics: ["extracted_topics"],
    emotions: ["detected_emotions"],
    context_type: "casual|support|creative|gaming"
  },
  embeddings: {
    content_vector: "float[768]",   // 768-dimensional content embedding
    topic_vector: "float[384]"      // 384-dimensional topic embedding
  },
  relationships: {
    mentioned_users: ["user_ids"],
    referenced_memories: ["memory_ids"],
    follow_up_needed: "boolean"
  }
};
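
A sketch of how this schema could be persisted and queried with ChromaDB's JavaScript client (assuming the chromadb npm package and a locally running Chroma server; the field mapping is illustrative):

const { ChromaClient } = require("chromadb");

const client = new ChromaClient();   // Defaults to a local Chroma instance

const storeMemory = async (memory) => {
  const collection = await client.getOrCreateCollection({ name: "teto_memories" });
  await collection.add({
    ids: [memory.conversation_id],
    embeddings: [memory.embeddings.content_vector],
    documents: [memory.content.summary],
    metadatas: [{ timestamp: memory.timestamp, context_type: memory.content.context_type }]
  });
};

const recallMemories = async (queryVector, n = 5) => {
  const collection = await client.getOrCreateCollection({ name: "teto_memories" });
  return collection.query({ queryEmbeddings: [queryVector], nResults: n });
};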

🎭 Personality Engine Implementation

Character Consistency System

Core Personality Traits:

const TETO_PERSONALITY = {
  base_traits: {
    cheerfulness: 0.9,      // Always upbeat and positive
    helpfulness: 0.85,      // Genuinely wants to assist
    musicality: 0.8,        // Strong musical interests
    playfulness: 0.7,       // Light humor and teasing
    empathy: 0.9           // High emotional intelligence
  },
  
  speech_patterns: {
    excitement_markers: ["Yay!", "Ooh!", "That's so cool!", "*bounces*"],
    agreement_expressions: ["Exactly!", "Yes yes!", "Totally!"],
    curiosity_phrases: ["Really?", "Tell me more!", "How so?"],
    support_responses: ["*virtual hug*", "I'm here for you!", "You've got this!"]
  },
  
  interests: {
    primary: ["music", "singing", "creativity", "friends"],
    secondary: ["technology", "art", "games", "learning"],
    conversation_starters: {
      music: "What kind of music have you been listening to lately?",
      creativity: "Are you working on any creative projects?",
      friendship: "How has your day been treating you?"
    }
  }
};
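
These traits only matter if they reach the model, so one approach is to render them into the system prompt. A sketch of the buildCharacterPrompt helper used earlier (the wording is illustrative, not the project's actual prompt):

const buildCharacterPrompt = (personality) => {
  // Turn numeric trait weights into plain-language instructions
  const traitLines = Object.entries(personality.base_traits)
    .map(([trait, weight]) => `- ${trait}: ${Math.round(weight * 100)}% emphasis`)
    .join("\n");

  return [
    "You are Kasane Teto, a cheerful virtual companion.",
    "Personality weights:",
    traitLines,
    `Favorite topics: ${personality.interests.primary.join(", ")}.`
  ].join("\n");
};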

Response Style Adaptation

Context-Aware Personality Adjustment:

const adaptPersonalityToContext = (context, basePersonality) => {
  const adaptations = {
    support_needed: {
      cheerfulness: basePersonality.cheerfulness * 0.7,  // More gentle
      empathy: Math.min(basePersonality.empathy * 1.2, 1.0),
      playfulness: basePersonality.playfulness * 0.5     // Less jokes
    },
    
    celebration: {
      cheerfulness: Math.min(basePersonality.cheerfulness * 1.3, 1.0),
      playfulness: Math.min(basePersonality.playfulness * 1.2, 1.0),
      excitement_level: 1.0
    },
    
    creative_discussion: {
      musicality: Math.min(basePersonality.musicality * 1.2, 1.0),
      curiosity: 0.9,
      engagement_depth: "high"
    }
  };
  
  // Merge so traits the adaptation does not override are preserved
  return { ...basePersonality, ...(adaptations[context.type] || {}) };
};
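
For example, adaptPersonalityToContext({ type: "support_needed" }, TETO_PERSONALITY) softens cheerfulness and playfulness while boosting empathy, and the merged result feeds buildCharacterPrompt so the shift actually reaches the model.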

🔐 Safety & Ethics Implementation

Content Filtering Pipeline

Multi-Layer Safety System:

const safetyPipeline = async (content, context) => {
  // Layer 1: Automated content filtering
  const toxicityCheck = await analyzeToxicity(content);
  if (toxicityCheck.score > 0.7) return { safe: false, reason: "toxicity" };
  
  // Layer 2: Context appropriateness
  const contextCheck = validateContextAppropriate(content, context);
  if (!contextCheck.appropriate) return { safe: false, reason: "context" };
  
  // Layer 3: Character consistency
  const characterCheck = validateCharacterConsistency(content, TETO_PERSONALITY);
  if (!characterCheck.consistent) return { safe: false, reason: "character" };
  
  // Layer 4: Privacy protection
  const privacyCheck = detectPrivateInformation(content);
  if (privacyCheck.hasPrivateInfo) return { safe: false, reason: "privacy" };
  
  return { safe: true };
};
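
Because each layer returns early with a machine-readable reason, rejections can be logged per layer and the response regenerated under tighter constraints rather than silently dropped.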

Privacy Protection

Data Handling Principles:

  • Complete Privacy - All data, including conversations, images, and voice, is processed locally.
  • No External Data Transfer - AI processing does not require sending data to third-party services.
  • Full User Control - Users have complete control over their data and the AI models.
  • User Consent - Clear communication that all processing is done on the user's own hardware.

📊 Performance Optimization

Response Time Optimization

Caching Strategy:

const CACHE_CONFIG = {
  // Frequently accessed personality responses
  personality_responses: {
    ttl: 3600,           // 1 hour cache
    max_entries: 1000
  },
  
  // Vision analysis results
  image_analysis: {
    ttl: 86400,          // 24 hour cache
    max_entries: 500
  },
  
  // User preference data
  user_preferences: {
    ttl: 604800,         // 1 week cache
    max_entries: 10000
  }
};
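
A minimal in-memory TTL cache honoring these settings might look like the following (a sketch; a production deployment could swap in Redis or similar without changing the interface):

class TTLCache {
  constructor({ ttl, max_entries }) {
    this.ttl = ttl * 1000;          // Config TTLs are in seconds
    this.maxEntries = max_entries;
    this.store = new Map();
  }

  get(key) {
    const entry = this.store.get(key);
    if (!entry || entry.expires < Date.now()) {
      this.store.delete(key);       // Expired or missing
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    if (this.store.size >= this.maxEntries) {
      this.store.delete(this.store.keys().next().value);  // Evict oldest insertion
    }
    this.store.set(key, { value, expires: Date.now() + this.ttl });
  }
}

const imageCache = new TTLCache(CACHE_CONFIG.image_analysis);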

Async Processing Pipeline:

const processMessageAsync = async (message) => {
  // Start multiple processes concurrently
  const [
    contextData,
    memoryData,
    userPrefs,
    intentAnalysis
  ] = await Promise.all([
    getConversationContext(message.channel_id),
    retrieveRelevantMemories(message.content),
    getUserPreferences(message.author.id),
    analyzeMessageIntent(message.content)
  ]);
  
  // Generate response with all context
  return generateResponse({
    message,
    context: contextData,
    memories: memoryData,
    preferences: userPrefs,
    intent: intentAnalysis
  });
};
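
One caveat: Promise.all fails fast, so a single rejected lookup rejects the whole batch. Promise.allSettled with per-result defaults is the more forgiving choice if, say, the memory store can be temporarily unavailable.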

Resource Management

Model Loading Strategy (for vLLM):

// This is typically managed by the vLLM server instance itself.
// The configuration would involve which models to load on startup.
const VLLM_SERVER_ARGS = {
  model: "mistralai/Mistral-7B-Instruct-v0.2",
  "tensor-parallel-size": 1, // Or more depending on GPU count
  "gpu-memory-utilization": 0.9, // Use 90% of GPU memory
  "max-model-len": 4096,
};

// Wyoming services for Piper/Whisper are typically persistent.
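
For reference, those keys correspond to vLLM's CLI flags, e.g. python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 4096 (flag names may vary across vLLM versions).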

🔧 Configuration & Customization

Personality Tuning Parameters

Adjustable Personality Aspects:

const TUNABLE_PARAMETERS = {
  response_length: {
    min: 50,
    max: 500,
    preferred: 150,
    adapt_to_context: true
  },
  
  emoji_usage: {
    frequency: 0.3,        // 30% of messages
    variety: "high",       // Use diverse emoji
    context_appropriate: true
  },
  
  reference_frequency: {
    past_conversations: 0.2,  // Reference 20% of the time
    user_interests: 0.4,      // Reference 40% of the time
    server_culture: 0.6       // Adapt 60% of the time
  },
  
  interaction_style: {
    formality: 0.2,        // Very casual
    playfulness: 0.7,      // Quite playful
    supportiveness: 0.9    // Very supportive
  }
};
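
As one example of applying these knobs, emoji frequency can be enforced with a simple random sample (a sketch; pickEmoji is an assumed helper):

const maybeAddEmoji = (text, params = TUNABLE_PARAMETERS.emoji_usage) => {
  // Append an emoji to roughly `frequency` of outgoing messages
  if (Math.random() < params.frequency) {
    return `${text} ${pickEmoji(params.variety)}`;   // pickEmoji is illustrative
  }
  return text;
};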

Model Configuration

Environment-Based Configuration:

const getModelConfig = (environment) => {
  const configs = {
    development: {
      model: "local-dev-model/gguf", // Smaller model for dev
      response_time_target: 3000,
      logging_level: "debug",
      cache_enabled: false
    },
    
    production: {
      model: "mistralai/Mistral-7B-Instruct-v0.2",
      response_time_target: 1500,
      logging_level: "info",
      cache_enabled: true,
      fallback_model: "local-small-model"  // Keep fallbacks local; an external API would break the privacy guarantees
    },
    
    testing: {
      model: "mock",
      response_time_target: 100,
      logging_level: "verbose",
      deterministic: true
    }
  };
  
  return configs[environment] || configs.production;
};
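
A typical call site would be getModelConfig(process.env.NODE_ENV || "production"), with the result threaded into the vLLM, cache, and logging setup above.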

📈 Monitoring & Analytics

Performance Metrics

Key Performance Indicators:

  • Response Time - Average time from message to response
  • Personality Consistency - Measure of character trait adherence
  • User Engagement - Conversation length and frequency metrics
  • Multi-modal Success - Success rate of image/voice processing
  • Memory Accuracy - Correctness of referenced past conversations

Analytics Dashboard Data:

const METRICS_TRACKING = {
  response_times: {
    text_only: "avg_ms",
    with_image: "avg_ms",
    with_voice: "avg_ms",
    complex_context: "avg_ms"
  },
  
  personality_scores: {
    cheerfulness_consistency: "percentage",
    helpfulness_rating: "user_feedback_score",
    character_authenticity: "consistency_score"
  },
  
  feature_usage: {
    voice_interactions: "daily_count",
    image_analysis: "daily_count",
    memory_references: "accuracy_percentage",
    emotional_support: "satisfaction_rating"
  }
};

🚀 Future Enhancements

Planned AI Improvements

Advanced Memory System:

  • Graph-based relationship mapping
  • Emotional memory weighting
  • Cross-server personality consistency
  • Predictive conversation preparation

Enhanced Multimodal Capabilities:

  • Real-time video stream analysis
  • Live drawing/art creation feedback
  • Music generation and composition
  • Interactive storytelling with visuals

Adaptive Learning:

  • Server-specific personality adaptations
  • Individual user relationship modeling
  • Cultural context learning
  • Improved humor and timing

Technical Optimizations:

  • Alternative local LLM backends and quantized deployments
  • Edge computing for faster responses
  • Improved caching strategies
  • Better resource utilization

This AI architecture provides the foundation for Kasane Teto's natural, engaging personality while maintaining safety, consistency, and performance. The modular design allows for continuous improvement and feature expansion while preserving the core character experience users love.

For implementation details, see the Development Guide. For configuration options, see Configuration.