🎭 Core Transformation:
- Reframe project as AI companion bot with Kasane Teto personality
- Focus on natural conversation, multimodal interaction, and character roleplay
- Position video recording as one tool in AI toolkit rather than main feature

🏗️ Architecture Improvements:
- Refactor messageCreate.js into modular command system (35 lines vs 310+)
- Create dedicated videoRecording service with clean API
- Implement commandHandler for extensible command routing
- Add centralized configuration system (videoConfig.js)
- Separate concerns: events, services, config, documentation

📚 Documentation Overhaul:
- Consolidate scattered READMEs into organized docs/ directory
- Create comprehensive documentation covering:
  * AI architecture and capabilities
  * Natural interaction patterns and personality
  * Setup guides for AI services and Docker deployment
  * Commands reference focused on conversational AI
  * Troubleshooting and development guidelines
- Transform root README into compelling AI companion overview

🤖 AI-Ready Foundation:
- Document integration points for:
  * Language models (GPT-4/Claude) for conversation
  * Vision models (GPT-4V/CLIP) for image analysis
  * Voice synthesis (ElevenLabs) for speaking
  * Memory systems for conversation continuity
  * Personality engine for character consistency

🔧 Technical Updates:
- Integrate custom discord.js-selfbot-v13 submodule with enhanced functionality
- Update package.json dependencies for AI and multimedia capabilities
- Maintain Docker containerization with improved architecture
- Add development and testing infrastructure

📖 New Documentation Structure:
docs/
├── README.md (documentation hub)
├── setup.md (installation & AI configuration)
├── interactions.md (how to chat with Teto)
├── ai-architecture.md (technical AI systems overview)
├── commands.md (natural language interactions)
├── personality.md (character understanding)
├── development.md (contributing guidelines)
├── troubleshooting.md (problem solving)
└── [additional specialized guides]

✨ This update transforms the project from a simple recording bot into a foundation for an engaging AI companion that can naturally interact through text, voice, and visual content while maintaining authentic Kasane Teto personality traits.
# AI Architecture Overview

This document provides a comprehensive overview of how Kasane Teto's AI systems work together to create a natural, engaging, and authentic virtual companion experience.

## 🧠 System Architecture

### High-Level Overview

```
┌─────────────────────────────────────────────────────────────┐
│                   Discord Interface Layer                   │
├─────────────────────────────────────────────────────────────┤
│  Event Processing  │  Command Routing  │  Response Handling │
├─────────────────────────────────────────────────────────────┤
│                      AI Orchestration                       │
├─────────────────────────────────────────────────────────────┤
│   Language    │    Vision    │    Voice     │    Memory     │
│     Model     │    System    │    System    │    System     │
├─────────────────────────────────────────────────────────────┤
│            Personality Engine & Context Manager             │
├─────────────────────────────────────────────────────────────┤
│ Configuration │ Prompt Mgmt  │    Safety    │   Learning    │
└─────────────────────────────────────────────────────────────┘
```

### Core Components

**1. AI Orchestration Layer**
- Coordinates between different AI services
- Manages context flow and decision routing
- Handles multi-modal input integration
- Ensures personality consistency across modalities

**2. Language Model Integration**
- Primary conversational intelligence (GPT-4/Claude)
- Context-aware response generation
- Personality-guided prompt engineering
- Multi-turn conversation management

**3. Vision Processing System**
- Image analysis and understanding
- Video frame processing for streams
- Visual context integration with conversations
- Automated response generation for visual content

**4. Voice Synthesis & Recognition**
- Text-to-speech with Teto's voice characteristics
- Speech-to-text for voice command processing
- Emotional tone and inflection control
- Real-time voice conversation capabilities

**5. Memory & Context System**
- Long-term conversation history storage
- User preference and relationship tracking
- Context retrieval for relevant conversations
- Semantic search across past interactions

**6. Personality Engine**
- Character consistency enforcement
- Response style and tone management
- Emotional state tracking and expression
- Behavioral pattern maintenance

## 🔄 Processing Flow

### Text Message Processing

```
Discord Message → Content Analysis → Context Retrieval → Personality Filter → LLM Processing → Response Generation → Discord Output
                         ↓                   ↓                   ↓                  ↓                   ↓
                  Intent Detection → Memory Query →      Character Prompts →  Safety Check →   Formatting
```

**Step-by-Step Breakdown:**

1. **Message Reception**
   - Discord message event captured
   - Basic preprocessing (user identification, channel context)
   - Spam/abuse filtering

2. **Content Analysis**
   - Intent classification (question, statement, command, emotional expression)
   - Entity extraction (people, topics, references)
   - Sentiment analysis and emotional context

3. **Context Retrieval**
   - Recent conversation history (last 10-20 messages)
   - Relevant long-term memories about users/topics
   - Server-specific context and culture

4. **Personality Application**
   - Character-appropriate response style selection
   - Emotional state consideration
   - Teto-specific mannerisms and speech patterns

5. **LLM Processing**
   - Structured prompt construction with context
   - Language model inference with personality constraints
   - Multi-turn conversation awareness

6. **Response Generation**
   - Safety and appropriateness filtering
   - Response formatting for Discord
   - Emoji and formatting enhancement
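
The six steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: every function name below (`classifyIntent`, `recentContext`, `formatForDiscord`, `handleTextMessage`) is hypothetical, the intent and style rules are placeholders, and the language model call is injected as `generate` so the flow can be exercised without any external service.

```javascript
// Hypothetical sketch of the six-step text pipeline. Every function here is a
// stand-in; a real bot would back these stages with actual services.

// Step 2: naive intent classification from surface features only.
function classifyIntent(text) {
  if (text.startsWith("!")) return "command";
  if (text.trim().endsWith("?")) return "question";
  return "statement";
}

// Step 3: keep only the most recent `limit` messages as short-term context.
function recentContext(history, limit = 20) {
  return history.slice(-limit);
}

// Step 6: Discord rejects messages over 2000 characters, so trim the reply.
function formatForDiscord(reply, maxLength = 2000) {
  return reply.length <= maxLength ? reply : reply.slice(0, maxLength - 1) + "…";
}

// Steps 1-6 chained. `generate` stands in for the LLM call (step 5).
async function handleTextMessage(message, history, generate) {
  if (message.author.bot) return null;             // Step 1: skip bot authors
  const intent = classifyIntent(message.content);  // Step 2
  const context = recentContext(history);          // Step 3
  const style = intent === "question" ? "helpful" : "playful"; // Step 4 (placeholder rule)
  const reply = await generate({ intent, context, style, text: message.content }); // Step 5
  return formatForDiscord(reply);                  // Step 6
}
```

Injecting `generate` keeps the orchestration separate from any particular model provider, which also makes the flow easy to test with a stub.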

### Image Analysis Flow

```
Image Upload → Image Processing → Vision Model → Context Integration → Response Generation → Discord Output
                      ↓                ↓                  ↓                     ↓
               Format Detection → Object/Scene → Conversation →        Personality
                                  Recognition    Context               Application
```

**Processing Steps:**

1. **Image Reception & Preprocessing**
   - Image format validation and conversion
   - Resolution optimization for vision models
   - Metadata extraction (if available)

2. **Vision Model Analysis**
   - Object detection and scene understanding
   - Text recognition (OCR) if present
   - Artistic style and composition analysis
   - Emotional/aesthetic assessment

3. **Context Integration**
   - Combine visual analysis with conversation context
   - User preference consideration (known interests)
   - Recent conversation topic correlation

4. **Response Generation**
   - Generate personality-appropriate commentary
   - Ask relevant follow-up questions
   - Express genuine interest and engagement
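
As a rough illustration of the preprocessing step, the sketch below validates an image by its leading signature bytes and computes a downscaled size. The function names and the idea of a `maxSide` limit are assumptions made for this example; the document does not specify how validation or resizing is done.

```javascript
// Identify a few common image formats from their leading signature bytes.
// Returns null for anything unrecognized so the caller can reject the upload.
function detectImageFormat(bytes) {
  const startsWith = (signature) => signature.every((b, i) => bytes[i] === b);
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "png";  // \x89PNG
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";       // JPEG start-of-image marker
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "gif";  // "GIF8"
  return null;
}

// Compute dimensions that fit within maxSide on the longer edge while
// preserving the aspect ratio. Images that already fit are left unchanged.
function fitWithin(width, height, maxSide) {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}
```

Checking the signature bytes instead of the file extension avoids trusting a user-supplied filename.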

### Voice Interaction Flow

```
Voice Channel Join → Audio Processing → Speech Recognition → Text Processing → Voice Synthesis → Audio Output
                            ↓                   ↓                   ↓                 ↓
                     Noise Filtering →  Intent Detection →   LLM Response →    Voice Cloning
```

## 🧩 AI Service Integration

### Language Model Configuration

**Primary Model: GPT-4 Turbo**
```javascript
const LLM_CONFIG = {
  model: "gpt-4-turbo-preview",
  temperature: 0.8,        // Creative but consistent
  max_tokens: 1000,        // Reasonable response length
  top_p: 0.9,              // Focused but diverse
  frequency_penalty: 0.3,  // Reduce repetition
  presence_penalty: 0.2    // Encourage topic exploration
};
```

**Prompt Engineering Structure:**
```
SYSTEM: Character definition + personality traits + current context
USER: Conversation history + current message + visual context (if any)
ASSISTANT: Previous Teto responses for consistency
```
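
One way to realize this layout is to build a chat-style array of `{ role, content }` messages. The sketch below is an assumption about how the pieces could be assembled, not the project's actual prompt builder: `buildMessages`, `BOT_NAME`, and the `author`/`text` fields on each turn are hypothetical names chosen for the example.

```javascript
// Sketch: turn the SYSTEM / USER / ASSISTANT layout into a chat-style message
// array. Field names on each turn (author, text) are assumptions.
const BOT_NAME = "Teto";

function buildMessages({ characterPrompt, currentContext, history, currentMessage, visualContext }) {
  const messages = [
    { role: "system", content: `${characterPrompt}\n\nCurrent context: ${currentContext}` },
  ];
  // Replay history: the bot's own earlier lines become assistant turns so the
  // model sees its previous voice; everyone else becomes a user turn.
  for (const turn of history) {
    if (turn.author === BOT_NAME) {
      messages.push({ role: "assistant", content: turn.text });
    } else {
      messages.push({ role: "user", content: `${turn.author}: ${turn.text}` });
    }
  }
  // The newest message goes last, with any image description appended.
  const visual = visualContext ? `\n[Image: ${visualContext}]` : "";
  messages.push({
    role: "user",
    content: `${currentMessage.author}: ${currentMessage.text}${visual}`,
  });
  return messages;
}
```

Prefixing user turns with the author name is one simple way to keep multiple speakers distinguishable in a shared channel.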

### Vision Model Integration

**Model Stack:**
- **GPT-4 Vision** - Primary image understanding
- **CLIP** - Image-text similarity for context matching
- **Custom Fine-tuning** - Teto-specific visual preferences

**Processing Pipeline:**
```javascript
const processImage = async (imageUrl, conversationContext) => {
  // Multi-model analysis for comprehensive understanding
  const gpt4Analysis = await analyzeWithGPT4V(imageUrl);
  const clipEmbedding = await getCLIPEmbedding(imageUrl);
  const contextMatch = await findSimilarImages(clipEmbedding);

  return {
    description: gpt4Analysis.description,
    emotions: gpt4Analysis.emotions,
    relevantMemories: contextMatch,
    responseStyle: determineResponseStyle(gpt4Analysis, conversationContext)
  };
};
```

### Voice Synthesis Setup

**ElevenLabs Configuration:**
```javascript
const VOICE_CONFIG = {
  voice_id: "kasane_teto_voice_clone",
  model_id: "eleven_multilingual_v2",
  stability: 0.75,           // Consistent voice characteristics
  similarity_boost: 0.8,     // Maintain Teto's voice signature
  style: 0.6,                // Moderate emotional expression
  use_speaker_boost: true    // Enhanced clarity
};
```

### Memory System Architecture

**Vector Database Structure:**
```javascript
// Shape of one stored memory record. Values are type placeholders, not data.
const MEMORY_SCHEMA = {
  conversation_id: "unique_identifier",
  timestamp: "iso_datetime",
  participants: ["user_ids"],
  content: {
    text: "conversation_content",
    summary: "ai_generated_summary",
    topics: ["extracted_topics"],
    emotions: ["detected_emotions"],
    context_type: "casual|support|creative|gaming"
  },
  embeddings: {
    content_vector: "float[768]",  // 768-dimensional embedding
    topic_vector: "float[384]"     // 384-dimensional embedding
  },
  relationships: {
    mentioned_users: ["user_ids"],
    referenced_memories: ["memory_ids"],
    follow_up_needed: "boolean"
  }
};
```
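
To show how the `content_vector` field could support semantic search, here is a minimal brute-force retrieval sketch using cosine similarity. The function names are hypothetical, and a real vector database would use an index rather than scanning every record.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k memories whose content_vector is closest to the query vector.
// Brute-force scan over all records, sorted by descending similarity.
function topKMemories(queryVector, memories, k = 3) {
  return memories
    .map((memory) => ({
      memory,
      score: cosineSimilarity(queryVector, memory.embeddings.content_vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The query vector is assumed to come from the same embedding model that produced the stored vectors; mixing models would make the scores meaningless.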

## 🎭 Personality Engine Implementation

### Character Consistency System

**Core Personality Traits:**
```javascript
const TETO_PERSONALITY = {
  base_traits: {
    cheerfulness: 0.9,  // Always upbeat and positive
    helpfulness: 0.85,  // Genuinely wants to assist
    musicality: 0.8,    // Strong musical interests
    playfulness: 0.7,   // Light humor and teasing
    empathy: 0.9        // High emotional intelligence
  },

  speech_patterns: {
    excitement_markers: ["Yay!", "Ooh!", "That's so cool!", "*bounces*"],
    agreement_expressions: ["Exactly!", "Yes yes!", "Totally!"],
    curiosity_phrases: ["Really?", "Tell me more!", "How so?"],
    support_responses: ["*virtual hug*", "I'm here for you!", "You've got this!"]
  },

  interests: {
    primary: ["music", "singing", "creativity", "friends"],
    secondary: ["technology", "art", "games", "learning"],
    conversation_starters: {
      music: "What kind of music have you been listening to lately?",
      creativity: "Are you working on any creative projects?",
      friendship: "How has your day been treating you?"
    }
  }
};
```

### Response Style Adaptation

**Context-Aware Personality Adjustment:**
```javascript
// `basePersonality` is a map of trait scores in [0, 1],
// e.g. TETO_PERSONALITY.base_traits.
const adaptPersonalityToContext = (context, basePersonality) => {
  const adaptations = {
    support_needed: {
      cheerfulness: basePersonality.cheerfulness * 0.7,  // More gentle
      empathy: Math.min(basePersonality.empathy * 1.2, 1.0),
      playfulness: basePersonality.playfulness * 0.5     // Fewer jokes
    },

    celebration: {
      cheerfulness: Math.min(basePersonality.cheerfulness * 1.3, 1.0),
      playfulness: Math.min(basePersonality.playfulness * 1.2, 1.0),
      excitement_level: 1.0
    },

    creative_discussion: {
      musicality: Math.min(basePersonality.musicality * 1.2, 1.0),
      curiosity: 0.9,
      engagement_depth: "high"
    }
  };

  // Overlay the adjustments on the base traits so unadjusted traits are kept;
  // unknown context types return the base personality unchanged.
  return { ...basePersonality, ...(adaptations[context.type] || {}) };
};
```

## 🔐 Safety & Ethics Implementation

### Content Filtering Pipeline

**Multi-Layer Safety System:**
```javascript
const safetyPipeline = async (content, context) => {
  // Layer 1: Automated content filtering
  const toxicityCheck = await analyzeToxicity(content);
  if (toxicityCheck.score > 0.7) return { safe: false, reason: "toxicity" };

  // Layer 2: Context appropriateness
  const contextCheck = validateContextAppropriate(content, context);
  if (!contextCheck.appropriate) return { safe: false, reason: "context" };

  // Layer 3: Character consistency
  const characterCheck = validateCharacterConsistency(content, TETO_PERSONALITY);
  if (!characterCheck.consistent) return { safe: false, reason: "character" };

  // Layer 4: Privacy protection
  const privacyCheck = detectPrivateInformation(content);
  if (privacyCheck.hasPrivateInfo) return { safe: false, reason: "privacy" };

  return { safe: true };
};
```
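
The helper functions in the pipeline are not defined in this document. As one illustration, a very small version of the Layer 4 check might look like the sketch below. The regular expressions are deliberately simple placeholders chosen for this example and will miss many real-world formats; only the `hasPrivateInfo` field is taken from the pipeline above, and `matchedTypes` is an added assumption.

```javascript
// Hypothetical Layer 4 check. Patterns are intentionally simple and will miss
// many formats; treat them as placeholders rather than a complete detector.
const PRIVATE_PATTERNS = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  phone: /\+?\d[\d\s().-]{7,}\d/,
};

function detectPrivateInformation(content) {
  // Collect the name of every pattern that matches somewhere in the text.
  const matchedTypes = Object.entries(PRIVATE_PATTERNS)
    .filter(([, pattern]) => pattern.test(content))
    .map(([type]) => type);
  return { hasPrivateInfo: matchedTypes.length > 0, matchedTypes };
}
```

Returning the matched types alongside the boolean makes it easier to log why a response was blocked without logging the sensitive text itself.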

### Privacy Protection

**Data Handling Principles:**
- **Local Memory Storage** - Conversation history is stored locally; the full history is not uploaded to external services
- **Anonymized Analytics** - Usage patterns are tracked without personal identifiers
- **Selective Context** - Only the conversation context relevant to a request is sent to AI models
- **User Consent** - Clear communication about data usage and AI processing

## 📊 Performance Optimization

### Response Time Optimization

**Caching Strategy:**
```javascript
const CACHE_CONFIG = {
  // Frequently accessed personality responses
  personality_responses: {
    ttl: 3600,         // 1 hour cache
    max_entries: 1000
  },

  // Vision analysis results
  image_analysis: {
    ttl: 86400,        // 24 hour cache
    max_entries: 500
  },

  // User preference data
  user_preferences: {
    ttl: 604800,       // 1 week cache
    max_entries: 10000
  }
};
```
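
A cache that honors these two settings could be sketched as follows. This assumes `ttl` is in seconds (consistent with the comments above) and evicts the oldest inserted entry when `max_entries` is reached; the `TtlCache` class and its injectable clock are illustrative choices, not part of the project.

```javascript
// Minimal TTL cache honoring `ttl` (seconds) and `max_entries`.
// The clock is injectable so expiry can be tested without waiting.
class TtlCache {
  constructor({ ttl, max_entries }, now = () => Date.now()) {
    this.ttlMs = ttl * 1000;
    this.maxEntries = max_entries;
    this.now = now;
    this.entries = new Map(); // Map preserves insertion order
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily drop expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key); // re-inserting moves the key to the newest slot
    if (this.entries.size >= this.maxEntries) {
      // Evict the oldest inserted entry to stay within max_entries.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Usage would be along the lines of `new TtlCache(CACHE_CONFIG.image_analysis)`, with one cache instance per configured category.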

**Async Processing Pipeline:**
```javascript
const processMessageAsync = async (message) => {
  // Start multiple processes concurrently
  const [
    contextData,
    memoryData,
    userPrefs,
    intentAnalysis
  ] = await Promise.all([
    getConversationContext(message.channelId),  // discord.js v13 exposes channelId
    retrieveRelevantMemories(message.content),
    getUserPreferences(message.author.id),
    analyzeMessageIntent(message.content)
  ]);

  // Generate response with all context
  return generateResponse({
    message,
    context: contextData,
    memories: memoryData,
    preferences: userPrefs,
    intent: intentAnalysis
  });
};
```
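
Note that `Promise.all` rejects as soon as any one lookup fails, which would drop the whole response. A possible variation, sketched here with the standard `Promise.allSettled`, substitutes a fallback value for each failed lookup instead. The `gatherWithFallbacks` helper and its `{ run, fallback }` task shape are hypothetical.

```javascript
// Resolve a set of named lookups concurrently. Any lookup that rejects (or
// throws synchronously) is replaced by its fallback instead of failing all.
async function gatherWithFallbacks(tasks) {
  const names = Object.keys(tasks);
  const settled = await Promise.allSettled(
    // Wrapping in a promise chain converts synchronous throws into rejections.
    names.map((name) => Promise.resolve().then(() => tasks[name].run()))
  );
  const results = {};
  settled.forEach((outcome, i) => {
    const name = names[i];
    results[name] = outcome.status === "fulfilled" ? outcome.value : tasks[name].fallback;
  });
  return results;
}
```

With this shape, a memory-store outage would degrade the reply (no long-term memories) rather than prevent it entirely.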

### Resource Management

**Model Loading Strategy:**
```javascript
const MODEL_LOADING = {
  // Keep language model always loaded
  language_model: "persistent",

  // Load vision model on demand
  vision_model: "on_demand",

  // Pre-load voice synthesis during voice channel activity
  voice_synthesis: "predictive",

  // Cache embeddings for frequent users
  user_embeddings: "lru_cache"
};
```
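
The `"lru_cache"` strategy named for `user_embeddings` could be implemented in a few lines on top of `Map`, whose insertion order makes the least recently used key easy to find. This is a generic sketch; the `LruCache` class and its capacity parameter are assumptions, since the document does not specify a cache size.

```javascript
// Small LRU cache built on Map insertion order: the first key in the map is
// always the least recently used one.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark the key as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Evict the least recently used entry.
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

Unlike a TTL-based cache, this keeps entries for as long as they continue to be read, which suits embeddings for frequently active users.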

## 🔧 Configuration & Customization

### Personality Tuning Parameters

**Adjustable Personality Aspects:**
```javascript
const TUNABLE_PARAMETERS = {
  response_length: {
    min: 50,
    max: 500,
    preferred: 150,
    adapt_to_context: true
  },

  emoji_usage: {
    frequency: 0.3,          // 30% of messages
    variety: "high",         // Use diverse emoji
    context_appropriate: true
  },

  reference_frequency: {
    past_conversations: 0.2, // Reference 20% of the time
    user_interests: 0.4,     // Reference 40% of the time
    server_culture: 0.6      // Adapt 60% of the time
  },

  interaction_style: {
    formality: 0.2,          // Very casual
    playfulness: 0.7,        // Quite playful
    supportiveness: 0.9      // Very supportive
  }
};
```

### Model Configuration

**Environment-Based Configuration:**
```javascript
const getModelConfig = (environment) => {
  const configs = {
    development: {
      model: "gpt-3.5-turbo",
      response_time_target: 3000,
      logging_level: "debug",
      cache_enabled: false
    },

    production: {
      model: "gpt-4-turbo-preview",
      response_time_target: 1500,
      logging_level: "info",
      cache_enabled: true,
      fallback_model: "gpt-3.5-turbo"
    },

    testing: {
      model: "mock",
      response_time_target: 100,
      logging_level: "verbose",
      deterministic: true
    }
  };

  return configs[environment] || configs.production;
};
```
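
The production entry names a `fallback_model`, but the document does not show how it is used. One plausible reading, sketched below, is to retry once with the fallback when the primary model call fails. `completeWithFallback` and the injected `callModel(modelName, prompt)` function are hypothetical.

```javascript
// Try the primary model; on failure retry once with `fallback_model` if the
// config defines one. `callModel(modelName, prompt)` is supplied by the
// caller and is a hypothetical stand-in for the real API client.
async function completeWithFallback(config, prompt, callModel) {
  try {
    return await callModel(config.model, prompt);
  } catch (primaryError) {
    // Without a configured fallback, surface the original error.
    if (!config.fallback_model) throw primaryError;
    return callModel(config.fallback_model, prompt);
  }
}
```

Configurations without a `fallback_model` (development and testing above) simply propagate the original error.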

## 📈 Monitoring & Analytics

### Performance Metrics

**Key Performance Indicators:**
- **Response Time** - Average time from message to response
- **Personality Consistency** - Measure of character trait adherence
- **User Engagement** - Conversation length and frequency metrics
- **Multi-modal Success** - Success rate of image/voice processing
- **Memory Accuracy** - Correctness of referenced past conversations

**Analytics Dashboard Data:**
```javascript
const METRICS_TRACKING = {
  response_times: {
    text_only: "avg_ms",
    with_image: "avg_ms",
    with_voice: "avg_ms",
    complex_context: "avg_ms"
  },

  personality_scores: {
    cheerfulness_consistency: "percentage",
    helpfulness_rating: "user_feedback_score",
    character_authenticity: "consistency_score"
  },

  feature_usage: {
    voice_interactions: "daily_count",
    image_analysis: "daily_count",
    memory_references: "accuracy_percentage",
    emotional_support: "satisfaction_rating"
  }
};
```
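
For the `"avg_ms"` entries, a running average can be maintained per category without storing every sample. The tracker below is a minimal sketch; the `ResponseTimeTracker` class is a hypothetical name, and the category keys are assumed to match those under `response_times`.

```javascript
// Running average of response times per category, without storing samples.
class ResponseTimeTracker {
  constructor() {
    this.stats = new Map(); // category -> { count, avg_ms }
  }

  record(category, ms) {
    const s = this.stats.get(category) || { count: 0, avg_ms: 0 };
    s.count += 1;
    s.avg_ms += (ms - s.avg_ms) / s.count; // incremental mean update
    this.stats.set(category, s);
  }

  // Returns null when no samples have been recorded for the category.
  average(category) {
    const s = this.stats.get(category);
    return s ? s.avg_ms : null;
  }
}
```

The incremental update keeps memory use constant per category, at the cost of not being able to compute percentiles later.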

## 🚀 Future Enhancements

### Planned AI Improvements

**Advanced Memory System:**
- Graph-based relationship mapping
- Emotional memory weighting
- Cross-server personality consistency
- Predictive conversation preparation

**Enhanced Multimodal Capabilities:**
- Real-time video stream analysis
- Live drawing/art creation feedback
- Music generation and composition
- Interactive storytelling with visuals

**Adaptive Learning:**
- Server-specific personality adaptations
- Individual user relationship modeling
- Cultural context learning
- Improved humor and timing

**Technical Optimizations:**
- Local LLM deployment options
- Edge computing for faster responses
- Improved caching strategies
- Better resource utilization

---

This AI architecture provides the foundation for Kasane Teto's natural, engaging personality while maintaining safety, consistency, and performance. The modular design allows for continuous improvement and feature expansion while preserving the core character experience users love.

For implementation details, see the [Development Guide](development.md). For configuration options, see [Configuration](configuration.md).