# AGENTS.md

This file provides guidance to LLM Code Agents when working with code in this repository.

## Overview

This is a DokuWiki plugin that enables AI-powered chat functionality using LLMs (Large Language Models) and RAG (Retrieval-Augmented Generation). The plugin indexes wiki pages as embeddings in a vector database and allows users to ask questions about wiki content.

## Development Commands

### Testing
```bash
../../../bin/plugin.php dev test
```

### CLI Commands
The plugin provides a CLI interface via `cli.php`:

```bash
# Get a list of available commands
../../../bin/plugin.php aichat --help
```

## Architecture

### Core Components

**helper.php (helper_plugin_aichat)**
- Main entry point for plugin functionality
- Manages model factory and configuration
- Handles question answering with context retrieval
- Prepares messages with chat history and token limits
- Implements question rephrasing for better context search

**Embeddings.php**
- Manages the vector embeddings index
- Splits pages into chunks using TextSplitter
- Creates and retrieves embeddings via embedding models
- Performs similarity searches through storage backends
- Handles incremental indexing (only updates changed pages)

**TextSplitter.php**
- Splits text into chunks of a configurable token size (typically ~1000 tokens)
- Prefers sentence boundaries using Vanderlee\Sentence
- Handles long sentences by splitting at word boundaries
- Maintains overlap between chunks (MAX_OVERLAP_LEN = 200 tokens) for context preservation

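The splitting strategy described for TextSplitter can be sketched roughly as follows. This is a Python illustration, not the plugin's PHP code: `count_tokens` approximates tokens by whitespace-separated words, sentence detection is a naive regex rather than Vanderlee\Sentence, and the function names and the tiny sizes in the usage lines are made up for the example. In this sketch the overlap is carried on top of the chunk size, so a chunk may hold up to `chunk_size + max_overlap` tokens; the real splitter may budget this differently.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word counts as one token.
    return len(text.split())


def split_into_chunks(text: str, chunk_size: int = 1000, max_overlap: int = 200) -> list[str]:
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Sentences longer than a chunk are cut at word boundaries.
    pieces = []
    for sentence in sentences:
        words = sentence.split()
        for i in range(0, len(words), chunk_size):
            pieces.append(" ".join(words[i:i + chunk_size]))

    chunks = []
    current, current_len = [], 0
    for piece in pieces:
        piece_len = count_tokens(piece)
        if current and current_len + piece_len > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing pieces into the next chunk as overlap,
            # using at most max_overlap tokens.
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                prev_len = count_tokens(prev)
                if overlap_len + prev_len > max_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += prev_len
            current, current_len = overlap, overlap_len
        current.append(piece)
        current_len += piece_len
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_into_chunks(
    "One two three. Four five six. Seven eight nine.",
    chunk_size=6,
    max_overlap=3,
)
```
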
**ModelFactory.php**
- Creates and caches model instances (chat, rephrase, embedding)
- Loads model configurations from Model/*/models.json files
- Supports multiple providers: OpenAI, Gemini, Anthropic, Mistral, Ollama, Groq, Reka, VoyageAI

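The create-once, reuse-afterwards behaviour of the factory can be pictured with a small Python sketch. It is not the plugin's PHP implementation: the class layout, the `created` counter, the role names, and the model names `example-chat` and `example-embed` are hypothetical. Only the config key names follow the `chatmodel` / `rephrasemodel` / `embedmodel` settings listed under Configuration.

```python
class ModelFactory:
    """Creates model objects on first use and reuses them afterwards."""

    def __init__(self, config: dict):
        self.config = config
        self._cache = {}    # role -> model instance
        self.created = 0    # counts real constructions, for illustration only

    def get_model(self, role: str) -> dict:
        # role is "chat", "rephrase" or "embed" in this sketch; the matching
        # config key is the role name followed by "model".
        if role not in self._cache:
            self._cache[role] = self._create(self.config[role + "model"])
        return self._cache[role]

    def _create(self, name: str) -> dict:
        # Stand-in for instantiating a provider-specific model class.
        self.created += 1
        return {"name": name}


factory = ModelFactory({"chatmodel": "example-chat", "embedmodel": "example-embed"})
first = factory.get_model("chat")
second = factory.get_model("chat")  # served from the cache
```
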
### Model System

**Model/AbstractModel.php**
- Base class for all LLM implementations
- Handles API communication with retry logic (MAX_RETRIES = 3)
- Tracks usage statistics (tokens, costs, time, requests)
- Implements debug mode for API inspection
- Uses DokuHTTPClient for HTTP requests

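A retry loop of the kind described for AbstractModel might look like the Python sketch below, which waits for as many seconds as the current retry count before trying again. It is an illustration under assumptions: the function name, the use of `ConnectionError` as the failure signal, and the exact attempt counting (one initial attempt plus up to `MAX_RETRIES` retries) are not taken from the plugin's PHP code.

```python
import time

MAX_RETRIES = 3


def request_with_retries(send, sleep=time.sleep):
    """Call send() and retry on ConnectionError with a growing delay."""
    retry = 0
    while True:
        try:
            return send()
        except ConnectionError:
            if retry >= MAX_RETRIES:
                raise  # give up after the last allowed retry
            retry += 1
            sleep(retry)  # wait 1s, then 2s, then 3s


# Usage: a request that fails twice before succeeding. The sleep function is
# replaced by a recorder so the example finishes instantly.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = request_with_retries(flaky_request, sleep=delays.append)
```
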
**Model Interfaces**
- `ChatInterface`: For conversational models (getAnswer method)
- `EmbeddingInterface`: For embedding models (getEmbedding method, getDimensions method)
- `ModelInterface`: Base interface with token limits and pricing info

**Model Providers**
Each provider has its own namespace under Model/:
- OpenAI/, Gemini/, Anthropic/, Mistral/, Ollama/, Groq/, Reka/, VoyageAI/
- Each contains ChatModel.php and/or EmbeddingModel.php
- Model info (token limits, pricing, dimensions) defined in models.json

### Storage Backends

**Storage/AbstractStorage.php**
- Abstract base for vector storage implementations
- Defines interface for chunk storage and similarity search

**Available Implementations:**
- SQLiteStorage: Local SQLite database
- ChromaStorage: Chroma vector database
- PineconeStorage: Pinecone cloud service
- QdrantStorage: Qdrant vector database

### Data Flow

1. **Indexing**: Pages → TextSplitter → Chunks → EmbeddingModel → Vector Storage
2. **Querying**: Question → EmbeddingModel → Vector → Storage.getSimilarChunks() → Filtered Chunks
3. **Chat**: Question + History + Context Chunks → ChatModel → Answer

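The indexing and querying steps can be illustrated with a toy Python sketch. Everything here is a stand-in: `toy_embedding` counts a few hard-coded keywords instead of calling an embedding model, `InMemoryStorage` is a hypothetical replacement for the real storage backends, and cosine similarity is only one plausible similarity measure. The chat step is omitted.

```python
import math

KEYWORDS = ["install", "plugin", "syntax", "table"]


def toy_embedding(text: str) -> list[float]:
    # Stand-in for an embedding model: one dimension per keyword.
    words = text.lower().split()
    return [float(words.count(keyword)) for keyword in KEYWORDS]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStorage:
    """Hypothetical storage backend that keeps all vectors in a list."""

    def __init__(self):
        self.chunks = []  # (page, text, vector)

    def add_chunk(self, page: str, text: str) -> None:
        # Indexing: chunk text -> embedding -> storage
        self.chunks.append((page, text, toy_embedding(text)))

    def get_similar_chunks(self, vector: list[float], limit: int = 2):
        # Querying: score every stored chunk and return the best matches
        scored = [
            (cosine_similarity(vector, stored), page, text)
            for page, text, stored in self.chunks
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


storage = InMemoryStorage()
storage.add_chunk("wiki:install", "how to install a plugin")
storage.add_chunk("wiki:syntax", "table syntax for wiki pages")

results = storage.get_similar_chunks(toy_embedding("install plugin help"), limit=1)
```
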
### Key Features

**Question Rephrasing**
- Converts follow-up questions into standalone questions using chat history
- Controlled by `rephraseHistory` config (number of history entries to use)
- Only applied when rephraseHistory > chatHistory to avoid redundancy

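The rephrasing condition can be expressed as a small guard. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, history is modeled as a list of question/answer pairs, and skipping the step when there is no history at all is an assumption rather than something stated above.

```python
def select_rephrase_history(history: list, rephrase_history: int, chat_history: int) -> list:
    """Return the history entries to use for rephrasing, or [] to skip it."""
    # Rephrasing only adds information when it sees more history than the
    # chat model will receive anyway.
    if not history or rephrase_history <= chat_history:
        return []
    return history[-rephrase_history:]


history = [
    ("What is DokuWiki?", "A wiki engine."),
    ("Does it need a database?", "No."),
    ("How do I install it?", "Unpack the archive."),
]

used = select_rephrase_history(history, rephrase_history=2, chat_history=1)
skipped = select_rephrase_history(history, rephrase_history=1, chat_history=2)
```
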
**Context Management**
- Chunks include a breadcrumb trail (namespace hierarchy + page title)
- Token counting uses tiktoken-php to stay within token limits
- Respects the model's maximum input token length
- Filters chunks by ACL permissions and similarity threshold

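A filter along these lines could select context chunks as in the following Python sketch. The names are hypothetical, scores are assumed to lie between 0 and 1, tokens are approximated by word count instead of tiktoken-php, and `may_read` stands in for DokuWiki's ACL check.

```python
def select_context(chunks: list[dict], threshold: float, max_tokens: int, may_read) -> list[dict]:
    """Pick the best readable chunks that fit into the token budget."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if not may_read(chunk["page"]):
            continue  # user lacks permission for the source page
        if chunk["score"] < threshold:
            continue  # not similar enough to the question
        cost = len(chunk["text"].split())  # assumption: one word ~ one token
        if used + cost > max_tokens:
            break  # budget exhausted
        selected.append(chunk)
        used += cost
    return selected


candidates = [
    {"page": "public:a", "text": "one two three", "score": 0.9},
    {"page": "private:b", "text": "four five", "score": 0.95},
    {"page": "public:c", "text": "six", "score": 0.4},
    {"page": "public:d", "text": "seven eight nine ten", "score": 0.8},
]

context = select_context(
    candidates,
    threshold=0.5,
    max_tokens=5,
    may_read=lambda page: page.startswith("public:"),
)
```
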
**Language Support**
- `preferUIlanguage` setting controls language behavior:
  - LANG_AUTO_ALL: Auto-detect from question
  - LANG_UI_ALL: Always use UI language
  - LANG_UI_LIMITED: Use UI language and limit sources to that language

### AJAX Integration

**action.php**
- Handles `AJAX_CALL_UNKNOWN` event for 'aichat' calls
- Processes questions with chat history
- Returns JSON with answer (as rendered Markdown), sources, and similarity scores
- Implements access restrictions via helper->userMayAccess()
- Optional logging of all interactions

### Frontend
- **script/**: JavaScript for UI integration
- **syntax/**: DokuWiki syntax components
- **renderer.php**: Custom renderer for AI chat output

## Configuration

Plugin configuration is in `conf/`:
- **default.php**: Default config values
- **metadata.php**: Config field definitions and validation

Key settings:
- Model selection: chatmodel, rephrasemodel, embedmodel
- Storage: storage backend type
- API keys: openai_apikey, gemini_apikey, etc.
- Chunk settings: chunkSize, contextChunks, similarityThreshold
- History: chatHistory, rephraseHistory
- Access: restrict (user/group restrictions)
- Indexing filters: skipRegex, matchRegex

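How the two indexing filters might interact is sketched below in Python. The semantics are an assumption, not confirmed by this file: pages matching `skipRegex` are excluded, and when `matchRegex` is set only matching pages are indexed. The function name and the choice to match against the bare page ID are also hypothetical.

```python
import re


def should_index(page_id: str, skip_regex: str = "", match_regex: str = "") -> bool:
    # Assumed rule 1: a page matching skipRegex is never indexed.
    if skip_regex and re.search(skip_regex, page_id):
        return False
    # Assumed rule 2: if matchRegex is set, only matching pages are indexed.
    if match_regex and not re.search(match_regex, page_id):
        return False
    return True


indexed = [
    page
    for page in ["wiki:syntax", "docs:intro", "docs:draft:todo", "blog:post"]
    if should_index(page, skip_regex=r":draft:", match_regex=r"^docs:")
]
```
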
## Testing

Tests live in the `_test/` directory:
- Test classes extend the DokuWikiTest base class
- Test classes use the `@group plugin_aichat` annotation

## Important Implementation Notes

- All token counting uses the TikToken encoder and yields rough estimates
- Chunk IDs are calculated as `pageID * 100 + chunk_sequence` (page IDs come from DokuWiki's internal search index)
- Models are cached in ModelFactory to avoid re-initialization
- API retries back off between attempts by sleeping for as many seconds as the retry count (a linearly growing delay, not an exponential one)
- Breadcrumb trails provide context to the AI without requiring full page content
- Storage backends handle similarity search differently but provide a unified interface
- UTF-8 handling is critical for text splitting (uses `dokuwiki\Utf8\PhpString`)
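
The chunk ID formula from the notes above can be worked through in a few lines of Python. The helper names are hypothetical, and the range check is this sketch's own guard: it follows from the formula that sequence numbers of 100 or more would collide with the IDs of the next page.

```python
def chunk_id(page_id: int, sequence: int) -> int:
    # Guard added for the sketch: keeps the encoding reversible.
    if not 0 <= sequence < 100:
        raise ValueError("sequence must be between 0 and 99")
    return page_id * 100 + sequence


def split_chunk_id(cid: int) -> tuple[int, int]:
    # Inverse of chunk_id: returns (page_id, sequence).
    return divmod(cid, 100)


example = chunk_id(42, 3)          # page 42, fourth chunk -> 4203
page, sequence = split_chunk_id(example)
```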