xref: /plugin/aichat/AGENTS.md (revision e2b35d4678a98550a165d36dd3db63cf0d958ab7)
1af51f172SAndreas Gohr# AGENTS.md
2af51f172SAndreas Gohr
3af51f172SAndreas GohrThis file provides guidance to LLM Code Agents when working with code in this repository.
4af51f172SAndreas Gohr
5af51f172SAndreas Gohr## Overview
6af51f172SAndreas Gohr
7af51f172SAndreas GohrThis is a DokuWiki plugin that enables AI-powered chat functionality using LLMs (Large Language Models) and RAG (Retrieval-Augmented Generation). The plugin indexes wiki pages as embeddings in a vector database and allows users to ask questions about wiki content.
8af51f172SAndreas Gohr
9af51f172SAndreas Gohr## Development Commands
10af51f172SAndreas Gohr
11af51f172SAndreas Gohr### Testing
12*e2b35d46SAndreas Gohr
13af51f172SAndreas Gohr```bash
14af51f172SAndreas Gohr../../../bin/plugin.php dev test
15af51f172SAndreas Gohr```
16af51f172SAndreas Gohr
17*e2b35d46SAndreas GohrThe command does not accept any additional arguments or parameters and runs all tests in the `_test/` directory.
18*e2b35d46SAndreas Gohr
19*e2b35d46SAndreas GohrPHPUnit can also be called directly, when special options are needed:
20*e2b35d46SAndreas Gohr
21*e2b35d46SAndreas Gohr```bash
22*e2b35d46SAndreas Gohr../../../_test/vendor/bin/phpunit -c ../../../_test/phpunit.xml --group plugin_aichat
23*e2b35d46SAndreas Gohr```
24*e2b35d46SAndreas Gohr
25af51f172SAndreas Gohr### CLI Commands
26af51f172SAndreas GohrThe plugin provides a CLI interface via `cli.php`:
27af51f172SAndreas Gohr
28af51f172SAndreas Gohr```bash
29af51f172SAndreas Gohr# Get a list of available commands
30af51f172SAndreas Gohr../../../bin/plugin.php aichat --help
31af51f172SAndreas Gohr```
32af51f172SAndreas Gohr
33af51f172SAndreas Gohr## Architecture
34af51f172SAndreas Gohr
35af51f172SAndreas Gohr### Core Components
36af51f172SAndreas Gohr
37af51f172SAndreas Gohr**helper.php (helper_plugin_aichat)**
38af51f172SAndreas Gohr- Main entry point for plugin functionality
39af51f172SAndreas Gohr- Manages model factory and configuration
40af51f172SAndreas Gohr- Handles question answering with context retrieval
41af51f172SAndreas Gohr- Prepares messages with chat history and token limits
42af51f172SAndreas Gohr- Implements question rephrasing for better context search
43af51f172SAndreas Gohr
44af51f172SAndreas Gohr**Embeddings.php**
45af51f172SAndreas Gohr- Manages the vector embeddings index
46af51f172SAndreas Gohr- Splits pages into chunks using TextSplitter
47af51f172SAndreas Gohr- Creates and retrieves embeddings via embedding models
48af51f172SAndreas Gohr- Performs similarity searches through storage backends
49af51f172SAndreas Gohr- Handles incremental indexing (only updates changed pages)
50af51f172SAndreas Gohr
51af51f172SAndreas Gohr**TextSplitter.php**
52af51f172SAndreas Gohr- Splits text into token-sized chunks (configurable, typically ~1000 tokens)
53af51f172SAndreas Gohr- Prefers sentence boundaries using Vanderlee\Sentence
54af51f172SAndreas Gohr- Handles long sentences by splitting at word boundaries
55af51f172SAndreas Gohr- Maintains overlap between chunks (MAX_OVERLAP_LEN = 200 tokens) for context preservation
56af51f172SAndreas Gohr
57af51f172SAndreas Gohr**ModelFactory.php**
58af51f172SAndreas Gohr- Creates and caches model instances (chat, rephrase, embedding)
59af51f172SAndreas Gohr- Loads model configurations from Model/*/models.json files
60af51f172SAndreas Gohr- Supports multiple providers: OpenAI, Gemini, Anthropic, Mistral, Ollama, Groq, Reka, VoyageAI
61af51f172SAndreas Gohr
62af51f172SAndreas Gohr### Model System
63af51f172SAndreas Gohr
64af51f172SAndreas Gohr**Model/AbstractModel.php**
65af51f172SAndreas Gohr- Base class for all LLM implementations
66af51f172SAndreas Gohr- Handles API communication with retry logic (MAX_RETRIES = 3)
67af51f172SAndreas Gohr- Tracks usage statistics (tokens, costs, time, requests)
68af51f172SAndreas Gohr- Implements debug mode for API inspection
69af51f172SAndreas Gohr- Uses DokuHTTPClient for HTTP requests
70af51f172SAndreas Gohr
71af51f172SAndreas Gohr**Model Interfaces**
72af51f172SAndreas Gohr- `ChatInterface`: For conversational models (getAnswer method)
73af51f172SAndreas Gohr- `EmbeddingInterface`: For embedding models (getEmbedding method, getDimensions method)
74af51f172SAndreas Gohr- `ModelInterface`: Base interface with token limits and pricing info
75af51f172SAndreas Gohr
76af51f172SAndreas Gohr**Model Providers**
77af51f172SAndreas GohrEach provider has its own namespace under Model/:
78af51f172SAndreas Gohr- OpenAI/, Gemini/, Anthropic/, Mistral/, Ollama/, Groq/, Reka/, VoyageAI/
79af51f172SAndreas Gohr- Each contains ChatModel.php and/or EmbeddingModel.php
80af51f172SAndreas Gohr- Model info (token limits, pricing, dimensions) defined in models.json
81af51f172SAndreas Gohr
82af51f172SAndreas Gohr### Storage Backends
83af51f172SAndreas Gohr
84af51f172SAndreas Gohr**Storage/AbstractStorage.php**
85af51f172SAndreas Gohr- Abstract base for vector storage implementations
86af51f172SAndreas Gohr- Defines interface for chunk storage and similarity search
87af51f172SAndreas Gohr
88af51f172SAndreas Gohr**Available Implementations:**
89af51f172SAndreas Gohr- SQLiteStorage: Local SQLite database
90af51f172SAndreas Gohr- ChromaStorage: Chroma vector database
91af51f172SAndreas Gohr- PineconeStorage: Pinecone cloud service
92af51f172SAndreas Gohr- QdrantStorage: Qdrant vector database
93af51f172SAndreas Gohr
94af51f172SAndreas Gohr### Data Flow
95af51f172SAndreas Gohr
96af51f172SAndreas Gohr1. **Indexing**: Pages → TextSplitter → Chunks → EmbeddingModel → Vector Storage
97af51f172SAndreas Gohr2. **Querying**: Question → EmbeddingModel → Vector → Storage.getSimilarChunks() → Filtered Chunks
98af51f172SAndreas Gohr3. **Chat**: Question + History + Context Chunks → ChatModel → Answer
99af51f172SAndreas Gohr
100af51f172SAndreas Gohr### Key Features
101af51f172SAndreas Gohr
102af51f172SAndreas Gohr**Question Rephrasing**
103af51f172SAndreas Gohr- Converts follow-up questions into standalone questions using chat history
104af51f172SAndreas Gohr- Controlled by `rephraseHistory` config (number of history entries to use)
105af51f172SAndreas Gohr- Only applied when rephraseHistory > chatHistory to avoid redundancy
106af51f172SAndreas Gohr
107af51f172SAndreas Gohr**Context Management**
108af51f172SAndreas Gohr- Chunks include breadcrumb trail (namespace hierarchy + page title)
109af51f172SAndreas Gohr- Token counting uses tiktoken-php for accurate limits
110af51f172SAndreas Gohr- Respects model's max input token length
111af51f172SAndreas Gohr- Filters chunks by ACL permissions and similarity threshold
112af51f172SAndreas Gohr
113af51f172SAndreas Gohr**Language Support**
114af51f172SAndreas Gohr- `preferUIlanguage` setting controls language behavior:
115af51f172SAndreas Gohr  - LANG_AUTO_ALL: Auto-detect from question
116af51f172SAndreas Gohr  - LANG_UI_ALL: Always use UI language
117af51f172SAndreas Gohr  - LANG_UI_LIMITED: Use UI language and limit sources to that language
118af51f172SAndreas Gohr
119af51f172SAndreas Gohr### AJAX Integration
120af51f172SAndreas Gohr
121af51f172SAndreas Gohr**action.php**
122af51f172SAndreas Gohr- Handles `AJAX_CALL_UNKNOWN` event for 'aichat' calls
123af51f172SAndreas Gohr- Processes questions with chat history
124af51f172SAndreas Gohr- Returns JSON with answer (as rendered Markdown), sources, and similarity scores
125af51f172SAndreas Gohr- Implements access restrictions via helper->userMayAccess()
126af51f172SAndreas Gohr- Optional logging of all interactions
127af51f172SAndreas Gohr
128af51f172SAndreas Gohr### Frontend
129af51f172SAndreas Gohr- **script/**: JavaScript for UI integration
130af51f172SAndreas Gohr- **syntax/**: DokuWiki syntax components
131af51f172SAndreas Gohr- **renderer.php**: Custom renderer for AI chat output
132af51f172SAndreas Gohr
133af51f172SAndreas Gohr## Configuration
134af51f172SAndreas Gohr
135af51f172SAndreas GohrPlugin configuration is in `conf/`:
136af51f172SAndreas Gohr- **default.php**: Default config values
137af51f172SAndreas Gohr- **metadata.php**: Config field definitions and validation
138af51f172SAndreas Gohr
139af51f172SAndreas GohrKey settings:
140af51f172SAndreas Gohr- Model selection: chatmodel, rephrasemodel, embedmodel
141af51f172SAndreas Gohr- Storage: storage backend type
142af51f172SAndreas Gohr- API keys: openai_apikey, gemini_apikey, etc.
143af51f172SAndreas Gohr- Chunk settings: chunkSize, contextChunks, similarityThreshold
144af51f172SAndreas Gohr- History: chatHistory, rephraseHistory
145af51f172SAndreas Gohr- Access: restrict (user/group restrictions)
146af51f172SAndreas Gohr- Indexing filters: skipRegex, matchRegex
147af51f172SAndreas Gohr
148af51f172SAndreas Gohr## Testing
149af51f172SAndreas Gohr
150af51f172SAndreas GohrTests are in `_test/` directory:
151af51f172SAndreas Gohr- Extends DokuWikiTest base class
152af51f172SAndreas Gohr- Uses @group plugin_aichat annotation
153af51f172SAndreas Gohr
154af51f172SAndreas Gohr## Important Implementation Notes
155af51f172SAndreas Gohr
156af51f172SAndreas Gohr- All token counting uses TikToken encoder for rough estimates
157af51f172SAndreas Gohr- Chunk IDs are calculated as: pageID * 100 + chunk_sequence (pageIDs come from DokuWiki's internal search index)
158af51f172SAndreas Gohr- Models are cached in ModelFactory to avoid re-initialization
159af51f172SAndreas Gohr- API retries use exponential backoff (sleep for retry count seconds)
160af51f172SAndreas Gohr- Breadcrumb trails provide context to AI without requiring full page content
161af51f172SAndreas Gohr- Storage backends handle similarity search differently but provide unified interface
162af51f172SAndreas Gohr- UTF-8 handling is critical for text splitting (uses dokuwiki\Utf8\PhpString)
163