====== Search Indexing ======

The indexing mechanism makes information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page metadata and possibly more in the future.

===== Core API =====

Most code interacting with the search index will use one of three high-level classes. All live in the ''\dokuwiki\Search'' namespace.

==== Indexer ====

The ''Indexer'' class manages the search index. It coordinates all collections and handles the actual writing.

  * ''addPage($page, $force)'' - Add or re-index a page. Handles locking, tokenization, and metadata internally.
  * ''deletePage($page, $force)'' - Remove a page from all indexes.
  * ''renamePage($oldpage, $newpage)'' - Update the page name in the entity index.
  * ''needsIndexing($page, $force)'' - Check whether a page needs (re-)indexing based on version and modification time.
  * ''getAllPages($existsFilter)'' - Return all indexed page names. Optionally filter to only pages that exist on disk.
  * ''getVersion()'' - Return the indexer version string, including plugin versions (see [[devel:event:INDEXER_VERSION_GET]]).
  * ''clear()'' - Delete all index files.
  * ''checkIntegrity()'' - Verify structural consistency across all indexes.
  * ''setLogger($callback)'' - Register a logging callback for progress output.

==== FulltextSearch ====

The ''FulltextSearch'' class handles fulltext search queries.

  * ''pageSearch($query, &$highlight, $sort, $after, $before)'' - Run a fulltext search. Returns matching pages as ''pageid => score''. The ''$highlight'' array is filled with terms to highlight. ''$sort'' can be ''"hits"'' (default) or ''"mtime"''. ''$after''/''$before'' filter by modification time.
  * ''snippet($id, $highlight)'' - Generate a search result snippet for a page.

==== MetadataSearch ====

The ''MetadataSearch'' class provides search operations on page metadata.

  * ''pageLookup($id, $in_ns, $in_title, $after, $before)'' - Quick search for page names. Optionally matches against the namespace and title.
  * ''lookupKey($key, $value)'' - Find pages by metadata value. Supports exact match and wildcards (''*'' at start/end). When ''$value'' is a string the result is a flat list of page names. When it is an array, the result is keyed by each search value.
  * ''backlinks($id, $ignore_perms)'' - Find all pages linking to ''$id''.
  * ''mediause($id, $ignore_perms)'' - Find all pages using a media file.
  * ''getPages($key)'' - Return all indexed pages, optionally limited to those having a value for the given metadata key.

===== Internals =====

The following sections describe how the search index is structured and how the core classes work together to provide the indexing and search functionality. This is meant for developers who want to understand the inner workings of the search system or make use of it in their own plugins.

==== Indexes ====

An index is an individual index file that stores one kind of information, eg. a list of all page names or a list of page-word frequencies.

Indexes are row based. The line number itself carries meaning: lines are counted from zero and referred to as ''rid'' (row ID) in the code.

All index files are stored in the ''data/index'' directory. The file name is the name of the index with an ''idx'' extension. For example, the page name index is stored in ''data/index/page.idx''. Some indexes have additional suffixes (eg. ''w3.idx'') to split the data into multiple files (see [[#Index File Splitting]] below).

Index files can be accessed through two classes:

  * ''\dokuwiki\Search\Index\FileIndex''
  * ''\dokuwiki\Search\Index\MemoryIndex''

Both classes expose the same API; the only difference is how they access the data.

A FileIndex reads through the index file line by line without ever loading the full file into memory. Each modification is written back to the index immediately.

The MemoryIndex loads the whole file into an internal array. Changes are only written back when the ''save()'' method is explicitly called. A memory index is faster but requires more memory.

Which class to use depends mostly on the size of the file.

Usually indexes are not accessed directly but through a collection, which decides which type of access to use.

Within an index, two kinds of data can be stored per row:

  * A single value, eg. an entity or a token
  * A list of tuples, eg. a list of page IDs and frequencies

The former is straightforward: a simple ''rid -> value'' store. The latter maps ''rid -> [key -> value, ...]'' where the key is usually the ''rid'' in another index.
73
=== Index File Splitting ===

To improve memory efficiency and access speed, token and frequency indexes can be split into multiple physical files using suffixes based on token length. A suffix parameter is appended to the base index name to create the actual filename. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter tokens)
  * Resulting file: ''w3.idx''

> Note: token lengths are counted in bytes, not characters. This means that for languages with multi-byte characters, the suffixes will reflect the byte length of the tokens, which may differ from the character count.
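
As a standalone illustration of byte-based counting (the variable and values here are made up):

<code php>
// In UTF-8, 'über' is 4 characters but 5 bytes, because 'ü' takes 2 bytes.
$token = 'über';
$byteLength = strlen($token);    // 5 -> such a token would land in w5.idx
$charLength = mb_strlen($token); // 4 -> NOT what the suffix is based on
</code>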

In a fulltext collection with splitting enabled:

  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
  * and so on...

When splitting is disabled, a single file is used for each index (eg. ''relation_media_w.idx'').

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (eg. when a Term is using a wildcard).

=== Tuple Data Format ===

Tuple-based index rows store associations between keys (typically ''rid''s from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:
  * ''key'' - Usually the ''rid'' from another index (e.g., a page ID)
  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:
<code>
42*5:17*3:98*12
</code>

This means:
  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

Counts of 1 are omitted from the index: a tuple without a ''*count'' part implies a frequency of 1. For example:

<code>
42*5:17:98
</code>

would be interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word once
  * Entity with RID 98 contains this word once

The ''TupleOps'' class provides utility methods for working with tuple records:
  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record
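
As a standalone sketch of the record format (illustrative only; the real parsing lives in ''TupleOps'' and may differ in detail):

<code php>
/**
 * Parse a tuple record like "42*5:17:98" into [key => count].
 * A tuple without a '*count' part implies a count of 1.
 */
function parseTupleRecord(string $record): array
{
    $result = [];
    if ($record === '') return $result;
    foreach (explode(':', $record) as $tuple) {
        if (strpos($tuple, '*') !== false) {
            [$key, $count] = explode('*', $tuple, 2);
            $result[(int)$key] = (int)$count;
        } else {
            $result[(int)$tuple] = 1; // implicit frequency of 1
        }
    }
    return $result;
}
</code>

For example, ''parseTupleRecord('42*5:17:98')'' yields ''[42 => 5, 17 => 1, 98 => 1]''.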
135
==== Collections ====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a use case covered by a collection.

> Please note: because index has a specific meaning in our context (see above), you should avoid using that word when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

A collection manages up to four indexes:

  * **entity** - The main entity that will be the result of a search. Eg. a page. entity.RID -> entity
  * **token** - The actual information strewn across the entities. Eg. words. token.RID -> token
  * **frequency** - Maps tokens to entities and records their frequency. token.RID -> entity.RID*frequency:...
  * **reverse** - Records which tokens are assigned to each entity. Used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up.

The reverse index format depends on whether the collection uses split indexes:
  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''
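
The two formats can be illustrated with a small standalone helper (hypothetical code, assuming only the formats described above):

<code php>
/**
 * Parse a reverse-index record into [tokenLength, tokenId] pairs.
 * Split collections store "len*id" entries; non-split ones store bare ids
 * (tokenLength is returned as null in that case).
 */
function parseReverseRecord(string $record, bool $split): array
{
    $entries = [];
    foreach (explode(':', $record) as $entry) {
        if ($entry === '') continue;
        if ($split) {
            [$len, $id] = explode('*', $entry, 2);
            $entries[] = [(int)$len, (int)$id];
        } else {
            $entries[] = [null, (int)$entry];
        }
    }
    return $entries;
}
</code>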
152
Collections have two independent properties: a type and whether they use split indexes.

The **collection type** determines how tokens relate to entities:

  * frequency collections - The same token can appear multiple times in the same entity, and searches are usually interested in the number of times it appears. This is the words-on-pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, so all frequencies are 1. Searches do not care about the frequency; they are only interested in whether a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example, a page has exactly one title. Direct collections only use entity and token index files (entity.RID === token.RID), no frequency or reverse indexes.

Independently of the collection type, a collection can use **split or non-split token indexes**. See the [[#Index File Splitting]] section above.
162
^ Name                   ^ Type      ^ Split? ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^
| FullText               | frequency | yes    | page   | w*                    | i*                    | pageword              |
| Title                  | direct    | no     | page   | title                 | -                     | -                     |
| MetaRelationMedia      | lookup    | no     | page   | relation_media_w      | relation_media_i      | relation_media_p      |
| MetaRelationReferences | lookup    | no     | page   | relation_references_w | relation_references_i | relation_references_p |
168
=== Writing data ===

''addEntity($entity, $tokens)'' is the main method for writing data to a collection. It replaces all previously stored tokens for the given entity. An empty token list removes the entity's data. The collection must be locked before calling this method.

<code php>
$collection = new PageFulltextCollection($pageIndex);
$collection->lock();
$collection->addEntity('wiki:page', $words);
$collection->unlock();
</code>

Internally, ''addEntity()'' reads the reverse index to find the entity's old tokens, resolves the new tokens to IDs (creating them in the token index if needed), merges old and new, and updates the frequency and reverse indexes accordingly. Tokens no longer present are automatically removed from the frequency index.

For direct collections, ''addEntity()'' simply writes the first token at the entity's position in the token index.
183
=== Reading data ===

Collections provide some basic information retrieval methods, but they are not meant for searching.

  * ''getEntitiesWithData()'' - Return all entity names that have data in this collection.
  * For direct collections, ''getToken($entity)'' retrieves the single token stored for an entity (eg. a page title).

Searching across a collection is done through the ''CollectionSearch'' class (see below).
192
==== Locking ====

Only one process may write to an index at any time. To ensure this, a locking mechanism is employed.

Indexes are opened in read-only mode by default. Passing ''$isWritable = true'' to the constructor (or calling ''lock()'' later) acquires a lock and enables writing. Calling ''unlock()'' releases it.

The ''Lock'' class is a static registry with reference counting. ''Lock::acquire($name)'' creates a filesystem lock directory. Multiple calls within the same process share a single lock via reference counting. ''Lock::release($name)'' decrements the count and removes the directory when it reaches zero. Stale locks older than 5 minutes are automatically broken.

Collections call ''lock()'' to acquire locks for all their indexes at once, and ''unlock()'' to release them.
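
The reference-counting behavior can be sketched like this (an illustrative stand-in, not the real ''Lock'' class; names, paths, and error handling are made up, and stale-lock breaking is omitted):

<code php>
// Directory-based lock with per-process reference counting.
class SketchLock
{
    private static array $refs = [];

    public static function acquire(string $name): void
    {
        $count = self::$refs[$name] ?? 0;
        if ($count === 0 && !@mkdir(sys_get_temp_dir() . '/sketchlock_' . $name)) {
            throw new RuntimeException("lock '$name' is held by another process");
        }
        self::$refs[$name] = $count + 1; // same process shares the lock
    }

    public static function release(string $name): void
    {
        $count = self::$refs[$name] ?? 0;
        if ($count <= 1) {
            @rmdir(sys_get_temp_dir() . '/sketchlock_' . $name);
            unset(self::$refs[$name]);
        } else {
            self::$refs[$name] = $count - 1;
        }
    }
}
</code>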
202
==== Tokenizer ====

The ''Tokenizer'' class (in ''\dokuwiki\Search'') is responsible for splitting text into indexable tokens.

''Tokenizer::getWords($text, $wc)'' splits the given text into an array of lowercase tokens. Tokens shorter than the minimum word length (default 2, configurable via ''IDX_MINWORDLENGTH'') are discarded, as are language-specific stop words loaded from ''inc/lang/<lang>/stopwords.txt''. Asian characters receive special treatment: they are separated into individual characters and measured with a length function that accounts for multi-byte sequences.
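
A heavily simplified sketch of that behavior (ignoring stop words and Asian character handling; not the real implementation):

<code php>
// Lowercase the text, split on anything that is not a letter, digit, or
// underscore, and drop tokens shorter than the minimum word length.
function sketchGetWords(string $text, int $minLength = 2): array
{
    $tokens = preg_split('/[^\p{L}\p{N}_]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_filter(
        $tokens,
        fn($token) => mb_strlen($token) >= $minLength
    ));
}
</code>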

When ''$wc'' is true, wildcard characters (''*'') are preserved in the output. This is used by the query parser.

The [[devel:event:INDEXER_TEXT_PREPARE]] event fires before tokenization, allowing plugins to pre-process the text.
212
==== CollectionSearch and Terms ====

The ''CollectionSearch'' class executes searches against any collection. Use ''addTerm()'' to register search terms, then call ''execute()''.

<code php>
$search = new CollectionSearch($collection);
$search->addTerm('wiki*');
$terms = $search->execute();
foreach ($terms as $term) {
    $term->getEntityFrequencies(); // [entityName => totalFrequency, ...]
    $term->getEntityTokens();      // [entityName => [tokenName, ...], ...]
    $term->getMatches();           // [entityName => [tokenName => freq, ...], ...]
}
</code>

''addTerm()'' returns a ''Term'' object. After ''execute()'', each Term holds the full match detail: which tokens matched on which entities with what frequencies. Various accessors provide different views on this data.

A ''Term'' represents a single search query component that can match one or more tokens in an index. Terms can include wildcards using the ''*'' character:
  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")

Matching uses efficient string functions (''==='' for exact, ''str_starts_with''/''str_ends_with''/''str_contains'' for wildcards). For case-insensitive matching, call ''caseInsensitive()'' on the search or on individual terms. This is useful for metadata/title searches where indexed values preserve case (the fulltext token index is already lowercased by the Tokenizer).
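
The wildcard logic maps onto those functions roughly like this (a standalone sketch, not the ''Term'' class itself):

<code php>
// Match a token against a term with optional leading/trailing '*' wildcards.
function sketchTermMatches(string $term, string $token): bool
{
    $prefix = str_ends_with($term, '*');   // "wiki*"
    $suffix = str_starts_with($term, '*'); // "*wiki"
    $needle = trim($term, '*');
    if ($prefix && $suffix) return str_contains($token, $needle); // "*wiki*"
    if ($prefix) return str_starts_with($token, $needle);
    if ($suffix) return str_ends_with($token, $needle);
    return $token === $needle;             // exact match
}
</code>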

Terms organize their matching tokens by length. This is crucial for working with split indexes: a term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long. Each length group can be looked up in the corresponding suffixed token index, allowing efficient searching without loading irrelevant files.

For example, searching for ''wiki*'' might find:
  * Token "wiki" appears 5 times on page "start"
  * Token "wikitext" appears 3 times on page "start"
  * ''getEntityFrequencies()'' returns ''['start' => 8]''
  * ''getMatches()'' returns ''['start' => ['wiki' => 5, 'wikitext' => 3]]''
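
The relationship between those views can be reproduced with plain array functions (a standalone illustration of the data shapes, not actual API calls):

<code php>
// Shape returned by getMatches() for the example above.
$matches = ['start' => ['wiki' => 5, 'wikitext' => 3]];

// getEntityFrequencies() sums the per-token counts per entity.
$frequencies = array_map('array_sum', $matches);  // ['start' => 8]

// getEntityTokens() keeps only the matched token names per entity.
$tokens = array_map('array_keys', $matches);      // ['start' => ['wiki', 'wikitext']]
</code>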

''Term'' does not enforce a minimum token length. For fulltext search, callers should filter short words before calling ''addTerm()'', using ''Tokenizer::getMinWordLength()''.
247
==== Fulltext Search Query Processing ====

For fulltext searches a proper query language is supported (see [[:Search]]). Queries go through two stages:

=== QueryParser ===

''QueryParser::convert($query)'' parses a search query string into an intermediate representation. It supports:

  * Individual words and phrases (quoted strings)
  * Namespace filtering with ''@ns:'' or ''ns:'' and ''-ns:'' for exclusion
  * Negation with ''-'' prefix
  * Boolean ''OR'' between terms
  * Grouping with parentheses

The output includes an array in Reverse Polish Notation (RPN) used by the evaluator, plus extracted highlights, word lists, phrase lists, and namespace filters.
263
=== QueryEvaluator ===

''QueryEvaluator'' takes the RPN array and the ''Term'' results from ''CollectionSearch'' and evaluates the boolean logic. It uses typed stack entries during processing:

  * **PageSet** - Concrete set of pages with scores. Supports intersect (AND), unite (OR), subtract (NOT).
  * **NamespacePredicate** - Lazy filter that only materializes when combined with a PageSet.
  * **NegatedEntry** - Wraps another entry to represent logical NOT, allowing AND to convert it to set subtraction.

The result is a list of matching pages and their frequency scores.
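
A bare-bones sketch of evaluating such an RPN array over score maps (AND intersects, OR unites, both summing scores; negation and namespace predicates are omitted, and all names are illustrative):

<code php>
/**
 * Evaluate an RPN array whose operands are [page => score] maps and whose
 * operators are the strings 'AND' and 'OR'.
 */
function sketchEvalRpn(array $rpn): array
{
    $stack = [];
    foreach ($rpn as $item) {
        if ($item === 'AND') {
            $b = array_pop($stack);
            $a = array_pop($stack);
            $result = [];
            foreach (array_intersect_key($a, $b) as $page => $score) {
                $result[$page] = $score + $b[$page]; // pages in both sets
            }
            $stack[] = $result;
        } elseif ($item === 'OR') {
            $b = array_pop($stack);
            $a = array_pop($stack);
            foreach ($b as $page => $score) {
                $a[$page] = ($a[$page] ?? 0) + $score; // union of both sets
            }
            $stack[] = $a;
        } else {
            $stack[] = $item; // operand: a [page => score] map
        }
    }
    return array_pop($stack);
}
</code>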

Phrase verification reads the raw wiki text of candidate pages. Plugins can override this via [[devel:event:FULLTEXT_PHRASE_MATCH]].

==== Background Indexing ====

Pages are indexed asynchronously by the [[:taskrunner|TaskRunner]], which is triggered after each page view. It calls ''Indexer::addPage()'' for pages that need re-indexing and ''Indexer::deletePage()'' for pages that no longer exist on disk. The CLI tool ''bin/indexer.php'' can be used to index all pages at once.

The [[devel:event:INDEXER_TASKS_RUN]] event fires during background task execution, allowing plugins to hook their own maintenance tasks into the indexing cycle.

===== Plugin Events =====

The search system fires several events that plugins can use to extend or modify indexing and search behavior.

Indexing:
  * [[devel:event:INDEXER_VERSION_GET]] - Plugins add their version to force re-indexing when the plugin changes.
  * [[devel:event:INDEXER_PAGE_ADD]] - Modify page body, title, or metadata before it enters the index.
  * [[devel:event:INDEXER_TEXT_PREPARE]] - Pre-process text before tokenization.
  * [[devel:event:INDEXER_TASKS_RUN]] - Hook into the background task runner.

Searching:
  * [[devel:event:SEARCH_QUERY_FULLPAGE]] - Intercept or replace fulltext search.
  * [[devel:event:SEARCH_QUERY_PAGELOOKUP]] - Intercept or replace page name lookup.
  * [[devel:event:FULLTEXT_SNIPPET_CREATE]] - Provide custom search result snippets.
  * [[devel:event:FULLTEXT_PHRASE_MATCH]] - Override phrase matching logic.

===== Exceptions =====

All search-related exceptions extend ''SearchException'':

  * ''SearchException'' - Base class for search/index errors
  * ''IndexAccessException'' - Failed to read an index file
  * ''IndexWriteException'' - Failed to write to an index file
  * ''IndexLockException'' - Failed to acquire or release a lock
  * ''IndexUsageException'' - Incorrect API usage (eg. writing without a lock)
  * ''IndexIntegrityException'' - Structural inconsistency detected in the indexes
309