====== Search Indexing ======

The indexing mechanism is meant to make information that is normally distributed over several locations (e.g. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page metadata and possibly more in the future.

===== Core API =====

Most code interacting with the search index will use one of three high-level classes. All live in the ''\dokuwiki\Search'' namespace.

==== Indexer ====

The ''Indexer'' class manages the search index. It coordinates all collections and handles the actual writing.

  * ''addPage($page, $force)'' - Add or re-index a page. Handles locking, tokenization, and metadata internally.
  * ''deletePage($page, $force)'' - Remove a page from all indexes.
  * ''renamePage($oldpage, $newpage)'' - Update the page name in the entity index.
  * ''needsIndexing($page, $force)'' - Check whether a page needs (re-)indexing based on version and modification time.
  * ''getAllPages($existsFilter)'' - Return all indexed page names. Optionally filter to only pages that exist on disk.
  * ''getVersion()'' - Return the indexer version string, including plugin versions (see [[devel:event:INDEXER_VERSION_GET]]).
  * ''clear()'' - Delete all index files.
  * ''checkIntegrity()'' - Verify structural consistency across all indexes.
  * ''setLogger($callback)'' - Register a logging callback for progress output.

==== FulltextSearch ====

The ''FulltextSearch'' class handles fulltext search queries.

  * ''pageSearch($query, &$highlight, $sort, $after, $before)'' - Run a fulltext search. Returns matching pages as ''pageid => score''. The ''$highlight'' array is filled with terms to highlight. ''$sort'' can be ''"hits"'' (default) or ''"mtime"''. ''$after''/''$before'' filter by modification time.
  * ''snippet($id, $highlight)'' - Generate a search result snippet for a page.

==== MetadataSearch ====

The ''MetadataSearch'' class provides search operations on page metadata.

  * ''pageLookup($id, $in_ns, $in_title, $after, $before)'' - Quick search for page names. Optionally matches against the namespace and title.
  * ''lookupKey($key, $value)'' - Find pages by metadata value. Supports exact match and wildcards (''*'' at start/end). When ''$value'' is a string the result is a flat list of page names. When it is an array, the result is keyed by each search value.
  * ''backlinks($id, $ignore_perms)'' - Find all pages linking to ''$id''.
  * ''mediause($id, $ignore_perms)'' - Find all pages using a media file.
  * ''getPages($key)'' - Return all indexed pages, optionally limited to those having a value for the given metadata key.

===== Internals =====

The following sections describe how the search index is structured and how the core classes work together to provide the indexing and search functionality. This is meant for developers who want to understand the inner workings of the search system or make use of it in their own plugins.

==== Indexes ====

Indexes refer to individual index files that store one kind of information, e.g. a list of all page names or a list of page-word frequencies.

Indexes are row based. The line number is itself part of the stored information. Lines are counted from zero and referred to as ''rid'' in the code.

All index files are stored in the ''data/index'' directory. The file name is the name of the index with an ''idx'' extension. For example, the page name index is stored in ''data/index/page.idx''. Some indexes have additional suffixes (e.g. ''w3.idx'') to split the data into multiple files (see [[#Index File Splitting]] below).

Index files can be accessed through two classes:

  * ''\dokuwiki\Search\Index\FileIndex''
  * ''\dokuwiki\Search\Index\MemoryIndex''

Both classes expose the same API; the only difference is how they access the data.

A FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.

The MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.
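
The trade-off between the two access styles can be illustrated without the real classes. The sketch below is a hypothetical stand-in, not the ''FileIndex''/''MemoryIndex'' implementation: ''streamRow()'' scans a file line by line until it reaches the requested ''rid'', while ''loadRows()'' pulls every row into an array. Both function names and the temporary file are assumptions made for this example.

<code php>
// Hypothetical helpers, not part of DokuWiki: they only mimic the two access styles.

/** Scan the file line by line and return the row at $rid (null if missing). */
function streamRow(string $file, int $rid): ?string
{
    $fh = fopen($file, 'r');
    if ($fh === false) return null;
    $row = null;
    for ($line = 0; ($data = fgets($fh)) !== false; $line++) {
        if ($line === $rid) {
            $row = rtrim($data, "\n");
            break;
        }
    }
    fclose($fh);
    return $row;
}

/** Load all rows at once; the array key is the rid. */
function loadRows(string $file): array
{
    return file($file, FILE_IGNORE_NEW_LINES) ?: [];
}

$file = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($file, "start\nwiki:syntax\nwiki:welcome\n");

$viaStream = streamRow($file, 1);   // reads only as far as needed
$viaMemory = loadRows($file)[1];    // whole file held in memory
unlink($file);
</code>

Both calls yield the same row; they differ only in how much of the file is read and kept in memory.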

Which method to use depends mostly on the size of the file.

Usually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.

Within an index two kinds of data can be stored per row:

  * A single value, e.g. an entity or a token
  * A list of tuples, e.g. a list of pageIDs and frequencies

The former is straightforward: it is a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'' where key is usually the ''rid'' in another index.

=== Index File Splitting ===

To improve memory efficiency and access speed, token and frequency indexes can be split into multiple physical files using suffixes based on token length. A suffix parameter is appended to the base index name to create the actual filename. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter tokens)
  * Resulting file: ''w3.idx''

> Note: token lengths are counted in bytes, not characters. This means that for languages with multi-byte characters, the suffixes will reflect the byte length of the tokens, which may differ from the character count.
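
The effect of counting bytes rather than characters can be seen in a small sketch. ''splitIndexFile()'' is a hypothetical helper written for this example (it is not a DokuWiki function); it simply joins base name, byte length and extension.

<code php>
// Hypothetical helper: derives the split file name from a base name and a token.
function splitIndexFile(string $base, string $token): string
{
    // strlen() counts bytes, matching the byte-based suffixes
    return $base . strlen($token) . '.idx';
}

$ascii = splitIndexFile('w', 'wiki');          // 4 bytes, 4 characters
$umlaut = splitIndexFile('w', "\u{00FC}ber");  // "über" in UTF-8: 5 bytes, 4 characters
</code>

The ASCII word ends up in ''w4.idx'', while the four-character word with an umlaut is filed under ''w5.idx''.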

In a fulltext collection with splitting enabled:

  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
  * and so on...

When splitting is disabled, a single file is used for each index (e.g. ''relation_media_w.idx'').

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (e.g. when a Term is using a wildcard).

=== Tuple Data Format ===

Tuple-based index rows store associations between keys (typically ''rid''s from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:
  * ''key'' - Usually the ''rid'' from another index (e.g., a page ID)
  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:
<code>
42*5:17*3:98*12
</code>

This means:
  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

Frequencies of 1 are not stored in the index. For example:

<code>
42*5:17:98
</code>

This record is interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 1 time
  * Entity with RID 98 contains this word 1 time

The ''TupleOps'' class provides utility methods for working with tuple records:
  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record
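
To make the record format concrete, here is a minimal sketch of parsing, updating and re-serializing a tuple record. ''parseRecord()'' and ''buildRecord()'' are hypothetical stand-ins written for this page; the signatures of the real ''TupleOps'' methods are not shown here and may differ.

<code php>
// Hypothetical stand-ins for illustration; the real TupleOps signatures may differ.

/** Parse "key*count:key*count" into [key => count]; a missing count means 1. */
function parseRecord(string $record): array
{
    $result = [];
    if ($record === '') return $result;
    foreach (explode(':', $record) as $tuple) {
        $parts = explode('*', $tuple, 2);
        $result[$parts[0]] = isset($parts[1]) ? (int) $parts[1] : 1;
    }
    return $result;
}

/** Serialize [key => count] back into a record, omitting counts of 1. */
function buildRecord(array $tuples): string
{
    $out = [];
    foreach ($tuples as $key => $count) {
        $out[] = $count === 1 ? (string) $key : $key . '*' . $count;
    }
    return implode(':', $out);
}

$tuples = parseRecord('42*5:17:98');   // counts for 17 and 98 default to 1
$tuples[17] = 3;                       // update one key
$total = array_sum($tuples);           // sum of all counts in the record
$record = buildRecord($tuples);        // 98 keeps its implicit count of 1
</code>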

==== Collections ====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. For example, fulltext search for page contents is a use case covered by a collection.

> Please note: because index has a specific meaning in our context (see above), you should avoid using that word when you are actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

A collection manages up to four indexes:

  * **entity** - The main entity that will be the result of a search, e.g. a page. entity.RID -> entity
  * **token** - The actual information strewn across the entities, e.g. words. token.RID -> token
  * **frequency** - Maps tokens to entities and records their frequency. token.RID -> entity.RID*frequency:...
  * **reverse** - Records which tokens are assigned to each entity. Used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up.

The reverse index format depends on whether the collection uses split indexes:
  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''
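
The two reverse record formats can be read with the same loop. The following sketch uses a hypothetical helper, ''groupReverseRecord()'', that groups token IDs by token length; filing all IDs of a non-split collection under the key ''0'' is purely a convention of this example.

<code php>
// Hypothetical helper: turns a reverse record into [tokenLength => [tokenId, ...]].
// For non-split collections all IDs are grouped under key 0 (an assumption of this sketch).
function groupReverseRecord(string $record, bool $split): array
{
    $groups = [];
    if ($record === '') return $groups;
    foreach (explode(':', $record) as $entry) {
        if ($split) {
            [$length, $tokenId] = explode('*', $entry, 2);
        } else {
            [$length, $tokenId] = [0, $entry];
        }
        $groups[(int) $length][] = (int) $tokenId;
    }
    return $groups;
}

$splitGroups = groupReverseRecord('4*12:8*3:4*7', true);   // IDs 12 and 7 live in the length-4 files
$flatGroups = groupReverseRecord('12:3:7', false);         // one file, no length needed
</code>

Grouping by length first means each split file only has to be opened once when an entity's old tokens are cleaned up.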

Collections have two independent properties: a type and whether they use split indexes or not.

The **collection type** determines how tokens relate to entities:

  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, so all frequencies are 1. Searches do not care about the frequency; they are only interested in whether a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example a page has exactly one title. Direct collections only use entity and token index files (entity.RID === token.RID), no frequency or reverse indexes.
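
The difference between counting and deduplication on input can be shown with two standard PHP array functions. This is only an illustration of the idea, not the code the collections use.

<code php>
$words = ['wiki', 'page', 'wiki', 'syntax', 'wiki'];

// frequency collection: count how often each token occurs
$frequency = array_count_values($words);

// lookup collection: deduplicate, so every token has a frequency of 1
$lookup = array_fill_keys(array_unique($words), 1);
</code>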

Independently of the collection type, a collection can use **split or non-split token indexes**. See the [[#Index File Splitting]] section above.

^ Name                   ^ Type      ^ Split? ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^
| FullText               | frequency | yes    | page   | w*                    | i*                    | pageword              |
| Title                  | direct    | no     | page   | title                 | -                     | -                     |
| MetaRelationMedia      | lookup    | no     | page   | relation_media_w      | relation_media_i      | relation_media_p      |
| MetaRelationReferences | lookup    | no     | page   | relation_references_w | relation_references_i | relation_references_p |

=== Writing data ===

''addEntity($entity, $tokens)'' is the main method for writing data to a collection. It replaces all previously stored tokens for the given entity. An empty token list removes the entity's data. The collection must be locked before calling this method.

<code php>
$collection = new PageFulltextCollection($pageIndex);
$collection->lock();
$collection->addEntity('wiki:page', $words);
$collection->unlock();
</code>

Internally, ''addEntity()'' reads the reverse index to find the entity's old tokens, resolves the new tokens to IDs (creating them in the token index if needed), merges old and new, and updates the frequency and reverse indexes accordingly. Tokens no longer present are automatically removed from the frequency index.
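
The update cycle can be modelled with plain arrays. The sketch below is a deliberately simplified, hypothetical model: the three arrays stand in for the token, frequency and reverse index files, there is no splitting or locking, and ''addEntitySketch()'' is not a DokuWiki function.

<code php>
// Hypothetical in-memory model; the real indexes live in files and may use split suffixes.
$tokenIdx = [];    // token rid  => token
$freqIdx = [];     // token rid  => [entity rid => frequency]
$reverseIdx = [];  // entity rid => [token rid, ...]

function addEntitySketch(int $entityId, array $words, array &$tokenIdx, array &$freqIdx, array &$reverseIdx): void
{
    // 1. the reverse record tells us which tokens the entity had before
    $oldTokenIds = $reverseIdx[$entityId] ?? [];

    // 2. resolve the new tokens to IDs, creating missing tokens on the fly
    $newCounts = [];
    foreach (array_count_values($words) as $word => $count) {
        $word = (string) $word;   // numeric words become int keys, so cast back
        $tokenId = array_search($word, $tokenIdx, true);
        if ($tokenId === false) {
            $tokenIdx[] = $word;
            $tokenId = array_key_last($tokenIdx);
        }
        $newCounts[$tokenId] = $count;
    }

    // 3. drop the entity from tokens it no longer has, then write the new counts
    foreach (array_diff($oldTokenIds, array_keys($newCounts)) as $goneId) {
        unset($freqIdx[$goneId][$entityId]);
    }
    foreach ($newCounts as $tokenId => $count) {
        $freqIdx[$tokenId][$entityId] = $count;
    }

    // 4. remember the new token list for the next update
    $reverseIdx[$entityId] = array_keys($newCounts);
}

addEntitySketch(0, ['wiki', 'page', 'wiki'], $tokenIdx, $freqIdx, $reverseIdx);
addEntitySketch(0, ['wiki', 'syntax'], $tokenIdx, $freqIdx, $reverseIdx);
</code>

After the second call the token "page" is still present in the token array, but its frequency row no longer lists entity 0. This mirrors the cleanup role of the reverse index.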

For direct collections, ''addEntity()'' simply writes the first token at the entity's position in the token index.

=== Reading data ===

Collections provide some basic information retrieval methods, but they are not meant for searching.

  * ''getEntitiesWithData()'' - Return all entity names that have data in this collection.
  * For direct collections, ''getToken($entity)'' retrieves the single token stored for an entity (e.g. a page title).

Searching across a collection is done through the ''CollectionSearch'' class (see below).

==== Locking ====

Only one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.

Indexes are opened in read-only mode by default. Passing ''$isWritable = true'' to the constructor (or calling ''lock()'' later) acquires a lock and enables writing. Calling ''unlock()'' releases it.

The ''Lock'' class is a static registry with reference counting. ''Lock::acquire($name)'' creates a filesystem lock directory. Multiple calls within the same process share a single lock via reference counting. ''Lock::release($name)'' decrements the count and removes the directory when it reaches zero. Stale locks older than 5 minutes are automatically broken.
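
The combination of a lock directory and a per-process reference count can be sketched as follows. ''SketchLock'' is a hypothetical class for illustration only; it takes a full directory path instead of a lock name and leaves out the stale-lock handling of the real ''Lock'' class.

<code php>
// Hypothetical sketch of a reference-counted directory lock; not the real Lock class.
final class SketchLock
{
    /** @var array<string,int> lock directory => number of holders in this process */
    private static array $counts = [];

    public static function acquire(string $dir): bool
    {
        if ((self::$counts[$dir] ?? 0) > 0) {
            self::$counts[$dir]++;      // already ours: just count the extra holder
            return true;
        }
        // mkdir() fails if the directory exists, so only one process can create it
        if (!@mkdir($dir)) return false;
        self::$counts[$dir] = 1;
        return true;
    }

    public static function release(string $dir): void
    {
        if ((self::$counts[$dir] ?? 0) === 0) return;
        if (--self::$counts[$dir] === 0) {
            unset(self::$counts[$dir]);
            @rmdir($dir);               // last holder removes the lock directory
        }
    }
}

$dir = sys_get_temp_dir() . '/sketch-' . uniqid() . '.lock';
SketchLock::acquire($dir);
SketchLock::acquire($dir);              // nested call shares the same lock
SketchLock::release($dir);
clearstatcache();
$heldAfterFirstRelease = is_dir($dir);  // one holder is left
SketchLock::release($dir);
clearstatcache();
$heldAfterLastRelease = is_dir($dir);   // directory has been removed
</code>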

Collections call ''lock()'' to acquire locks for all their indexes at once, and ''unlock()'' to release them.

==== Tokenizer ====

The ''Tokenizer'' class (in ''\dokuwiki\Search'') is responsible for splitting text into indexable tokens.

''Tokenizer::getWords($text, $wc)'' splits the given text into an array of lowercase tokens. Tokens shorter than the minimum word length (default 2, configurable via ''IDX_MINWORDLENGTH'') are discarded, as are language-specific stop words loaded from ''inc/lang/<lang>/stopwords.txt''. Asian characters receive special treatment: they are separated into individual characters and measured with a length function that accounts for multi-byte sequences.

When ''$wc'' is true, wildcard characters (''*'') are preserved in the output. This is used by the query parser.
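
The filtering rules can be pictured with a heavily simplified stand-in. ''sketchWords()'' below is hypothetical and handles ASCII text only; it ignores Asian characters, the event system and the stop word files, and takes the stop words as a plain array instead.

<code php>
// Hypothetical, heavily simplified tokenizer for ASCII text; not Tokenizer::getWords().
function sketchWords(string $text, array $stopWords, int $minLength = 2, bool $keepWildcards = false): array
{
    $text = strtolower($text);
    // split on anything that is not a letter, a digit or (optionally) the wildcard
    $pattern = $keepWildcards ? '/[^a-z0-9*]+/' : '/[^a-z0-9]+/';
    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    $keep = static function (string $word) use ($stopWords, $minLength): bool {
        // wildcards do not count towards the length of a word
        return strlen(trim($word, '*')) >= $minLength && !in_array($word, $stopWords, true);
    };
    return array_values(array_filter($words, $keep));
}

$indexed = sketchWords('The DokuWiki is a wiki', ['the', 'is']);
$queried = sketchWords('Wiki* a', ['the', 'is'], 2, true);   // wildcard survives for the query parser
</code>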

The [[devel:event:INDEXER_TEXT_PREPARE]] event fires before tokenization, allowing plugins to pre-process the text.

==== CollectionSearch and Terms ====

The ''CollectionSearch'' class executes searches against any collection. Use ''addTerm()'' to register search terms, then call ''execute()''.

<code php>
$search = new CollectionSearch($collection);
$search->addTerm('wiki*');
$terms = $search->execute();
foreach ($terms as $term) {
    $term->getEntityFrequencies(); // [entityName => totalFrequency, ...]
    $term->getEntityTokens();      // [entityName => [tokenName, ...], ...]
    $term->getMatches();           // [entityName => [tokenName => freq, ...], ...]
}
</code>

''addTerm()'' returns a ''Term'' object. After ''execute()'', each Term holds the full match detail: which tokens matched on which entities with what frequencies. Various accessors provide different views on this data.

A ''Term'' represents a single search query component that can match one or more tokens in an index. Terms can include wildcards using the ''*'' character:
  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")

Matching uses efficient string functions (''==='' for exact, ''str_starts_with''/''str_ends_with''/''str_contains'' for wildcards). For case-insensitive matching, call ''caseInsensitive()'' on the search or on individual terms. This is useful for metadata/title searches where indexed values preserve case (the fulltext token index is already lowercased by the Tokenizer).
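
The four term shapes map directly onto those string functions. The following sketch shows one way to express this; ''termMatches()'' is a hypothetical function (not the ''Term'' implementation) and requires PHP 8 for the ''str_*'' functions.

<code php>
// Hypothetical matcher mirroring the four term shapes (requires PHP 8).
function termMatches(string $term, string $token, bool $caseInsensitive = false): bool
{
    if ($caseInsensitive) {
        $term = strtolower($term);
        $token = strtolower($token);
    }
    $leading = str_starts_with($term, '*');
    $trailing = str_ends_with($term, '*');
    $base = trim($term, '*');

    if ($leading && $trailing) return str_contains($token, $base);
    if ($trailing) return str_starts_with($token, $base);
    if ($leading) return str_ends_with($token, $base);
    return $token === $base;
}

$tokens = ['wiki', 'dokuwiki', 'wikitext', 'page'];
$endingInWiki = array_values(array_filter($tokens, fn($token) => termMatches('*wiki', $token)));
</code>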

Terms organize their matching tokens by length. This is crucial for working with split indexes: a term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long. Each length group can be looked up in the corresponding suffixed token index, allowing efficient searching without loading irrelevant files.
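
Which split files are worth opening follows from the length of the base term alone. A minimal sketch, assuming a hypothetical helper ''candidateLengths()'' and a ''$maxLength'' argument that stands for the highest suffix currently in use:

<code php>
// Hypothetical helper: which split suffixes can contain matches for a term?
// $maxLength stands for the highest suffix in use for the index.
function candidateLengths(string $term, int $maxLength): array
{
    $baseLength = strlen(trim($term, '*'));
    if ($baseLength > $maxLength) return [];                // no token is long enough
    if (!str_contains($term, '*')) return [$baseLength];    // exact term: one file only
    return range($baseLength, $maxLength);                  // wildcard: base length and longer
}

$exact = candidateLengths('wiki', 9);
$wild = candidateLengths('*wiki*', 9);
</code>

An exact term touches a single file, while a wildcard term skips every file whose tokens are shorter than the base term.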

For example, searching for ''wiki*'' might find:
  * Token "wiki" appears 5 times on page "start"
  * Token "wikitext" appears 3 times on page "start"
  * ''getEntityFrequencies()'' returns ''['start' => 8]''
  * ''getMatches()'' returns ''['start' => ['wiki' => 5, 'wikitext' => 3]]''
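
The accessors are different views on the same match data. Assuming the array shapes listed above, the relationship can be reproduced with plain array functions; this illustrates the shapes only and is not how ''Term'' computes them.

<code php>
// Match detail in the shape returned by getMatches() in the example above
$matches = ['start' => ['wiki' => 5, 'wikitext' => 3]];

// summing the per-token counts gives the per-entity totals
$entityFrequencies = array_map('array_sum', $matches);

// the token names per entity are the keys of each inner array
$entityTokens = array_map('array_keys', $matches);
</code>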

Term does not enforce minimum token length. For fulltext search, callers should filter short words before calling ''addTerm()'' using ''Tokenizer::isValidSearchTerm($term)''.

==== Fulltext Search Query Processing ====

For fulltext searches a proper query language is supported (see [[:Search]]). Queries go through two stages:

=== QueryParser ===

''QueryParser::convert($query)'' parses a search query string into an intermediate representation. It supports:

  * Individual words and phrases (quoted strings)
  * Namespace filtering with ''@ns:'' or ''ns:'' and ''-ns:'' for exclusion
  * Negation with ''-'' prefix
  * Boolean ''OR'' between terms
  * Grouping with parentheses

The output includes an array in Reverse Polish Notation (RPN) used by the evaluator, plus extracted highlights, word lists, phrase lists, and namespace filters.

=== QueryEvaluator ===

''QueryEvaluator'' takes the RPN array and the ''Term'' results from ''CollectionSearch'' and evaluates the boolean logic. It uses typed stack entries during processing:

  * **PageSet** - Concrete set of pages with scores. Supports intersect (AND), unite (OR), subtract (NOT).
  * **NamespacePredicate** - Lazy filter that only materializes when combined with a PageSet.
  * **NegatedEntry** - Wraps another entry to represent logical NOT, allowing AND to convert it to set subtraction.

The result is a list of matching pages and their frequency scores.
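
Stack-based evaluation of an RPN query can be sketched in a few lines. Everything in this example is hypothetical: the operator names, the RPN layout, the ''Negated'' wrapper and the choice to add up scores are assumptions for illustration and do not describe the actual ''QueryEvaluator'' data structures. The sketch requires PHP 8 and only supports negation as the right-hand side of AND.

<code php>
// Hypothetical mini evaluator; operator names, RPN layout and score handling are assumptions.
final class Negated
{
    public function __construct(public array $pages) {}
}

function intersect(array $left, array|Negated $right): array
{
    if ($right instanceof Negated) {
        return array_diff_key($left, $right->pages);    // "a AND NOT b" becomes subtraction
    }
    $pages = array_intersect_key($left, $right);
    foreach ($pages as $page => $score) {
        $pages[$page] = $score + $right[$page];
    }
    return $pages;
}

function unite(array $left, array $right): array
{
    foreach ($right as $page => $score) {
        $left[$page] = ($left[$page] ?? 0) + $score;
    }
    return $left;
}

function evaluateRpn(array $rpn, array $termResults): array
{
    $stack = [];
    foreach ($rpn as $token) {
        if ($token === 'NOT') {
            // wrap the top entry instead of materializing "all other pages"
            $stack[] = new Negated(array_pop($stack));
        } elseif ($token === 'AND' || $token === 'OR') {
            $right = array_pop($stack);
            $left = array_pop($stack);
            $stack[] = $token === 'AND' ? intersect($left, $right) : unite($left, $right);
        } else {
            $stack[] = $termResults[$token] ?? [];      // operand: page => score for that word
        }
    }
    return array_pop($stack) ?? [];
}

$termResults = [
    'wiki' => ['start' => 5, 'syntax' => 2],
    'page' => ['start' => 1, 'sidebar' => 4],
    'draft' => ['syntax' => 1],
];
// (wiki OR page) AND NOT draft
$result = evaluateRpn(['wiki', 'page', 'OR', 'draft', 'NOT', 'AND'], $termResults);
</code>

Deferring the negation until it meets an AND avoids building the set of all pages that do not contain a word.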

Phrase verification reads the raw wiki text of candidate pages. Plugins can override this via [[devel:event:FULLTEXT_PHRASE_MATCH]].

==== Background Indexing ====

Pages are indexed asynchronously by the [[:taskrunner|TaskRunner]] which is triggered after each page view. It calls ''Indexer::addPage()'' for pages that need re-indexing and ''Indexer::deletePage()'' for pages that no longer exist on disk. The CLI tool ''bin/indexer.php'' can be used to index all pages at once.

The [[devel:event:INDEXER_TASKS_RUN]] event fires during background task execution, allowing plugins to hook their own maintenance tasks into the indexing cycle.

===== Plugin Events =====

The search system fires several events that plugins can use to extend or modify indexing and search behavior.

Indexing:
  * [[devel:event:INDEXER_VERSION_GET]] - Plugins add their version to force re-indexing when the plugin changes.
  * [[devel:event:INDEXER_PAGE_ADD]] - Modify page body, title, or metadata before it enters the index.
  * [[devel:event:INDEXER_TEXT_PREPARE]] - Pre-process text before tokenization.
  * [[devel:event:INDEXER_TASKS_RUN]] - Hook into the background task runner.

Searching:
  * [[devel:event:SEARCH_QUERY_FULLPAGE]] - Intercept or replace fulltext search.
  * [[devel:event:SEARCH_QUERY_PAGELOOKUP]] - Intercept or replace page name lookup.
  * [[devel:event:FULLTEXT_SNIPPET_CREATE]] - Provide custom search result snippets.
  * [[devel:event:FULLTEXT_PHRASE_MATCH]] - Override phrase matching logic.

===== Exceptions =====

All search-related exceptions extend ''SearchException'':

  * ''SearchException'' - Base class for search/index errors
  * ''IndexAccessException'' - Failed to read an index file
  * ''IndexWriteException'' - Failed to write to an index file
  * ''IndexLockException'' - Failed to acquire or release a lock
  * ''IndexUsageException'' - Incorrect API usage (e.g. writing without a lock)
  * ''IndexIntegrityException'' - Structural inconsistency detected in the indexes