xref: /dokuwiki/inc/Search/concept.txt (revision b9d7a61502960927577a8c9c2bee03484a94a17f)
====== Search Indexing ======

The indexing mechanism makes information that is normally distributed over several locations (e.g. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page metadata and possibly more in the future.

===== Core API =====

Most code interacting with the search index will use one of three high-level classes. All live in the ''\dokuwiki\Search'' namespace.

==== Indexer ====

The ''Indexer'' class manages the search index. It coordinates all collections and handles the actual writing.

  * ''addPage($page, $force)'' - Add or re-index a page. Handles locking, tokenization, and metadata internally.
  * ''deletePage($page, $force)'' - Remove a page from all indexes.
  * ''renamePage($oldpage, $newpage)'' - Update the page name in the entity index.
  * ''needsIndexing($page, $force)'' - Check whether a page needs (re-)indexing based on version and modification time.
  * ''getAllPages($existsFilter)'' - Return all indexed page names. Optionally filter to only pages that exist on disk.
  * ''getVersion()'' - Return the indexer version string, including plugin versions (see [[devel:event:INDEXER_VERSION_GET]]).
  * ''clear()'' - Delete all index files.
  * ''checkIntegrity()'' - Verify structural consistency across all indexes.
  * ''setLogger($callback)'' - Register a logging callback for progress output.

==== FulltextSearch ====

The ''FulltextSearch'' class handles fulltext search queries.

  * ''pageSearch($query, &$highlight, $sort, $after, $before)'' - Run a fulltext search. Returns matching pages as ''pageid => score''. The ''$highlight'' array is filled with terms to highlight. ''$sort'' can be ''"hits"'' (default) or ''"mtime"''. ''$after''/''$before'' filter by modification time.
  * ''snippet($id, $highlight)'' - Generate a search result snippet for a page.

==== MetadataSearch ====

The ''MetadataSearch'' class provides search operations on page metadata.

  * ''pageLookup($id, $in_ns, $in_title, $after, $before)'' - Quick search for page names. Optionally matches against the namespace and title.
  * ''lookupKey($key, $value, $func)'' - Find pages by metadata value. Supports exact match, wildcards (''*'' at start/end) or a custom comparison callback. When ''$value'' is a string the result is a flat list of page names. When it is an array, the result is keyed by each search value.
  * ''backlinks($id, $ignore_perms)'' - Find all pages linking to ''$id''.
  * ''mediause($id, $ignore_perms)'' - Find all pages using a media file.
  * ''getPages($key)'' - Return all indexed pages, optionally limited to those having a value for the given metadata key.

===== Internals =====

The following sections describe how the search index is structured and how the core classes work together to provide the indexing and search functionality. This is meant for developers who want to understand the inner workings of the search system or make use of it in their own plugins.

==== Indexes ====

An index is an individual index file that stores one kind of information, e.g. a list of all page names or a list of page-word frequencies.

Indexes are row based, and the line number itself carries information. Lines are counted from zero and the line number is referred to as ''rid'' in the code.

All index files are stored in the ''data/index'' directory. The file name is the name of the index with an ''idx'' extension. For example, the page name index is stored in ''data/index/page.idx''. Some indexes have additional suffixes (e.g. ''w3.idx'') to split the data into multiple files (see [[#Index File Splitting]] below).

Index files can be accessed through two classes:

  * ''\dokuwiki\Search\Index\FileIndex''
  * ''\dokuwiki\Search\Index\MemoryIndex''

Both classes expose the same API; they differ only in how they access the data.

A ''FileIndex'' reads through the index file line by line without ever loading the full file into memory. Each modification is written back to the index file directly.

A ''MemoryIndex'' loads the whole file into an internal array. Changes are only written back when the ''save()'' method is called explicitly. A memory index is faster but requires more memory.

Which class to use depends mostly on the size of the file.

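The trade-off between the two access styles can be illustrated with plain PHP file functions. This is only a sketch: the helpers ''readRowStreaming()'' and ''readRowInMemory()'' are hypothetical names chosen for this example and are not part of the ''FileIndex'' or ''MemoryIndex'' API.

<code php>
/**
 * Read one row by scanning the file line by line (FileIndex-style access).
 * Only a single line is held in memory at a time.
 */
function readRowStreaming(string $file, int $rid): ?string
{
    $fh = fopen($file, 'r');
    if ($fh === false) return null;
    $ln = 0;
    $result = null;
    while (($line = fgets($fh)) !== false) {
        if ($ln === $rid) {
            $result = rtrim($line, "\n");
            break;
        }
        $ln++;
    }
    fclose($fh);
    return $result;
}

/**
 * Load all rows into an array first (MemoryIndex-style access).
 * Faster for repeated lookups, but the whole file is kept in memory.
 */
function readRowInMemory(string $file, int $rid): ?string
{
    $rows = file($file, FILE_IGNORE_NEW_LINES);
    if ($rows === false) return null;
    return $rows[$rid] ?? null;
}

// demo with a throwaway file; rows are counted from zero
$file = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($file, "start\nwiki:syntax\nwiki:welcome\n");
echo readRowStreaming($file, 1), "\n"; // wiki:syntax
echo readRowInMemory($file, 2), "\n";  // wiki:welcome
unlink($file);
</code>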
Usually indexes are not accessed directly but through a collection. The collection manages which type of access to use.

Within an index two kinds of data can be stored per row:

  * A single value, e.g. an entity or a token
  * A list of tuples, e.g. a list of page IDs and frequencies

The former is straightforward: it is a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'' where the key is usually the ''rid'' in another index.

=== Index File Splitting ===

To improve memory efficiency and access speed, token and frequency indexes can be split into multiple physical files based on token length. The token length is appended as a suffix to the base index name to create the actual file name. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter tokens)
  * Resulting file: ''w3.idx''

> Note: token lengths are counted in bytes, not characters. For tokens containing multi-byte characters, the suffix therefore reflects the byte length, which may differ from the character count.

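The naming scheme and the byte-versus-character distinction can be shown in a few lines. The helper ''splitIndexFileName()'' is a hypothetical name used only for this illustration; it simply applies the "base name + byte length + ''.idx''" rule described above.

<code php>
/**
 * Build the file name of a split index from its base name and a token.
 * The suffix is the token length in bytes (strlen), not in characters.
 */
function splitIndexFileName(string $base, string $token): string
{
    return $base . strlen($token) . '.idx';
}

echo splitIndexFileName('w', 'wiki'), "\n";        // w4.idx
echo splitIndexFileName('i', 'wiki'), "\n";        // i4.idx
// "über" has 4 characters, but "ü" takes 2 bytes in UTF-8
echo splitIndexFileName('w', "\u{00FC}ber"), "\n"; // w5.idx
</code>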
In a fulltext collection with splitting enabled:

  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
  * and so on...

When splitting is disabled, a single file is used for each index (e.g. ''relation_media_w.idx'').

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (e.g. when a Term uses a wildcard).

=== Tuple Data Format ===

Tuple-based index rows store associations between keys (typically ''rid''s from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:
  * ''key'' - Usually the ''rid'' from another index (e.g. a page ID)
  * ''count'' - A numeric value (e.g. how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:
<code>
42*5:17*3:98*12
</code>

This means:
  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

A frequency of 1 is not written out; the key is stored without a count. For example:

<code>
42*5:17:98
</code>

This record is interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 1 time
  * Entity with RID 98 contains this word 1 time

The ''TupleOps'' class provides utility methods for working with tuple records:
  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record

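To make the record format concrete, the following stand-alone sketch parses and rebuilds such a record. It is not the ''TupleOps'' implementation: ''parseTupleRecord()'' and ''formatTupleRecord()'' are hypothetical helpers that only follow the format rules described above and assume well-formed input.

<code php>
/**
 * Parse a tuple record such as "42*5:17:98" into [key => count].
 * A tuple without an explicit "*count" has an implied count of 1.
 */
function parseTupleRecord(string $record): array
{
    $result = [];
    if ($record === '') return $result;
    foreach (explode(':', $record) as $tuple) {
        if ($tuple === '') continue;
        $parts = explode('*', $tuple, 2);
        $result[$parts[0]] = isset($parts[1]) ? (int)$parts[1] : 1;
    }
    return $result;
}

/**
 * Turn [key => count] back into a record, omitting counts of 1.
 */
function formatTupleRecord(array $tuples): string
{
    $parts = [];
    foreach ($tuples as $key => $count) {
        $parts[] = $count === 1 ? (string)$key : $key . '*' . $count;
    }
    return implode(':', $parts);
}

$tuples = parseTupleRecord('42*5:17:98');
print_r($tuples);                      // 42 => 5, 17 => 1, 98 => 1
echo array_sum($tuples), "\n";         // 7 (sum of all counts)
$tuples[17] = 3;                       // update a single key
echo formatTupleRecord($tuples), "\n"; // 42*5:17*3:98
</code>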
==== Collections ====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. For example, fulltext search over page contents is a use case covered by a collection.

> Please note: because "index" has a specific meaning in our context (see above), you should avoid using that word when you are actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

A collection manages up to four indexes:

  * **entity** - The main entity that will be the result of a search, e.g. a page. entity.RID -> entity
  * **token** - The actual information strewn across the entities, e.g. words. token.RID -> token
  * **frequency** - Maps tokens to entities and records their frequency. token.RID -> entity.RID*frequency:...
  * **reverse** - Records which tokens are assigned to each entity. Used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up.

The reverse index format depends on whether the collection uses split indexes:
  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''

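The two reverse record formats can be read with one small routine. This is a sketch under the format rules given above; ''parseReverseRecord()'' is a hypothetical helper, not a method of the collection classes, and it assumes well-formed records.

<code php>
/**
 * Parse a reverse index record into a list of [tokenLength, tokenId] pairs.
 * Non-split collections do not store the token length, so it is null there.
 */
function parseReverseRecord(string $record, bool $split): array
{
    $entries = [];
    if ($record === '') return $entries;
    foreach (explode(':', $record) as $entry) {
        if ($split) {
            // "tokenLength*tokenId": the length selects the split file
            [$length, $tokenId] = explode('*', $entry, 2);
            $entries[] = [(int)$length, (int)$tokenId];
        } else {
            // "tokenId": all tokens live in one file
            $entries[] = [null, (int)$entry];
        }
    }
    return $entries;
}

// split: token 42 is found in the length-4 files, token 87 in the length-8 files
print_r(parseReverseRecord('4*42:8*87', true));
// non-split: only token IDs
print_r(parseReverseRecord('42:87', false));
</code>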
Collections have two independent properties: a type and whether they use split indexes or not.

The **collection type** determines how tokens relate to entities:

  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words-on-pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, so all frequencies are 1. Searches do not care about the frequency; they are only interested in whether a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example, a page has exactly one title. Direct collections only use entity and token index files (entity.RID === token.RID), no frequency or reverse indexes.

Independently of the collection type, a collection can use **split or non-split token indexes**. See the [[#Index File Splitting]] section above.

^ Name                   ^ Type      ^ Split? ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^
| FullText               | frequency | yes    | page   | w*                    | i*                    | pageword              |
| Title                  | direct    | no     | page   | title                 | -                     | -                     |
| MetaRelationMedia      | lookup    | no     | page   | relation_media_w      | relation_media_i      | relation_media_p      |
| MetaRelationReferences | lookup    | no     | page   | relation_references_w | relation_references_i | relation_references_p |

=== Writing data ===

''addEntity($entity, $tokens)'' is the main method for writing data to a collection. It replaces all previously stored tokens for the given entity. An empty token list removes the entity's data. The collection must be locked before calling this method.

<code php>
// $pageIndex and $words (the tokens of the page) are assumed to be set up by the caller
$collection = new PageFulltextCollection($pageIndex);
$collection->lock();
$collection->addEntity('wiki:page', $words);
$collection->unlock();
</code>

Internally, ''addEntity()'' reads the reverse index to find the entity's old tokens, resolves the new tokens to IDs (creating them in the token index if needed), merges old and new, and updates the frequency and reverse indexes accordingly. Tokens no longer present are automatically removed from the frequency index.

For direct collections, ''addEntity()'' simply writes the first token at the entity's position in the token index.

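The update steps described above can be sketched with plain arrays standing in for the frequency and reverse indexes. This is a simplified model, not the actual ''addEntity()'' code: ''replaceEntityTokens()'' is a hypothetical function, tokens are assumed to be resolved to IDs already, and split files are ignored.

<code php>
/**
 * Replace all tokens stored for an entity.
 *
 * @param array $frequency [tokenId => [entityId => count]] (frequency index)
 * @param array $reverse   [entityId => [tokenId, ...]]     (reverse index)
 * @param int   $entityId  the entity being (re-)indexed
 * @param int[] $tokenIds  token IDs found in the entity, duplicates allowed
 */
function replaceEntityTokens(array &$frequency, array &$reverse, int $entityId, array $tokenIds): void
{
    $newCounts = array_count_values($tokenIds);
    $oldTokens = $reverse[$entityId] ?? [];

    // the old reverse record tells us which tokens have to be cleaned up
    foreach (array_diff($oldTokens, array_keys($newCounts)) as $tokenId) {
        unset($frequency[$tokenId][$entityId]);
    }

    // add or update the tokens that are present now
    foreach ($newCounts as $tokenId => $count) {
        $frequency[$tokenId][$entityId] = $count;
    }

    $reverse[$entityId] = array_keys($newCounts);
}

$frequency = [];
$reverse = [];
replaceEntityTokens($frequency, $reverse, 10, [42, 42, 87]); // first indexing
replaceEntityTokens($frequency, $reverse, 10, [42, 99]);     // entity was edited
print_r($frequency); // 42 => [10 => 1], 87 => [], 99 => [10 => 1]
print_r($reverse);   // 10 => [42, 99]
</code>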
=== Reading data ===

Collections provide some basic information retrieval methods, but they are not meant for searching.

  * ''getEntitiesWithData()'' - Return all entity names that have data in this collection.
  * For direct collections, ''getToken($entity)'' retrieves the single token stored for an entity (e.g. a page title).

Searching across a collection is done through the ''CollectionSearch'' class (see below).

==== Locking ====

Only one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.

Indexes are opened in readonly mode by default. Passing ''$isWritable = true'' to the constructor (or calling ''lock()'' later) acquires a lock and enables writing. Calling ''unlock()'' releases it.

The ''Lock'' class is a static registry with reference counting. ''Lock::acquire($name)'' creates a filesystem lock directory. Multiple calls within the same process share a single lock via reference counting. ''Lock::release($name)'' decrements the count and removes the directory when it reaches zero. Stale locks older than 5 minutes are automatically broken.

Collections call ''lock()'' to acquire locks for all their indexes at once, and ''unlock()'' to release them.

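The combination of a lock directory, per-process reference counting and stale lock detection can be sketched as follows. The class ''DirLock'', its lock path in the system temp directory and the use of the directory's modification time as the lock age are assumptions made for this example; the real ''Lock'' class may differ in all of these details.

<code php>
/**
 * Hypothetical directory-based lock registry with reference counting.
 */
class DirLock
{
    /** @var int[] number of acquisitions per lock name within this process */
    protected static $counts = [];

    /** seconds after which a leftover lock directory is considered stale */
    const STALE_AFTER = 300;

    public static function path(string $name): string
    {
        return sys_get_temp_dir() . '/' . $name . '.lock';
    }

    public static function acquire(string $name): bool
    {
        if (!empty(self::$counts[$name])) {
            self::$counts[$name]++; // already held by this process
            return true;
        }
        $dir = self::path($name);
        clearstatcache();
        // break a stale lock, e.g. one left behind by a crashed process
        if (is_dir($dir) && time() - filemtime($dir) > self::STALE_AFTER) {
            @rmdir($dir);
        }
        // mkdir() fails if the directory already exists, so only one process wins
        if (!@mkdir($dir)) return false;
        self::$counts[$name] = 1;
        return true;
    }

    public static function release(string $name): void
    {
        if (empty(self::$counts[$name])) return;
        if (--self::$counts[$name] === 0) {
            unset(self::$counts[$name]);
            @rmdir(self::path($name));
        }
    }
}

$name = 'demo_' . getmypid();
DirLock::acquire($name); // creates the lock directory
DirLock::acquire($name); // same process: only the counter is incremented
DirLock::release($name); // counter drops to 1, the directory stays
clearstatcache();
var_dump(is_dir(DirLock::path($name))); // bool(true)
DirLock::release($name); // counter reaches 0, the directory is removed
clearstatcache();
var_dump(is_dir(DirLock::path($name))); // bool(false)
</code>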
==== Tokenizer ====

The ''Tokenizer'' class (in ''\dokuwiki\Search'') is responsible for splitting text into indexable tokens.

''Tokenizer::getWords($text, $wc)'' splits the given text into an array of lowercase tokens. Tokens shorter than the minimum word length (default 2, configurable via ''IDX_MINWORDLENGTH'') are discarded, as are language-specific stop words loaded from ''inc/lang/<lang>/stopwords.txt''. Asian characters receive special treatment: they are separated into individual characters and measured with a length function that accounts for multi-byte sequences.

When ''$wc'' is true, wildcard characters (''*'') are preserved in the output. This is used by the query parser.

The [[devel:event:INDEXER_TEXT_PREPARE]] event fires before tokenization, allowing plugins to pre-process the text.

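The basic filtering steps (lowercasing, minimum length, stop words, optional wildcards) can be shown with a heavily simplified tokenizer. ''simpleGetWords()'' is a hypothetical function written for this example; it handles ASCII text only, takes the stop words as a parameter instead of loading them from a file, and leaves out the special handling of Asian characters and the plugin event.

<code php>
/**
 * Simplified tokenizer: lowercases, splits on non-word characters and
 * drops short tokens and stop words. ASCII only.
 */
function simpleGetWords(string $text, bool $wc = false, int $minLength = 2, array $stopwords = []): array
{
    $text = strtolower($text);
    // keep "*" as part of a token only when wildcards are requested
    $pattern = $wc ? '/[^a-z0-9*]+/' : '/[^a-z0-9]+/';
    $words = [];
    foreach (preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // wildcards do not count towards the token length
        if (strlen(str_replace('*', '', $word)) < $minLength) continue;
        if (in_array($word, $stopwords, true)) continue;
        $words[] = $word;
    }
    return $words;
}

// "a" is too short, "the" and "is" are stop words
print_r(simpleGetWords('The DokuWiki is a Wiki!', false, 2, ['the', 'is'])); // dokuwiki, wiki
// with $wc = true the wildcard survives
print_r(simpleGetWords('doku* search', true));                               // doku*, search
</code>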
==== CollectionSearch and Terms ====

The ''CollectionSearch'' class executes searches against any collection. It provides two APIs:

=== Term-based search ===

Used by fulltext search. Terms are validated against the minimum token length.

<code php>
$search = new CollectionSearch($collection);
$search->addTerm('wiki*');
$terms = $search->execute();
foreach ($terms as $term) {
    // $term->getEntityFrequencies() returns [entityName => frequency, ...]
}
</code>

''addTerm()'' returns a ''Term'' object. After ''execute()'', each Term holds the matching entities and their aggregated frequencies.

A ''Term'' represents a single search query component that can match one or more tokens in an index. Terms can include wildcards using the ''*'' character:
  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g. "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g. "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g. "wiki", "dokuwiki", "wikitext")

Terms organize their matching tokens by length. This is crucial for working with split indexes: a term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long. Each length group can be looked up in the corresponding suffixed token index, allowing efficient searching without loading irrelevant files.

During a search operation, Terms:
  - Collect all token IDs that match the term pattern (organized by token length)
  - Look up which entities contain those tokens
  - Aggregate the frequencies across all matching tokens
  - Map entity IDs to entity names for the final result

For example, searching for ''wiki*'' might find:
  * Token "wiki" (ID 42) appears 5 times on page "start" (ID 10)
  * Token "wikitext" (ID 87) appears 3 times on page "start" (ID 10)
  * Term result: "start" matches with total frequency 8

Terms are validated on creation. The base term (without wildcards) must meet the minimum token length configured in the Tokenizer. Numeric terms are exempt from this check. Terms that are too short throw a ''SearchException''.

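Wildcard matching, validation and frequency aggregation can be illustrated on a small in-memory structure. The functions ''termMatches()'' and ''searchTerm()'' are hypothetical and not part of ''Term'' or ''CollectionSearch''; the index is modelled as a plain ''token => [entity => frequency]'' array, token and entity IDs are skipped, and a generic ''InvalidArgumentException'' stands in for the ''SearchException''.

<code php>
/**
 * Check whether a token matches a term that may start and/or end with "*".
 */
function termMatches(string $term, string $token): bool
{
    $base = trim($term, '*');
    $startWild = substr($term, 0, 1) === '*';
    $endWild = substr($term, -1) === '*';
    if ($startWild && $endWild) return strpos($token, $base) !== false;
    if ($endWild) return strncmp($token, $base, strlen($base)) === 0;
    if ($startWild) return substr($token, -strlen($base)) === $base;
    return $token === $base;
}

/**
 * Aggregate frequencies per entity over all tokens matching the term.
 *
 * @param array $index [token => [entityName => frequency]]
 */
function searchTerm(string $term, array $index, int $minLength = 2): array
{
    $base = trim($term, '*');
    // numeric terms are exempt from the minimum length check
    if (strlen($base) < $minLength && !is_numeric($base)) {
        throw new InvalidArgumentException("Term '$term' is too short");
    }
    $result = [];
    foreach ($index as $token => $entities) {
        $token = (string)$token;
        // tokens shorter than the base term can never match
        if (strlen($token) < strlen($base)) continue;
        if (!termMatches($term, $token)) continue;
        foreach ($entities as $entity => $frequency) {
            $result[$entity] = ($result[$entity] ?? 0) + $frequency;
        }
    }
    return $result;
}

$index = [
    'wiki'     => ['start' => 5],
    'wikitext' => ['start' => 3],
    'dokuwiki' => ['start' => 2, 'about' => 1],
];
print_r(searchTerm('wiki*', $index)); // start => 8
print_r(searchTerm('*wiki', $index)); // start => 7, about => 1
</code>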
=== Lookup search ===

Used by metadata search. There are no minimum length restrictions. It supports exact match, wildcards, and custom callbacks.

<code php>
$search = new CollectionSearch($collection);
$result = $search->lookup(['targetpage', 'other*'], $callbackOrNull);
// $result = ['targetpage' => ['page1', 'page2'], 'other*' => ['page3']]
</code>

==== Fulltext Search Query Processing ====

For fulltext searches a proper query language is supported (see [[:Search]]). Queries go through two stages:

=== QueryParser ===

''QueryParser::convert($query)'' parses a search query string into an intermediate representation. It supports:

  * Individual words and phrases (quoted strings)
  * Namespace filtering with ''@ns:'' or ''ns:'' and ''-ns:'' for exclusion
  * Negation with ''-'' prefix
  * Boolean ''OR'' between terms
  * Grouping with parentheses

The output includes an array in Reverse Polish Notation (RPN) used by the evaluator, plus extracted highlights, word lists, phrase lists, and namespace filters.

=== QueryEvaluator ===

''QueryEvaluator'' takes the RPN array and the ''Term'' results from ''CollectionSearch'' and evaluates the boolean logic. It uses typed stack entries during processing:

  * **PageSet** - Concrete set of pages with scores. Supports intersect (AND), unite (OR), subtract (NOT).
  * **NamespacePredicate** - Lazy filter that only materializes when combined with a PageSet.
  * **NegatedEntry** - Wraps another entry to represent logical NOT, allowing AND to convert it to set subtraction.

The result is a list of matching pages and their frequency scores.

Phrase verification reads the raw wiki text of candidate pages. Plugins can override this via [[devel:event:FULLTEXT_PHRASE_MATCH]].

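The stack-based evaluation of an RPN query can be sketched in a compact form. Everything in this example is an assumption made for illustration: the function ''evaluateRpn()'', the RPN token format (plain term strings plus the operators ''AND'', ''OR'' and ''NOT''), and the choice to add up scores when sets are combined. Namespace predicates and phrases are left out, and a negated entry can only be consumed by ''AND'', where it becomes a set subtraction.

<code php>
/**
 * Evaluate a query in Reverse Polish Notation against per-term results.
 *
 * @param array $rpn     term strings and the operators "AND", "OR", "NOT"
 * @param array $results [term => [page => score]]
 * @return array [page => score]
 */
function evaluateRpn(array $rpn, array $results): array
{
    $stack = [];
    foreach ($rpn as $token) {
        if ($token === 'NOT') {
            // only mark the entry as negated, nothing is computed yet
            $entry = array_pop($stack);
            $entry['negated'] = !$entry['negated'];
            $stack[] = $entry;
        } elseif ($token === 'AND' || $token === 'OR') {
            $b = array_pop($stack);
            $a = array_pop($stack);
            if ($a['negated']) {
                [$a, $b] = [$b, $a]; // keep a positive set on the left
            }
            if ($a['negated'] || ($token === 'OR' && $b['negated'])) {
                throw new LogicException('This sketch cannot combine these negated entries');
            }
            if ($token === 'OR') {
                $pages = $a['pages']; // unite
                foreach ($b['pages'] as $page => $score) {
                    $pages[$page] = ($pages[$page] ?? 0) + $score;
                }
            } elseif ($b['negated']) {
                $pages = array_diff_key($a['pages'], $b['pages']); // AND NOT: subtract
            } else {
                $pages = []; // intersect
                foreach (array_intersect_key($a['pages'], $b['pages']) as $page => $score) {
                    $pages[$page] = $score + $b['pages'][$page];
                }
            }
            $stack[] = ['negated' => false, 'pages' => $pages];
        } else {
            // operand: look up the pages found for this term
            $stack[] = ['negated' => false, 'pages' => $results[$token] ?? []];
        }
    }
    $final = array_pop($stack);
    return ($final === null || $final['negated']) ? [] : $final['pages'];
}

$results = [
    'wiki'   => ['start' => 5, 'playground' => 2],
    'draft'  => ['playground' => 1],
    'syntax' => ['start' => 1, 'wiki:syntax' => 4],
];
// "wiki -draft" expressed as RPN
print_r(evaluateRpn(['wiki', 'draft', 'NOT', 'AND'], $results)); // start => 5
// "wiki OR syntax" expressed as RPN
print_r(evaluateRpn(['wiki', 'syntax', 'OR'], $results)); // start => 6, playground => 2, wiki:syntax => 4
</code>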
==== Background Indexing ====

Pages are indexed asynchronously by the [[:taskrunner|TaskRunner]] which is triggered after each page view. It calls ''Indexer::addPage()'' for pages that need re-indexing and ''Indexer::deletePage()'' for pages that no longer exist on disk. The CLI tool ''bin/indexer.php'' can be used to index all pages at once.

The [[devel:event:INDEXER_TASKS_RUN]] event fires during background task execution, allowing plugins to hook their own maintenance tasks into the indexing cycle.

===== Plugin Events =====

The search system fires several events that plugins can use to extend or modify indexing and search behavior.

Indexing:
  * [[devel:event:INDEXER_VERSION_GET]] - Plugins add their version to force re-indexing when the plugin changes.
  * [[devel:event:INDEXER_PAGE_ADD]] - Modify page body, title, or metadata before it enters the index.
  * [[devel:event:INDEXER_TEXT_PREPARE]] - Pre-process text before tokenization.
  * [[devel:event:INDEXER_TASKS_RUN]] - Hook into the background task runner.

Searching:
  * [[devel:event:SEARCH_QUERY_FULLPAGE]] - Intercept or replace fulltext search.
  * [[devel:event:SEARCH_QUERY_PAGELOOKUP]] - Intercept or replace page name lookup.
  * [[devel:event:FULLTEXT_SNIPPET_CREATE]] - Provide custom search result snippets.
  * [[devel:event:FULLTEXT_PHRASE_MATCH]] - Override phrase matching logic.

===== Exceptions =====

All search-related exceptions extend ''SearchException'':

  * ''SearchException'' - Base class for search/index errors
  * ''IndexAccessException'' - Failed to read an index file
  * ''IndexWriteException'' - Failed to write to an index file
  * ''IndexLockException'' - Failed to acquire or release a lock
  * ''IndexUsageException'' - Incorrect API usage (e.g. writing without a lock)
  * ''IndexIntegrityException'' - Structural inconsistency detected in the indexes