====== Search Indexing ======

The indexing mechanism is meant to make information that is normally distributed over several locations (e.g. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page metadata and possibly more in the future.

===== Core API =====

Most code interacting with the search index will use one of three high-level classes. All live in the ''\dokuwiki\Search'' namespace.

==== Indexer ====

The ''Indexer'' class manages the search index. It coordinates all collections and handles the actual writing.

  * ''addPage($page, $force)'' - Add or re-index a page. Handles locking, tokenization, and metadata internally.
  * ''deletePage($page, $force)'' - Remove a page from all indexes.
  * ''renamePage($oldpage, $newpage)'' - Update the page name in the entity index.
  * ''needsIndexing($page, $force)'' - Check whether a page needs (re-)indexing based on version and modification time.
  * ''getAllPages($existsFilter)'' - Return all indexed page names. Optionally filter to only pages that exist on disk.
  * ''getVersion()'' - Return the indexer version string, including plugin versions (see [[devel:event:INDEXER_VERSION_GET]]).
  * ''clear()'' - Delete all index files.
  * ''checkIntegrity()'' - Verify structural consistency across all indexes.
  * ''setLogger($callback)'' - Register a logging callback for progress output.

==== FulltextSearch ====

The ''FulltextSearch'' class handles fulltext search queries.

  * ''pageSearch($query, &$highlight, $sort, $after, $before)'' - Run a fulltext search. Returns matching pages as ''pageid => score''. The ''$highlight'' array is filled with terms to highlight. ''$sort'' can be ''"hits"'' (default) or ''"mtime"''. ''$after''/''$before'' filter by modification time.
  * ''snippet($id, $highlight)'' - Generate a search result snippet for a page.

==== MetadataSearch ====

The ''MetadataSearch'' class provides search operations on page metadata.

  * ''pageLookup($id, $in_ns, $in_title, $after, $before)'' - Quick search for page names. Optionally matches against the namespace and title.
  * ''lookupKey($key, $value)'' - Find pages by metadata value. Supports exact match and wildcards (''*'' at start/end). When ''$value'' is a string the result is a flat list of page names. When it is an array, the result is keyed by each search value.
  * ''backlinks($id, $ignore_perms)'' - Find all pages linking to ''$id''.
  * ''mediause($id, $ignore_perms)'' - Find all pages using a media file.
  * ''getPages($key)'' - Return all indexed pages, optionally limited to those having a value for the given metadata key.

===== Internals =====

The following sections describe how the search index is structured and how the core classes work together to provide the indexing and search functionality.
This is meant for developers who want to understand the inner workings of the search system or make use of it in their own plugins.

==== Indexes ====

An index is an individual index file that stores one kind of information, e.g. a list of all page names or a list of page-word frequencies.

Indexes are row based, and the line number of a row is itself significant information. Lines are counted from zero and referred to as ''rid'' in the code.

All index files are stored in the ''data/index'' directory. The file name is the name of the index with an ''idx'' extension. For example, the page name index is stored in ''data/index/page.idx''. Some indexes have additional suffixes (e.g. ''w3.idx'') to split the data into multiple files (see [[#Index File Splitting]] below).

Index files can be accessed through two classes:

  * ''\dokuwiki\Search\Index\FileIndex''
  * ''\dokuwiki\Search\Index\MemoryIndex''

Both classes expose the same API; the only difference is how they access the data.

A FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.

The MemoryIndex loads the whole file into an internal array. Changes can be saved explicitly via the ''save()'' method and are also auto-saved when the index is unlocked or destroyed. This prevents data loss when indexes are used in tandem, where a new RID in one index may already be referenced by another. A memory index is faster but requires more memory.
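
The two access strategies can be illustrated with plain PHP file functions. This is only a sketch of the general idea: ''readRowStreaming()'' and ''loadRows()'' are hypothetical helpers, not the actual ''FileIndex'' or ''MemoryIndex'' implementation, and the file content is made up.

<code php>
// Illustrative only: these helpers are NOT DokuWiki's FileIndex/MemoryIndex classes.

/** Streaming access: scan line by line until the requested rid is reached. */
function readRowStreaming(string $file, int $rid): ?string
{
    $fh = fopen($file, 'r');
    if ($fh === false) return null;

    $current = 0;
    $result = null;
    while (($line = fgets($fh)) !== false) {
        if ($current === $rid) {
            $result = rtrim($line, "\n");
            break;
        }
        $current++;
    }
    fclose($fh);
    return $result;
}

/** In-memory access: load every row into an array keyed by rid. */
function loadRows(string $file): array
{
    $rows = file($file, FILE_IGNORE_NEW_LINES);
    return $rows === false ? [] : $rows;
}

// A tiny example index: rid 0 is "start", rid 1 is "wiki:syntax", rid 2 is "wiki:welcome"
$file = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($file, "start\nwiki:syntax\nwiki:welcome\n");

$streamed = readRowStreaming($file, 1); // only reads the first two lines
$rows = loadRows($file);                // holds the whole file in memory
unlink($file);
</code>

The streaming variant keeps memory usage constant but has to scan from the top for every lookup, while the in-memory variant pays the loading cost once and then answers each lookup with an array access.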

Which method to use depends mostly on the size of the file.

Usually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.

Within an index two kinds of data can be stored per row:

  * A single value, e.g. an entity or a token
  * A list of tuples, e.g. a list of pageIDs and frequencies

The former is straightforward: it is a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'' where key is usually the ''rid'' in another index.

=== Index File Splitting ===

To improve memory efficiency and access speed, token and frequency indexes can be split into multiple physical files using suffixes based on token length. A suffix parameter is appended to the base index name to create the actual filename. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter tokens)
  * Resulting file: ''w3.idx''

> Note: token lengths are counted in bytes, not characters. This means that for languages with multi-byte characters, the suffixes will reflect the byte length of the tokens, which may differ from the character count.
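
The effect of counting bytes can be seen in a small sketch. ''splitIndexFile()'' is a hypothetical helper written for this illustration and is not part of the DokuWiki API; it only combines a base name with the byte length of a token as described above.

<code php>
// Hypothetical helper, not part of DokuWiki's API.
function splitIndexFile(string $base, string $token): string
{
    // strlen() counts bytes, which is what the split suffix is based on
    return $base . strlen($token) . '.idx';
}

$ascii  = splitIndexFile('w', 'wiki');        // 4 characters, 4 bytes
$umlaut = splitIndexFile('w', "f\u{00FC}r");  // "für": 3 characters, but 4 bytes in UTF-8
$freq   = splitIndexFile('i', 'wiki');        // same suffix for the frequency index
</code>

Both tokens end up in the file with suffix ''4'', even though one of them has only three characters.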

In a fulltext collection with splitting enabled:

  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
  * and so on...

When splitting is disabled, a single file is used for each index (e.g. ''relation_media_w.idx'').

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (e.g. when a Term is using a wildcard).

=== Tuple Data Format ===

Tuple-based index rows store associations between keys (typically ''rid''s from another index) and numeric values (typically frequency counts).
The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:
  * ''key'' - Usually the ''rid'' from another index (e.g., a page ID)
  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:
<code>
42*5:17*3:98*12
</code>

This means:
  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

A frequency of 1 is not stored explicitly; the key is written without a count.
For example:

<code>
42*5:17:98
</code>

This row is interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 1 time
  * Entity with RID 98 contains this word 1 time

The ''TupleOps'' class provides utility methods for working with tuple records:
  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record

==== Collections ====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. For example, fulltext search for page contents is a use case covered by a collection.

> Please note: because index has a specific meaning in our context (see above), you should avoid using that word when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

A collection manages up to four indexes:

  * **entity** - The main entity that will be the result of a search, e.g. a page. entity.RID -> entity
  * **token** - The actual information strewn across the entities, e.g. words. token.RID -> token
  * **frequency** - Maps tokens to entities and records their frequency. token.RID -> entity.RID*frequency:...
  * **reverse** - Records which tokens are assigned to each entity. Used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up.

The reverse index format depends on whether the collection uses split indexes:
  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''

Collections have two independent properties: a type and whether they use split indexes or not.

The **collection type** determines how tokens relate to entities:

  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, so all frequencies are 1. Searches do not care about the frequency but only whether a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example a page has exactly one title. Direct collections only use entity and token index files (entity.RID === token.RID), no frequency or reverse indexes.
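
The tuple record format and the difference between counting and deduplicating tokens can be sketched in a few lines. All three functions below are hypothetical helpers written for this illustration; DokuWiki's own logic lives in ''TupleOps'' and the collection classes and may differ in detail.

<code php>
// Hypothetical helpers for illustration only.

/** Parse "key*count:key:..." into [key => count]; a missing count means 1. */
function parseTupleRecord(string $record): array
{
    $result = [];
    if ($record === '') return $result;
    foreach (explode(':', $record) as $tuple) {
        $parts = explode('*', $tuple, 2);
        $result[$parts[0]] = isset($parts[1]) ? (int) $parts[1] : 1;
    }
    return $result;
}

/** Build a record from [key => count], leaving out counts of 1. */
function buildTupleRecord(array $tuples): string
{
    $parts = [];
    foreach ($tuples as $key => $count) {
        $parts[] = $count === 1 ? (string) $key : $key . '*' . $count;
    }
    return implode(':', $parts);
}

/** Frequency collections count tokens; lookup collections only deduplicate. */
function countTokens(array $tokens, bool $lookup): array
{
    $counts = array_count_values($tokens);
    return $lookup ? array_fill_keys(array_keys($counts), 1) : $counts;
}

$parsed    = parseTupleRecord('42*5:17:98');
$record    = buildTupleRecord($parsed);
$frequency = countTokens(['wiki', 'page', 'wiki'], false);
$lookup    = countTokens(['wiki', 'page', 'wiki'], true);
</code>

Because a count of 1 is implicit, a record built from lookup-style input consists of bare keys only.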

Independently of the collection type, a collection can use **split or non-split token indexes**. See the [[#Index File Splitting]] section above.

^ Name ^ Type ^ Split? ^ Entity ^ Token ^ Frequency ^ Reverse ^
| FullText | frequency | yes | page | w* | i* | pageword |
| Title | direct | no | page | title | - | - |
| MetaRelationMedia | lookup | no | page | relation_media_w | relation_media_i | relation_media_p |
| MetaRelationReferences | lookup | no | page | relation_references_w | relation_references_i | relation_references_p |

=== Writing data ===

''addEntity($entity, $tokens)'' is the main method for writing data to a collection. It replaces all previously stored tokens for the given entity. An empty token list removes the entity's data. The collection must be locked before calling this method.

<code php>
$collection = new PageFulltextCollection($pageIndex);
$collection->lock();
$collection->addEntity('wiki:page', $words);
$collection->unlock();
</code>

Internally, ''addEntity()'' reads the reverse index to find the entity's old tokens, resolves the new tokens to IDs (creating them in the token index if needed), merges old and new, and updates the frequency and reverse indexes accordingly. Tokens no longer present are automatically removed from the frequency index.

For direct collections, ''addEntity()'' simply writes the first token at the entity's position in the token index.
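
The role of the reverse index during an update can be shown with a simplified in-memory model. This is a sketch under several assumptions: the collection is non-split, the three indexes are plain PHP arrays, old entries are removed before the new ones are written (rather than merged), and ''sketchAddEntity()'' is a hypothetical name, not the real method.

<code php>
// Simplified, non-split, in-memory model; all names here are hypothetical.
$tokenIndex = [];     // token.RID => token
$frequencyIndex = []; // token.RID => [entity.RID => frequency]
$reverseIndex = [];   // entity.RID => [token.RID, ...]

function sketchAddEntity(
    int $entityRid,
    array $tokens,
    array &$tokenIndex,
    array &$frequencyIndex,
    array &$reverseIndex
): void {
    // 1. use the reverse index to remove the entity from its old tokens
    foreach ($reverseIndex[$entityRid] ?? [] as $tokenRid) {
        unset($frequencyIndex[$tokenRid][$entityRid]);
    }

    // 2. resolve new tokens to IDs (creating missing ones) and store frequencies
    $newTokenRids = [];
    foreach (array_count_values($tokens) as $token => $count) {
        $tokenRid = array_search((string) $token, $tokenIndex, true);
        if ($tokenRid === false) {
            $tokenIndex[] = (string) $token;
            $tokenRid = array_key_last($tokenIndex);
        }
        $frequencyIndex[$tokenRid][$entityRid] = $count;
        $newTokenRids[] = $tokenRid;
    }

    // 3. remember the new token list for the next update
    $reverseIndex[$entityRid] = $newTokenRids;
}

// index entity 0, then re-index it with different content
sketchAddEntity(0, ['wiki', 'page', 'wiki'], $tokenIndex, $frequencyIndex, $reverseIndex);
sketchAddEntity(0, ['wiki', 'text'], $tokenIndex, $frequencyIndex, $reverseIndex);
</code>

After the second call the token "page" is still present in the token index, but its frequency row no longer references entity 0. Without the reverse index, finding that stale reference would require scanning every frequency row.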

=== Reading data ===

Collections provide some basic information retrieval methods, but they are not meant for searching.

  * ''getEntitiesWithData()'' - Return all entity names that have data in this collection.
  * For direct collections, ''getToken($entity)'' retrieves the single token stored for an entity (e.g. a page title).

Searching across a collection is done through the ''CollectionSearch'' class (see below).

==== Locking ====

Only one process may write to an index at any time. A locking mechanism ensures this.

Indexes are opened in read-only mode by default. Passing ''$isWritable = true'' to the constructor (or calling ''lock()'' later) acquires a lock and enables writing. Calling ''unlock()'' releases it.

The ''Lock'' class is a static registry with reference counting. ''Lock::acquire($name)'' creates a filesystem lock directory. Multiple calls within the same process share a single lock via reference counting. ''Lock::release($name)'' decrements the count and removes the directory when it reaches zero. Stale locks older than 5 minutes are automatically broken.

Collections call ''lock()'' to acquire locks for all their indexes at once, and ''unlock()'' to release them.

==== Tokenizer ====

The ''Tokenizer'' class (in ''\dokuwiki\Search'') is responsible for splitting text into indexable tokens.
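
As a rough illustration of what such tokenization involves, the following sketch lowercases a text, splits it on non-alphanumeric characters and drops short tokens and stop words. It is a deliberately simplified, ASCII-only stand-in: ''sketchTokenize()'' and its parameters are hypothetical, and the real ''Tokenizer'' handles multi-byte text and Asian characters, which this sketch does not.

<code php>
// Hypothetical, ASCII-only simplification; not DokuWiki's Tokenizer implementation.
function sketchTokenize(
    string $text,
    array $stopWords,
    int $minLength = 2,
    bool $keepWildcards = false
): array {
    // optionally treat "*" as part of a word so wildcards survive the split
    $pattern = $keepWildcards ? '/[^a-z0-9*]+/' : '/[^a-z0-9]+/';
    $words = preg_split($pattern, strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($words as $word) {
        // measure the word without wildcard characters
        if (strlen(str_replace('*', '', $word)) < $minLength) continue;
        if (in_array($word, $stopWords, true)) continue;
        $tokens[] = $word;
    }
    return $tokens;
}

$tokens = sketchTokenize('The DokuWiki search is a fast index', ['the', 'is']);
$query  = sketchTokenize('wiki* syntax', [], 2, true);
</code>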

''Tokenizer::getWords($text, $wc)'' splits the given text into an array of lowercase tokens. Tokens shorter than the minimum word length (default 2, configurable via ''IDX_MINWORDLENGTH'') are discarded, as are language-specific stop words loaded from ''inc/lang/<lang>/stopwords.txt''. Asian characters receive special treatment: they are separated into individual characters and measured with a length function that accounts for multi-byte sequences.

When ''$wc'' is true, wildcard characters (''*'') are preserved in the output. This is used by the query parser.

The [[devel:event:INDEXER_TEXT_PREPARE]] event fires before tokenization, allowing plugins to pre-process the text.

==== CollectionSearch and Terms ====

The ''CollectionSearch'' class executes searches against any collection. Use ''addTerm()'' to register search terms, then call ''execute()''.

<code php>
$search = new CollectionSearch($collection);
$search->addTerm('wiki*');
$terms = $search->execute();
foreach ($terms as $term) {
    $term->getEntityFrequencies(); // [entityName => totalFrequency, ...]
    $term->getEntityTokens();      // [entityName => [tokenName, ...], ...]
    $term->getMatches();           // [entityName => [tokenName => freq, ...], ...]
}
</code>

''addTerm()'' returns a ''Term'' object. After ''execute()'', each Term holds the full match detail: which tokens matched on which entities with what frequencies. Various accessors provide different views on this data.
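
How the three views relate to each other can be shown with plain arrays. The data below is made up and only mirrors the array shapes given in the comments above; the derivations are a sketch and not the ''Term'' implementation.

<code php>
// Made-up data in the shape of the getMatches() result.
$matches = [
    'start'       => ['wiki' => 5, 'wikitext' => 3],
    'wiki:syntax' => ['wikitext' => 2],
];

// total frequency per entity, the shape of getEntityFrequencies()
$entityFrequencies = array_map('array_sum', $matches);

// matched token names per entity, the shape of getEntityTokens()
$entityTokens = array_map('array_keys', $matches);
</code>

Both derived arrays keep the entity names as keys, because ''array_map()'' preserves string keys when it is given a single array.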

A ''Term'' represents a single search query component that can match one or more tokens in an index. Terms can include wildcards using the ''*'' character:
  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")

Matching uses efficient string functions (''==='' for exact, ''str_starts_with''/''str_ends_with''/''str_contains'' for wildcards). For case-insensitive matching, call ''caseInsensitive()'' on the search or on individual terms. This is useful for metadata/title searches where indexed values preserve case (the fulltext token index is already lowercased by the Tokenizer).

Terms organize their matching tokens by length. This is crucial for working with split indexes: a term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long. Each length group can be looked up in the corresponding suffixed token index, allowing efficient searching without loading irrelevant files.
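
The wildcard rules and the grouping by length can be sketched as follows. ''termMatches()'' and ''groupMatchesByLength()'' are hypothetical helpers for this illustration, the token list is made up, and the code assumes PHP 8 for the ''str_*'' functions mentioned above.

<code php>
// Hypothetical helpers; requires PHP 8 for str_starts_with() and friends.
function termMatches(string $term, string $token): bool
{
    $base = trim($term, '*');
    $startWild = str_starts_with($term, '*');
    $endWild = str_ends_with($term, '*');

    if ($startWild && $endWild) return str_contains($token, $base);
    if ($startWild) return str_ends_with($token, $base);
    if ($endWild) return str_starts_with($token, $base);
    return $token === $base;
}

/** Group matching tokens by byte length, mirroring how split indexes are organized. */
function groupMatchesByLength(string $term, array $tokens): array
{
    $groups = [];
    foreach ($tokens as $token) {
        if (termMatches($term, $token)) {
            $groups[strlen($token)][] = $token;
        }
    }
    ksort($groups);
    return $groups;
}

$groups = groupMatchesByLength('*wiki*', ['wiki', 'dokuwiki', 'wikilinks', 'page', 'wik']);
</code>

Each key of ''$groups'' corresponds to one suffix of a split token index, so only the files for lengths 4, 8 and 9 would need to be consulted for this token list.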

For example, searching for ''wiki*'' might find:
  * Token "wiki" appears 5 times on page "start"
  * Token "wikitext" appears 3 times on page "start"
  * ''getEntityFrequencies()'' returns ''['start' => 8]''
  * ''getMatches()'' returns ''['start' => ['wiki' => 5, 'wikitext' => 3]]''

A Term does not enforce a minimum token length. For fulltext search, callers should filter short words before calling ''addTerm()'' using ''Tokenizer::isValidSearchTerm($term)''.

==== Fulltext Search Query Processing ====

For fulltext searches a proper query language is supported (see [[:Search]]). Queries go through two stages:

=== QueryParser ===

''QueryParser::convert($query)'' parses a search query string into an intermediate representation. It supports:

  * Individual words and phrases (quoted strings)
  * Namespace filtering with ''@ns:'' or ''ns:'' and ''-ns:'' for exclusion
  * Negation with ''-'' prefix
  * Boolean ''OR'' between terms
  * Grouping with parentheses

The output includes an array in Reverse Polish Notation (RPN) used by the evaluator, plus extracted highlights, word lists, phrase lists, and namespace filters.

=== QueryEvaluator ===

''QueryEvaluator'' takes the RPN array and the ''Term'' results from ''CollectionSearch'' and evaluates the boolean logic.
It uses typed stack entries during processing:

  * **PageSet** - Concrete set of pages with scores. Supports intersect (AND), unite (OR), subtract (NOT).
  * **NamespacePredicate** - Lazy filter that only materializes when combined with a PageSet.
  * **NegatedEntry** - Wraps another entry to represent logical NOT, allowing AND to convert it to set subtraction.

The result is a list of matching pages and their frequency scores.

Phrase verification reads the raw wiki text of candidate pages. Plugins can override this via [[devel:event:FULLTEXT_PHRASE_MATCH]].


==== Background Indexing ====

Pages are indexed asynchronously by the [[:taskrunner|TaskRunner]] which is triggered after each page view. It calls ''Indexer::addPage()'' for pages that need re-indexing and ''Indexer::deletePage()'' for pages that no longer exist on disk. The CLI tool ''bin/indexer.php'' can be used to index all pages at once.

The [[devel:event:INDEXER_TASKS_RUN]] event fires during background task execution, allowing plugins to hook their own maintenance tasks into the indexing cycle.

===== Plugin Events =====

The search system fires several events that plugins can use to extend or modify indexing and search behavior.

Indexing:
  * [[devel:event:INDEXER_VERSION_GET]] - Plugins add their version to force re-indexing when the plugin changes.
  * [[devel:event:INDEXER_PAGE_ADD]] - Modify page body, title, or metadata before it enters the index.
  * [[devel:event:INDEXER_TEXT_PREPARE]] - Pre-process text before tokenization.
  * [[devel:event:INDEXER_TASKS_RUN]] - Hook into the background task runner.

Searching:
  * [[devel:event:SEARCH_QUERY_FULLPAGE]] - Intercept or replace fulltext search.
  * [[devel:event:SEARCH_QUERY_PAGELOOKUP]] - Intercept or replace page name lookup.
  * [[devel:event:FULLTEXT_SNIPPET_CREATE]] - Provide custom search result snippets.
  * [[devel:event:FULLTEXT_PHRASE_MATCH]] - Override phrase matching logic.

===== Exceptions =====

All search-related exceptions extend ''SearchException'':

  * ''SearchException'' - Base class for search/index errors
  * ''IndexAccessException'' - Failed to read an index file
  * ''IndexWriteException'' - Failed to write to an index file
  * ''IndexLockException'' - Failed to acquire or release a lock
  * ''IndexUsageException'' - Incorrect API usage (e.g. writing without a lock)
  * ''IndexIntegrityException'' - Structural inconsistency detected in the indexes