====== Search Indexing ======

The indexing mechanism is meant to make information that is normally distributed over several locations (e.g. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page metadata and possibly more in the future.

===== Core API =====

Most code interacting with the search index will use one of three high-level classes. All live in the ''\dokuwiki\Search'' namespace.

==== Indexer ====

The ''Indexer'' class manages the search index. It coordinates all collections and handles the actual writing.

  * ''addPage($page, $force)'' - Add or re-index a page. Handles locking, tokenization, and metadata internally.
  * ''deletePage($page, $force)'' - Remove a page from all indexes.
  * ''renamePage($oldpage, $newpage)'' - Update the page name in the entity index.
  * ''needsIndexing($page, $force)'' - Check whether a page needs (re-)indexing based on version and modification time.
  * ''getAllPages($existsFilter)'' - Return all indexed page names. Optionally filter to only pages that exist on disk.
  * ''getVersion()'' - Return the indexer version string, including plugin versions (see [[devel:event:INDEXER_VERSION_GET]]).
  * ''clear()'' - Delete all index files.
  * ''checkIntegrity()'' - Verify structural consistency across all indexes.
  * ''setLogger($callback)'' - Register a logging callback for progress output.

==== FulltextSearch ====

The ''FulltextSearch'' class handles fulltext search queries.

  * ''pageSearch($query, &$highlight, $sort, $after, $before)'' - Run a fulltext search. Returns matching pages as ''pageid => score''. The ''$highlight'' array is filled with terms to highlight. ''$sort'' can be ''"hits"'' (default) or ''"mtime"''. ''$after''/''$before'' filter by modification time.
  * ''snippet($id, $highlight)'' - Generate a search result snippet for a page.

==== MetadataSearch ====

The ''MetadataSearch'' class provides search operations on page metadata.

  * ''pageLookup($id, $in_ns, $in_title, $after, $before)'' - Quick search for page names. Optionally matches against the namespace and title.
  * ''lookupKey($key, $value)'' - Find pages by metadata value. Supports exact match and wildcards (''*'' at start/end). When ''$value'' is a string, the result is a flat list of page names. When it is an array, the result is keyed by each search value.
  * ''backlinks($id, $ignore_perms)'' - Find all pages linking to ''$id''.
  * ''mediause($id, $ignore_perms)'' - Find all pages using a media file.
  * ''getPages($key)'' - Return all indexed pages, optionally limited to those having a value for the given metadata key.

===== Internals =====

The following sections describe how the search index is structured and how the core classes work together to provide the indexing and search functionality. This is meant for developers who want to understand the inner workings of the search system or make use of it in their own plugins.

==== Indexes ====

Indexes refer to individual index files that store one kind of information, for example a list of all page names or a list of page-word frequencies.

Indexes are row based, and the line number itself carries information. Lines are counted from zero, and the line number is referred to as ''rid'' in the code.

All index files are stored in the ''data/index'' directory. The file name is the name of the index with an ''idx'' extension. For example, the page name index is stored in ''data/index/page.idx''. Some indexes have additional suffixes (e.g. ''w3.idx'') to split the data into multiple files (see [[#Index File Splitting]] below).

Index files can be accessed through two classes:

  * ''\dokuwiki\Search\Index\FileIndex''
  * ''\dokuwiki\Search\Index\MemoryIndex''

Both classes expose the same API; the only difference is how they access the data.

A FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.

The MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.

Which method to use depends mostly on the size of the file.

Usually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.

Within an index two kinds of data can be stored per row:

  * A single value, e.g. an entity or a token
  * A list of tuples, e.g. a list of pageIDs and frequencies

The former is straightforward: it is a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'' where key is usually the ''rid'' in another index.
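The difference between the two access styles can be illustrated with a minimal, self-contained sketch. The helper names ''read_row_streaming()'' and ''read_rows_into_memory()'' are hypothetical and are not part of ''FileIndex'' or ''MemoryIndex''; the sketch only shows that the line number acts as the ''rid'' in both cases.

<code php>
// Hypothetical helpers; these are NOT the FileIndex/MemoryIndex implementations.

/** Stream through the file and return the row at $rid without loading the whole file. */
function read_row_streaming(string $file, int $rid): ?string
{
    $fh = fopen($file, 'r');
    if ($fh === false) return null;
    $line = 0;
    $found = null;
    while (($row = fgets($fh)) !== false) {
        if ($line === $rid) {
            $found = rtrim($row, "\n");
            break;
        }
        $line++;
    }
    fclose($fh);
    return $found;
}

/** Load all rows into an array; the array key is the rid. */
function read_rows_into_memory(string $file): array
{
    return file($file, FILE_IGNORE_NEW_LINES);
}

// a tiny single-value index: rid 0 = start, rid 1 = wiki:syntax, ...
$file = tempnam(sys_get_temp_dir(), 'idx');
file_put_contents($file, "start\nwiki:syntax\nwiki:welcome\n");

$row  = read_row_streaming($file, 1);   // 'wiki:syntax'
$rows = read_rows_into_memory($file);   // ['start', 'wiki:syntax', 'wiki:welcome']
unlink($file);
</code>

The streaming variant keeps memory use constant but has to scan the file on every lookup, which is the trade-off described above.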

=== Index File Splitting ===

To improve memory efficiency and access speed, token and frequency indexes can be split into multiple physical files using suffixes based on token length. A suffix parameter is appended to the base index name to create the actual filename. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter tokens)
  * Resulting file: ''w3.idx''

> Note: token lengths are counted in bytes, not characters. This means that for languages with multi-byte characters, the suffixes will reflect the byte length of the tokens, which may differ from the character count.

In a fulltext collection with splitting enabled:

  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
  * and so on...

When splitting is disabled, a single file is used for each index (e.g. ''relation_media_w.idx'').

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (e.g. when a Term is using a wildcard).
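As a rough illustration of the naming scheme, the following sketch derives the split file name from a token's byte length and lists the files a wildcard lookup would have to visit. Both function names are hypothetical, and the upper bound passed to ''candidate_files()'' stands in for whatever ''max()'' would report.

<code php>
// Hypothetical helpers, not part of the DokuWiki API.

/** Build the split index file name for a token. */
function split_index_file(string $base, string $token): string
{
    // strlen() counts bytes, which matches the note on multi-byte characters
    return $base . strlen($token) . '.idx';
}

/** List all split files between a minimum length and the highest suffix in use. */
function candidate_files(string $base, int $minLength, int $maxSuffix): array
{
    $files = [];
    for ($len = $minLength; $len <= $maxSuffix; $len++) {
        $files[] = $base . $len . '.idx';
    }
    return $files;
}

$ascii = split_index_file('w', 'wiki');          // 'w4.idx'
$utf8  = split_index_file('w', "\u{00FC}ber");   // 'w5.idx': the umlaut takes two bytes in UTF-8
$files = candidate_files('w', 4, 6);             // ['w4.idx', 'w5.idx', 'w6.idx']
</code>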

=== Tuple Data Format ===

Tuple-based index rows store associations between keys (typically ''rid''s from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:
  * ''key'' - Usually the ''rid'' from another index (e.g. a page ID)
  * ''count'' - A numeric value (e.g. how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:
<code>
42*5:17*3:98*12
</code>

This means:
  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

A frequency of 1 is not written out explicitly; only the key is stored. For example:

<code>
42*5:17:98
</code>

The above row is interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 1 time
  * Entity with RID 98 contains this word 1 time

The ''TupleOps'' class provides utility methods for working with tuple records:
  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record
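The format rules above can be sketched in a few lines. The functions ''parse_tuple_record()'' and ''format_tuple_record()'' are hypothetical stand-ins written for this example; the actual ''TupleOps'' signatures may differ.

<code php>
// Hypothetical stand-ins; the real TupleOps signatures may differ.

/** Parse "key*count:key:..." into [key => count]. */
function parse_tuple_record(string $record): array
{
    $result = [];
    if ($record === '') return $result;
    foreach (explode(':', $record) as $tuple) {
        $parts = explode('*', $tuple, 2);
        // a missing count means the frequency is 1
        $result[$parts[0]] = isset($parts[1]) ? (int)$parts[1] : 1;
    }
    return $result;
}

/** Turn [key => count] back into the compact string, omitting counts of 1. */
function format_tuple_record(array $tuples): string
{
    $out = [];
    foreach ($tuples as $key => $count) {
        $out[] = $count === 1 ? (string)$key : $key . '*' . $count;
    }
    return implode(':', $out);
}

$parsed = parse_tuple_record('42*5:17:98');   // [42 => 5, 17 => 1, 98 => 1]
$total  = array_sum($parsed);                 // 7
$record = format_tuple_record($parsed);       // '42*5:17:98'
</code>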

==== Collections ====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. For example, fulltext search for page contents is a use case covered by a collection.

> Please note: because index has a specific meaning in our context (see above), you should avoid using that word when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

A collection manages up to four indexes:

  * **entity** - The main entity that will be the result of a search, e.g. a page. ''entity.RID -> entity''
  * **token** - The actual information strewn across the entities, e.g. words. ''token.RID -> token''
  * **frequency** - Maps tokens to entities and records their frequency. ''token.RID -> entity.RID*frequency:...''
  * **reverse** - Records which tokens are assigned to each entity. Used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up.

The reverse index format depends on whether the collection uses split indexes:
  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''

Collections have two independent properties: a type and whether they use split indexes or not.

The **collection type** determines how tokens relate to entities:

  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, so all frequencies are 1. Searches do not care about the frequency; they are only interested in whether a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example a page has exactly one title. Direct collections only use entity and token index files (entity.RID === token.RID), no frequency or reverse indexes.

Independently of the collection type, a collection can use **split or non-split token indexes**. See the [[#Index File Splitting]] section above.

^ Name                   ^ Type      ^ Split? ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^
| FullText               | frequency | yes    | page   | w*                    | i*                    | pageword              |
| Title                  | direct    | no     | page   | title                 | -                     | -                     |
| MetaRelationMedia      | lookup    | no     | page   | relation_media_w      | relation_media_i      | relation_media_p      |
| MetaRelationReferences | lookup    | no     | page   | relation_references_w | relation_references_i | relation_references_p |
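To make the relationship between the entity, token and frequency indexes concrete, here is a toy lookup over in-memory arrays. The data and the function ''pages_for_word()'' are invented for this example; a real collection reads these rows from index files.

<code php>
// Toy data: each array position is the rid of that row (all values are illustrative).
$entity    = ['start', 'wiki:syntax'];   // entity.RID -> entity
$token     = ['wiki', 'syntax'];         // token.RID  -> token
$frequency = ['0*5:1*2', '1*3'];         // token.RID  -> entity.RID*frequency:...

/** Hypothetical lookup: which pages contain $word, and how often? */
function pages_for_word(string $word, array $entity, array $token, array $frequency): array
{
    $tid = array_search($word, $token, true);   // token -> token.RID
    if ($tid === false || $frequency[$tid] === '') return [];

    $pages = [];
    foreach (explode(':', $frequency[$tid]) as $tuple) {
        $parts = explode('*', $tuple, 2);
        // entity.RID -> entity name; a missing count means 1
        $pages[$entity[(int)$parts[0]]] = isset($parts[1]) ? (int)$parts[1] : 1;
    }
    return $pages;
}

$hits = pages_for_word('wiki', $entity, $token, $frequency);
// ['start' => 5, 'wiki:syntax' => 2]
</code>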

=== Writing data ===

''addEntity($entity, $tokens)'' is the main method for writing data to a collection. It replaces all previously stored tokens for the given entity. An empty token list removes the entity's data. The collection must be locked before calling this method.

<code php>
$collection = new PageFulltextCollection($pageIndex);
$collection->lock();
$collection->addEntity('wiki:page', $words);
$collection->unlock();
</code>

Internally, ''addEntity()'' reads the reverse index to find the entity's old tokens, resolves the new tokens to IDs (creating them in the token index if needed), merges old and new, and updates the frequency and reverse indexes accordingly. Tokens no longer present are automatically removed from the frequency index.

For direct collections, ''addEntity()'' simply writes the first token at the entity's position in the token index.
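The update steps for a frequency collection can be sketched against a simplified in-memory model. This is only an approximation under stated assumptions: ''add_entity_sketch()'' is a hypothetical name, the collection is non-split, locking is left out, and tuple rows are kept as parsed arrays instead of strings.

<code php>
// Hypothetical in-memory model of a non-split frequency collection.
// Array positions are rids; tuple rows are kept as parsed arrays for brevity.
function add_entity_sketch(array &$idx, string $entity, array $tokens): void
{
    // resolve (or create) the entity rid
    $eid = array_search($entity, $idx['entity'], true);
    if ($eid === false) {
        $idx['entity'][] = $entity;
        $eid = count($idx['entity']) - 1;
    }

    // the reverse index tells us which tokens the entity had before
    $old = $idx['reverse'][$eid] ?? [];

    // resolve new tokens to ids, creating them if needed, and count occurrences
    $new = [];
    foreach ($tokens as $token) {
        $tid = array_search($token, $idx['token'], true);
        if ($tid === false) {
            $idx['token'][] = $token;
            $tid = count($idx['token']) - 1;
        }
        $new[$tid] = ($new[$tid] ?? 0) + 1;
    }

    // remove the entity from frequency rows of tokens it no longer has
    foreach ($old as $tid) {
        if (!isset($new[$tid])) unset($idx['frequency'][$tid][$eid]);
    }
    // write the new counts and the new reverse record
    foreach ($new as $tid => $count) {
        $idx['frequency'][$tid][$eid] = $count;
    }
    $idx['reverse'][$eid] = array_keys($new);
}

$idx = ['entity' => [], 'token' => [], 'frequency' => [], 'reverse' => []];
add_entity_sketch($idx, 'wiki:page', ['wiki', 'page', 'wiki']);
add_entity_sketch($idx, 'wiki:page', ['wiki', 'syntax']);
// "page" (token rid 1) no longer points at the entity; "wiki" is counted once now
</code>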

=== Reading data ===

Collections provide some basic information retrieval methods, but they are not meant for searching.

  * ''getEntitiesWithData()'' - Return all entity names that have data in this collection.
  * For direct collections, ''getToken($entity)'' retrieves the single token stored for an entity (e.g. a page title).

Searching across a collection is done through the ''CollectionSearch'' class (see below).

==== Locking ====

Only one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.

Indexes are opened in read-only mode by default. Passing ''$isWritable = true'' to the constructor (or calling ''lock()'' later) acquires a lock and enables writing. Calling ''unlock()'' releases it.

The ''Lock'' class is a static registry with reference counting. ''Lock::acquire($name)'' creates a filesystem lock directory. Multiple calls within the same process share a single lock via reference counting. ''Lock::release($name)'' decrements the count and removes the directory when it reaches zero. Stale locks older than 5 minutes are automatically broken.

Collections call ''lock()'' to acquire locks for all their indexes at once, and ''unlock()'' to release them.
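The reference-counting idea can be sketched as follows. ''SketchLock'' is a hypothetical class written for this example and is not the actual ''Lock'' implementation; the lock directory location, the naming scheme and the use of ''filemtime()'' to detect stale locks are all assumptions.

<code php>
// Hypothetical sketch; not the actual Lock class.
final class SketchLock
{
    /** @var array<string,int> reference counts per lock name */
    private static array $counts = [];

    private const STALE_AFTER = 300; // seconds (5 minutes)

    public static function acquire(string $name): bool
    {
        if ((self::$counts[$name] ?? 0) > 0) {
            self::$counts[$name]++;   // already held by this process
            return true;
        }
        $dir = self::dir($name);
        // break a stale lock left behind by a crashed process
        if (is_dir($dir) && time() - filemtime($dir) > self::STALE_AFTER) {
            @rmdir($dir);
        }
        // mkdir() fails if the directory exists, so only one process can succeed
        if (!@mkdir($dir)) return false;
        self::$counts[$name] = 1;
        return true;
    }

    public static function release(string $name): void
    {
        if ((self::$counts[$name] ?? 0) === 0) return;
        if (--self::$counts[$name] === 0) {
            @rmdir(self::dir($name));
            unset(self::$counts[$name]);
        }
    }

    private static function dir(string $name): string
    {
        return sys_get_temp_dir() . '/sketch_' . md5($name) . '.lock';
    }
}

$first  = SketchLock::acquire('page');   // creates the lock directory
$second = SketchLock::acquire('page');   // only increments the counter
SketchLock::release('page');             // directory is still present
SketchLock::release('page');             // count reaches zero, directory removed
</code>

A directory is used here because creating one either succeeds or fails as a single operation, which makes it a simple cross-process mutex.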

==== Tokenizer ====

The ''Tokenizer'' class (in ''\dokuwiki\Search'') is responsible for splitting text into indexable tokens.

''Tokenizer::getWords($text, $wc)'' splits the given text into an array of lowercase tokens. Tokens shorter than the minimum word length (default 2, configurable via ''IDX_MINWORDLENGTH'') are discarded, as are language-specific stop words loaded from ''inc/lang/<lang>/stopwords.txt''. Asian characters receive special treatment: they are separated into individual characters and measured with a length function that accounts for multi-byte sequences.

When ''$wc'' is true, wildcard characters (''*'') are preserved in the output. This is used by the query parser.

The [[devel:event:INDEXER_TEXT_PREPARE]] event fires before tokenization, allowing plugins to pre-process the text.
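A heavily simplified sketch of the tokenization steps is shown below. ''sketch_get_words()'' is a hypothetical function, the stop word list is passed in by hand, and the special handling of Asian characters, the configuration lookup and the plugin event are all left out. It assumes the ''mbstring'' extension is available.

<code php>
// Simplified, hypothetical tokenizer; the real Tokenizer does considerably more.
function sketch_get_words(string $text, array $stopwords, int $minLength = 2, bool $wc = false): array
{
    $text = mb_strtolower($text, 'UTF-8');
    // split on anything that is not a letter or digit; keep "*" only for wildcard queries
    $pattern = $wc ? '/[^\p{L}\p{N}*]+/u' : '/[^\p{L}\p{N}]+/u';
    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    $result = [];
    foreach ($words as $word) {
        if (strlen($word) < $minLength) continue;          // too short
        if (in_array($word, $stopwords, true)) continue;   // stop word
        $result[] = $word;
    }
    return $result;
}

$words = sketch_get_words('The Wiki is a wiki!', ['the', 'is']);
// ['wiki', 'wiki']
$query = sketch_get_words('Doku* search', [], 2, true);
// ['doku*', 'search']
</code>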

==== CollectionSearch and Terms ====

The ''CollectionSearch'' class executes searches against any collection. Use ''addTerm()'' to register search terms, then call ''execute()''.

<code php>
$search = new CollectionSearch($collection);
$search->addTerm('wiki*');
$terms = $search->execute();
foreach ($terms as $term) {
    $term->getEntityFrequencies(); // [entityName => totalFrequency, ...]
    $term->getEntityTokens();      // [entityName => [tokenName, ...], ...]
    $term->getMatches();           // [entityName => [tokenName => freq, ...], ...]
}
</code>

''addTerm()'' returns a ''Term'' object. After ''execute()'', each Term holds the full match detail: which tokens matched on which entities with what frequencies. Various accessors provide different views on this data.

A ''Term'' represents a single search query component that can match one or more tokens in an index. Terms can include wildcards using the ''*'' character:
  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g. "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g. "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g. "wiki", "dokuwiki", "wikitext")

Matching uses efficient string functions (''==='' for exact, ''str_starts_with''/''str_ends_with''/''str_contains'' for wildcards). For case-insensitive matching, call ''caseInsensitive()'' on the search or on individual terms. This is useful for metadata/title searches where indexed values preserve case (the fulltext token index is already lowercased by the Tokenizer).

Terms organize their matching tokens by length. This is crucial for working with split indexes: a term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long. Each length group can be looked up in the corresponding suffixed token index, allowing efficient searching without loading irrelevant files.

For example, searching for ''wiki*'' might find:
  * Token "wiki" appears 5 times on page "start"
  * Token "wikitext" appears 3 times on page "start"
  * ''getEntityFrequencies()'' returns ''['start' => 8]''
  * ''getMatches()'' returns ''['start' => ['wiki' => 5, 'wikitext' => 3]]''

Term does not enforce a minimum token length. For fulltext search, callers should filter short words before calling ''addTerm()'' using ''Tokenizer::isValidSearchTerm($term)''.
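The wildcard rules and the grouping by length can be illustrated with a small sketch. ''term_matches()'' is a hypothetical helper, not a method of ''Term'', and it requires PHP 8 for the ''str_*'' functions.

<code php>
// Hypothetical helper (requires PHP 8 for the str_* functions).
function term_matches(string $term, string $token): bool
{
    $starts = str_starts_with($term, '*');   // leading wildcard
    $ends   = str_ends_with($term, '*');     // trailing wildcard
    $base   = trim($term, '*');

    if ($starts && $ends) return str_contains($token, $base);
    if ($starts) return str_ends_with($token, $base);
    if ($ends) return str_starts_with($token, $base);
    return $token === $base;
}

$tokens = ['wiki', 'dokuwiki', 'wikitext', 'syntax'];

$prefix = array_values(array_filter($tokens, fn($t) => term_matches('wiki*', $t)));
// ['wiki', 'wikitext']
$suffix = array_values(array_filter($tokens, fn($t) => term_matches('*wiki', $t)));
// ['wiki', 'dokuwiki']
$any = array_values(array_filter($tokens, fn($t) => term_matches('*wiki*', $t)));
// ['wiki', 'dokuwiki', 'wikitext']

// group matches by byte length, mirroring how split indexes are addressed
$byLength = [];
foreach ($any as $t) {
    $byLength[strlen($t)][] = $t;
}
// [4 => ['wiki'], 8 => ['dokuwiki', 'wikitext']]
</code>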

==== Fulltext Search Query Processing ====

For fulltext searches, a proper query language is supported (see [[:Search]]). Queries go through two stages:

=== QueryParser ===

''QueryParser::convert($query)'' parses a search query string into an intermediate representation. It supports:

  * Individual words and phrases (quoted strings)
  * Namespace filtering with ''@ns:'' or ''ns:'', and ''-ns:'' for exclusion
  * Negation with ''-'' prefix
  * Boolean ''OR'' between terms
  * Grouping with parentheses

The output includes an array in Reverse Polish Notation (RPN) used by the evaluator, plus extracted highlights, word lists, phrase lists, and namespace filters.

=== QueryEvaluator ===

''QueryEvaluator'' takes the RPN array and the ''Term'' results from ''CollectionSearch'' and evaluates the boolean logic. It uses typed stack entries during processing:

  * **PageSet** - Concrete set of pages with scores. Supports intersect (AND), unite (OR), subtract (NOT).
  * **NamespacePredicate** - Lazy filter that only materializes when combined with a PageSet.
  * **NegatedEntry** - Wraps another entry to represent logical NOT, allowing AND to convert it to set subtraction.

The result is a list of matching pages and their frequency scores.

Phrase verification reads the raw wiki text of candidate pages. Plugins can override this via [[devel:event:FULLTEXT_PHRASE_MATCH]].
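The stack-based evaluation can be sketched in simplified form. Everything here is an assumption made for illustration: the RPN token format (plain words plus ''AND'', ''OR'', ''NOT''), the function names, and the choice to sum scores when combining sets. Namespace predicates and phrases are not modelled.

<code php>
// Hypothetical sketch: stack entries are ['pages' => [page => score], 'neg' => bool].
function rpn_combine(string $op, array $a, array $b): array
{
    if ($op === 'AND' && $a['neg'] && !$b['neg']) {
        [$a, $b] = [$b, $a];   // put the positive set first
    }
    if ($op === 'AND' && !$a['neg'] && $b['neg']) {
        // "a AND NOT b" becomes set subtraction
        return ['pages' => array_diff_key($a['pages'], $b['pages']), 'neg' => false];
    }
    if ($a['neg'] || $b['neg']) {
        throw new InvalidArgumentException('negation is only handled inside AND in this sketch');
    }
    $pages = $op === 'AND' ? [] : $a['pages'];
    foreach ($b['pages'] as $page => $score) {
        if ($op === 'AND' && !isset($a['pages'][$page])) continue;
        $pages[$page] = ($a['pages'][$page] ?? 0) + $score;   // assumption: scores add up
    }
    return ['pages' => $pages, 'neg' => false];
}

function evaluate_rpn(array $rpn, array $sets): array
{
    $stack = [];
    foreach ($rpn as $item) {
        if ($item === 'NOT') {
            $a = array_pop($stack);
            $a['neg'] = !$a['neg'];   // mark as negated, resolved later by AND
            $stack[] = $a;
        } elseif ($item === 'AND' || $item === 'OR') {
            $b = array_pop($stack);
            $a = array_pop($stack);
            $stack[] = rpn_combine($item, $a, $b);
        } else {
            $stack[] = ['pages' => $sets[$item] ?? [], 'neg' => false];
        }
    }
    $top = array_pop($stack);
    if ($top['neg']) throw new LogicException('a purely negative query cannot be materialized');
    return $top['pages'];
}

$sets = [
    'wiki'   => ['start' => 5, 'syntax' => 2],
    'plugin' => ['syntax' => 1, 'devel' => 4],
    'old'    => ['devel' => 1],
];
$both   = evaluate_rpn(['wiki', 'plugin', 'AND'], $sets);
// ['syntax' => 3]
$either = evaluate_rpn(['wiki', 'plugin', 'OR', 'old', 'NOT', 'AND'], $sets);
// ['start' => 5, 'syntax' => 3]
</code>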

==== Background Indexing ====

Pages are indexed asynchronously by the [[:taskrunner|TaskRunner]], which is triggered after each page view. It calls ''Indexer::addPage()'' for pages that need re-indexing and ''Indexer::deletePage()'' for pages that no longer exist on disk. The CLI tool ''bin/indexer.php'' can be used to index all pages at once.

The [[devel:event:INDEXER_TASKS_RUN]] event fires during background task execution, allowing plugins to hook their own maintenance tasks into the indexing cycle.

===== Plugin Events =====

The search system fires several events that plugins can use to extend or modify indexing and search behavior.

Indexing:
  * [[devel:event:INDEXER_VERSION_GET]] - Plugins add their version to force re-indexing when the plugin changes.
  * [[devel:event:INDEXER_PAGE_ADD]] - Modify page body, title, or metadata before it enters the index.
  * [[devel:event:INDEXER_TEXT_PREPARE]] - Pre-process text before tokenization.
  * [[devel:event:INDEXER_TASKS_RUN]] - Hook into the background task runner.

Searching:
  * [[devel:event:SEARCH_QUERY_FULLPAGE]] - Intercept or replace fulltext search.
  * [[devel:event:SEARCH_QUERY_PAGELOOKUP]] - Intercept or replace page name lookup.
  * [[devel:event:FULLTEXT_SNIPPET_CREATE]] - Provide custom search result snippets.
  * [[devel:event:FULLTEXT_PHRASE_MATCH]] - Override phrase matching logic.

===== Exceptions =====

All search-related exceptions extend ''SearchException'':

  * ''SearchException'' - Base class for search/index errors
  * ''IndexAccessException'' - Failed to read an index file
  * ''IndexWriteException'' - Failed to write to an index file
  * ''IndexLockException'' - Failed to acquire or release a lock
  * ''IndexUsageException'' - Incorrect API usage (e.g. writing without a lock)
  * ''IndexIntegrityException'' - Structural inconsistency detected in the indexes