====== Search Indexing ======

This document describes the various concepts behind the indexing mechanism. It is a work in progress and will be moved to the wiki once the new index code is merged.

===== Uses =====

The indexing mechanism makes information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future.

===== Collections =====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a use case covered by a collection.

Please note: because "index" has a specific meaning in our context (see below), you should avoid using that word when you are actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

Collections have two independent properties:

=== Collection Types ===

The collection type determines how tokens relate to entities:

  * frequency collections - The same token can appear multiple times in the same entity, and searches are usually interested in the number of times it appears. This is the words-on-pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, thus all frequencies are 1. Searches do not care about the frequency but are only interested in whether a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example, a page has exactly one title.
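The input-processing difference between frequency and lookup collections can be sketched as follows. This is a Python illustration only; the function names are made up and the actual implementation is PHP:

```python
from collections import Counter

# Hypothetical illustration of how the two collection types process
# the same token stream on input; the real implementation is PHP.

def frequency_input(tokens):
    """Frequency collection: count how often each token occurs."""
    return dict(Counter(tokens))

def lookup_input(tokens):
    """Lookup collection: deduplicate, every token gets a frequency of 1."""
    return {token: 1 for token in tokens}

words = ["wiki", "page", "wiki", "search"]
print(frequency_input(words))  # {'wiki': 2, 'page': 1, 'search': 1}
print(lookup_input(words))     # {'wiki': 1, 'page': 1, 'search': 1}
```

Direct collections need neither step, since exactly one token exists per entity.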

=== Index File Splitting ===

Independently of the collection type, a collection can use split or non-split token indexes. When split, the token and frequency indexes are distributed across multiple files based on token length. This also affects the reverse index format (see below). Any collection type can use either mode.

=== Indexes ===

A collection works on four indexes:

  * entity - This is the main entity that will be the result of a search. Eg. a page
  * token - This is the actual information strewn across the entities. Eg. words
    * can be split into several files based on token length (see Index File Splitting below)
  * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page
    * if tokens are split into several files, this is too
  * reverse - This records which tokens are assigned to a specific entity. This is used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up. The format depends on whether the collection uses split indexes or not.

The latter two index files do not exist for direct collections.

Collections can be searched for one or more ''terms'' via a CollectionSearch class. In the simplest case a term is a single token. But it is also possible to use the wildcard signifier ''*'' at the start and/or end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity.

==== List of Collections ====

^ Name ^ Type ^ Split? ^ Entity ^ Token ^ Frequency ^ Reverse ^
| FullText | frequency | yes | page | w* | i* | pageword |
| MetaRelationMedia | lookup | no | page | relation_media_w | relation_media_i | relation_media_p |
| MetaRelationReferences | lookup | no | page | relation_references_w | relation_references_i | relation_references_p |

==== Terms ====

A ''Term'' is a representation of a single search query component that can match one or more tokens in an index. The Term class is used by CollectionSearch implementations to handle wildcard searches and to track which entities contain matching tokens.

**Wildcard Support:**

Terms can include wildcards using the ''*'' character:

  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")

The Term class internally handles these wildcards by:

  * Storing the original term with wildcards
  * Extracting the base term (without wildcard characters)
  * Converting wildcards into a regular expression pattern for matching
  * Tracking which type of wildcard is used (none, start, end, or both)

**Length-Based Organization:**

Terms organize their matching tokens by length. This is crucial for working with split indexes:

  * A term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks), but never 3-letter words, because the base term "wiki" is already 4 letters long.
  * Each length group can be looked up in the corresponding suffixed token index.
  * This allows efficient searching across split indexes without loading irrelevant files.

**Token and Frequency Tracking:**

During a search operation, Terms:

  1. Collect all token IDs that match the term pattern (organized by token length)
  2. Look up which entities contain those tokens
  3. Aggregate the frequencies across all matching tokens
  4. Map entity IDs to entity names for the final result

For example, searching for ''wiki*'' might find:

  * Token "wiki" (ID 42) appears 5 times on page "start" (ID 10)
  * Token "wikitext" (ID 87) appears 3 times on page "start" (ID 10)
  * Term result: "start" matches with a total frequency of 8

**Validation:**

Terms are validated on creation:

  * Minimum length requirements are enforced (except for numeric terms)
  * Terms that are too short throw a SearchException
  * The base term (without wildcards) must meet the minimum token length configured in the Tokenizer

===== Indexes =====

Indexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies.

Indexes are row based. The line number is an important piece of information in the index. The lines are counted from zero and referred to as ''rid'' in the code.

Index files can be accessed through two classes:

  * \dokuwiki\Search\Index\FileIndex
  * \dokuwiki\Search\Index\MemoryIndex

Both classes expose the same API; the only difference is their way of accessing the data.

A FileIndex will read through the index file line by line without ever loading the full file into memory. Each modification will directly write back to the index file.

A MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.

Which method to use depends mostly on the size of the file.

Usually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.

Within an index, two kinds of data can be stored per row:

  * A single value. Eg. an entity or a token
  * A list of tuples. Eg. a list of pageIDs and frequencies

The former is straightforward: it is a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'', where the key is usually the ''rid'' in another index.

==== Index Types ====

A Collection consists of 4 (or 2 for direct collections) index types:

The **entity** index lists the main entity the index will return as a result. entity.RID -> entity

The **token** index contains the tokens used to search (eg. words). token.RID -> token

The **frequency** index contains tuples of entity.RIDs and usage frequencies. token.RID -> entity.RID*frequency:...

The **reverse** index records which tokens are assigned to each entity. Its format depends on whether the collection uses split indexes:

  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair, because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''

The reverse index does not store frequencies. It is used internally when updating an entity: the old reverse record provides the set of previously assigned tokens so they can be cleaned up (see the addEntity description in the Collection Types section above).

Direct collections only use entity and token index files with entity.RID === token.RID.

==== Index File Splitting ====

To improve memory efficiency and access speed, a collection can split its token and frequency indexes into multiple physical files using suffixes based on token length. This is configured per collection and is independent of the collection type (frequency or lookup).

When splitting is enabled, both the token index and the frequency index are distributed across files. A suffix parameter is appended to the base index name to create the actual filename. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter tokens)
  * Resulting file: ''w3.idx''

In a fulltext collection with splitting enabled:

  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
  * and so on...

When splitting is disabled, a single file is used for each index (eg. ''relation_media_w.idx'').

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (eg. when a Term is using a wildcard).

==== Tuple Data Format ====

Tuple-based index rows store associations between keys (typically RIDs from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:

  * ''key'' - Usually the RID from another index (e.g., a page ID)
  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:

<code>
42*5:17*3:98*12
</code>

This means:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

Frequencies of 1 are not stored in the index. For example:

<code>
42*5:17:98
</code>

The above would be interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 1 time
  * Entity with RID 98 contains this word 1 time

The ''TupleOps'' class provides utility methods for working with tuple records:

  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record

===== Locking =====

Only one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.

Indexes can be opened in write or read-only mode according to the acquired locks. However, managing locks has to be done outside the index, usually within a collection.

The ''Lock'' class is used to acquire the needed locks.
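To illustrate the idea of exclusive write locking, here is a minimal Python sketch using POSIX ''flock''. This is an illustration only, not DokuWiki's actual ''Lock'' class; the class name and lock file path are made up for this example, and a portable implementation would need a different primitive on Windows:

```python
import fcntl
import os
import tempfile

class IndexLock:
    """Sketch of an exclusive write lock for an index (illustration only;
    the real Lock class differs in detail)."""

    def __init__(self, lockfile):
        self.lockfile = lockfile
        self.fd = None

    def acquire(self):
        # Only one process may hold the exclusive lock; other writers
        # calling acquire() block here until release() is called.
        self.fd = os.open(self.lockfile, os.O_CREAT | os.O_RDWR)
        fcntl.flock(self.fd, fcntl.LOCK_EX)

    def release(self):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
        self.fd = None

lockfile = os.path.join(tempfile.gettempdir(), "example_index.lock")
lock = IndexLock(lockfile)
lock.acquire()
# ... write to the index files while holding the lock ...
lock.release()
```

Because the lock is advisory, every writer must go through the same locking path; this is why lock management lives in the collection rather than in each index.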