xref: /dokuwiki/inc/Search/concept.txt (revision 7f394dd683d14aba9f6131a7f9d011383c280834)
====== Search Indexing ======

This document describes the various concepts behind the indexing mechanism. It is a work in progress and will be moved to the wiki once the new index code is merged.

===== Uses =====

The indexing mechanism makes information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future.

===== Collections =====

A collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a use case covered by a collection.

Please note: because "index" has a specific meaning in our context (see below), you should avoid using that word when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.

There are basically three different complexities of collections:

  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words-on-pages use case.
  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity, thus all frequencies are 1. Searches do not care about the frequency but are only interested in whether a token appears for the entity or not.
  * direct collections - Here a 1:1 relation between the entity and a token exists. For example, a page has exactly one title.

A collection works on four indexes:

  * entity - This is the main entity that will be the result of a search. Eg. a page
  * token - This is the actual information strewn across the entities. Eg. words
    * can be split into several files (usually based on token length)
  * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page
    * if tokens are split into several files, this is too
  * reverse - This records which tokens are on a specific entity. This is mostly used for internal index cleanup. Eg. page-words

The latter two index files do not exist for direct collections.

Collections can be searched for one or more ''terms'' via a CollectionSearch class. In the simplest case a term is a single token. But it is also possible to use wildcard signifiers ''*'' at the start and end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity.

==== List of Collections ====

^ Name                   ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^ Uses split Tokens? ^
| FullText               | page   | w*                    | i*                    | pageword              | yes                |
| MetaRelationMedia      | page   | relation_media_w      | relation_media_i      | relation_media_p      | no                 |
| MetaRelationReferences | page   | relation_references_w | relation_references_i | relation_references_p | no                 |

==== Terms ====

A ''Term'' is a representation of a single search query component that can match one or more tokens in an index. The Term class is used by CollectionSearch implementations to handle wildcard searches and track which entities contain matching tokens.

**Wildcard Support:**

Terms can include wildcards using the ''*'' character:
  * ''wiki'' - matches exactly "wiki"
  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")

The Term class internally handles these wildcards by:
  * Storing the original term with wildcards
  * Extracting the base term (without wildcard characters)
  * Converting wildcards into a regular expression pattern for matching
  * Tracking which type of wildcard is used (none, start, end, or both)
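The conversion described above can be sketched as follows. This is an illustrative Python sketch, not the actual PHP implementation; the function name is hypothetical.

```python
import re

def compile_term(term: str):
    """Derive the base term, the wildcard type, and a matching regex
    from a raw search term that may carry leading/trailing '*'."""
    starts = term.startswith('*')
    ends = term.endswith('*')
    base = term.strip('*')  # base term without wildcard characters
    # anchor the pattern only on the sides that carry no wildcard
    pattern = ('' if starts else '^') + re.escape(base) + ('' if ends else '$')
    return base, (starts, ends), re.compile(pattern)

base, wildcards, rx = compile_term('*wiki')
assert base == 'wiki' and wildcards == (True, False)
assert rx.search('dokuwiki') and not rx.search('wikitext')
```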

**Length-Based Organization:**

Terms organize their matching tokens by length. This is crucial for working with split indexes:
  * A term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long.
  * Each length group can be looked up in the corresponding suffixed token index
  * This allows efficient searching across split indexes without loading irrelevant files
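The length grouping can be sketched like this. Again an illustrative Python sketch under the assumption that each length group maps to one suffixed token index file; the function name is hypothetical.

```python
from collections import defaultdict

def group_matches_by_length(tokens, base_term: str):
    """Group tokens containing base_term by their length. Tokens shorter
    than the base term can never match and are skipped up front."""
    groups = defaultdict(list)
    for token in tokens:
        if len(token) >= len(base_term) and base_term in token:
            groups[len(token)].append(token)
    return dict(groups)

# each length key corresponds to one suffixed token index, e.g. w4.idx, w8.idx
groups = group_matches_by_length(['wiki', 'dokuwiki', 'wikilinks', 'foo'], 'wiki')
# groups == {4: ['wiki'], 8: ['dokuwiki'], 9: ['wikilinks']}
```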

**Token and Frequency Tracking:**

During a search operation, Terms:
  - Collect all token IDs that match the term pattern (organized by token length)
  - Look up which entities contain those tokens
  - Aggregate the frequencies across all matching tokens
  - Map entity IDs to entity names for the final result

For example, searching for ''wiki*'' might find:
  * Token "wiki" (ID 42) appears 5 times on page "start" (ID 10)
  * Token "wikitext" (ID 87) appears 3 times on page "start" (ID 10)
  * Term result: "start" matches with total frequency 8
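The aggregation steps above can be sketched as follows. This is an illustrative Python sketch with hypothetical data structures, not the actual Term implementation.

```python
def aggregate_term_frequencies(token_hits, entity_names):
    """Sum per-token frequencies into one total per entity (steps 2-4 above).

    token_hits:    {token_id: {entity_rid: frequency}}
    entity_names:  {entity_rid: entity name}
    """
    totals = {}
    for hits in token_hits.values():
        for rid, freq in hits.items():
            totals[rid] = totals.get(rid, 0) + freq
    return {entity_names[rid]: freq for rid, freq in totals.items()}

# tokens 42 ("wiki") and 87 ("wikitext") both occur on page 10 ("start")
result = aggregate_term_frequencies({42: {10: 5}, 87: {10: 3}}, {10: 'start'})
assert result == {'start': 8}
```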

**Validation:**

Terms are validated on creation:
  * Minimum length requirements are enforced (except for numeric terms)
  * Terms that are too short throw a SearchException
  * The base term (without wildcards) must meet the minimum token length configured in the Tokenizer

===== Indexes =====

Indexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies.

Indexes are row based. The line number is an important piece of information in the index. Lines are counted from zero and referred to as ''rid'' in the code.

Index files can be accessed through two classes:

  * \dokuwiki\Search\Index\FileIndex
  * \dokuwiki\Search\Index\MemoryIndex

Both classes expose the same API; the only difference is their way of accessing the data.

A FileIndex will read through the index file line by line without ever loading the full file into memory. Each modification is written directly back to the index.

The MemoryIndex loads the whole file into an internal array. Changes are only written back when the ''save()'' method is explicitly called. A MemoryIndex is faster but requires more memory.

Which class to use depends mostly on the size of the file.

Usually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.
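The MemoryIndex access pattern described above can be sketched like this. It is an illustrative Python sketch of the concept only (load everything, write back on an explicit ''save()''); the real classes are PHP and their exact API may differ.

```python
class MemoryIndexSketch:
    """Conceptual stand-in for MemoryIndex: all rows in memory, explicit save."""

    def __init__(self, path):
        self.path = path
        try:
            with open(path, encoding='utf-8') as fh:
                self.rows = fh.read().splitlines()
        except FileNotFoundError:
            self.rows = []  # a new, empty index

    def get(self, rid):
        # rows are addressed by their zero-based line number (rid)
        return self.rows[rid] if rid < len(self.rows) else ''

    def set(self, rid, value):
        # grow the row list so any rid is addressable
        self.rows.extend([''] * (rid + 1 - len(self.rows)))
        self.rows[rid] = value

    def save(self):
        # changes only reach disk here, unlike the write-through FileIndex
        with open(self.path, 'w', encoding='utf-8') as fh:
            fh.write('\n'.join(self.rows) + '\n')
```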

Within an index, two kinds of data can be stored per row:

  * A single value. Eg. an entity or a token
  * A list of tuples. Eg. a list of pageIDs and frequencies

The former is straightforward; it's a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'' where the key is usually the ''rid'' in another index.

==== Index Types ====

A Collection consists of 4 (or 2 for direct collections) index types:

The **entity** index lists the main entity the index will return as a result. entity.RID -> entity

The **token** index contains the tokens used to search (eg. words). token.RID -> token

The **frequency** index contains tuples of entity.RIDs and usage frequencies. token.RID -> entity.RID*frequency:...

The **reverse** index contains tuples of token.RIDs and usage frequencies. entity.RID -> token.RID*frequency:...

Direct collections only use entity and token index files, with entity.RID === token.RID.

==== Index File Splitting ====

To improve memory efficiency and access speed, a single token index can be split into multiple physical files using suffixes. This is particularly useful for indexes that would otherwise grow too large to handle efficiently.

When creating an index, you can specify a suffix parameter that gets appended to the base index name to create the actual filename. For example:

  * Base name: ''w'' (for word tokens)
  * Suffix: ''3'' (for 3-letter words)
  * Resulting file: ''w3.idx''

A common use case is splitting token indexes by word length. In a fulltext collection:

  * ''w3.idx'' - stores all 3-letter words
  * ''w4.idx'' - stores all 4-letter words
  * ''w5.idx'' - stores all 5-letter words
  * and so on...

When an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (eg. when a Term is using a wildcard).
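Discovering the splits of an index can be sketched as follows, assuming the ''<base><suffix>.idx'' filename scheme described above. This is an illustrative Python sketch; the function name is hypothetical and the real implementation uses ''max()'' on the index object instead.

```python
import os

def split_index_files(directory: str, base: str):
    """List all numeric suffixes in use for a split index,
    given the filename scheme '<base><suffix>.idx'."""
    suffixes = []
    for name in os.listdir(directory):
        if name.startswith(base) and name.endswith('.idx'):
            middle = name[len(base):-len('.idx')]
            if middle.isdigit():  # skip unrelated files like 'wfoo.idx'
                suffixes.append(int(middle))
    return sorted(suffixes)
```

A wildcard Term would then consult each ''w<n>.idx'' whose suffix is at least the length of its base term.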

==== Tuple Data Format ====

Tuple-based index rows store associations between keys (typically RIDs from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:

<code>
key*count:key*count:key*count
</code>

Where:
  * ''key'' - Usually the RID from another index (e.g., a page ID)
  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
  * '':'' - Separates individual tuples
  * ''*'' - Separates the key from its count within a tuple

**Example:** A frequency index row for a word might look like:
<code>
42*5:17*3:98*12
</code>

This means:
  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 3 times
  * Entity with RID 98 contains this word 12 times

Frequencies of 1 are not stored in the index. For example:

<code>
42*5:17:98
</code>

would be interpreted as:

  * Entity with RID 42 contains this word 5 times
  * Entity with RID 17 contains this word 1 time
  * Entity with RID 98 contains this word 1 time

The ''TupleOps'' class provides utility methods for working with tuple records:
  * ''updateTuple()'' - Insert or update a specific key->count pair
  * ''parseTuples()'' - Parse a record into an array of key->count associations
  * ''aggregateTupleCounts()'' - Sum all counts in a record
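Parsing and aggregating a record, including the implicit-count-of-1 rule, can be sketched like this. An illustrative Python sketch only, not the PHP ''TupleOps'' API.

```python
def parse_tuples(record: str) -> dict:
    """Parse 'key*count:key:key*count' into {key: count}; a tuple
    without a '*count' part has an implicit count of 1."""
    result = {}
    for part in record.split(':'):
        if not part:
            continue  # tolerate an empty record
        key, _, count = part.partition('*')
        result[int(key)] = int(count) if count else 1
    return result

def aggregate_tuple_counts(record: str) -> int:
    """Sum all counts in a record."""
    return sum(parse_tuples(record).values())

assert parse_tuples('42*5:17:98') == {42: 5, 17: 1, 98: 1}
assert aggregate_tuple_counts('42*5:17*3:98*12') == 20
```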

===== Locking =====

Only one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.

Indexes can be opened in write or readonly mode according to the acquired locks. However, managing locks has to be done outside the index, usually within a collection.

The ''Lock'' class is used to acquire the needed locks.
199