xref: /dokuwiki/inc/Search/concept.txt (revision 8ae94493764af24070e5e96ffb0e17c6afb6e411)
1596d5287SAndreas Gohr====== Search Indexing ======
2596d5287SAndreas Gohr
3*8ae94493SAndreas GohrThis document is meant to describe the various concepts behind the indexing mechanism. It's work in progress and will be moved to the wiki once the new index code is merged.
4596d5287SAndreas Gohr
5596d5287SAndreas Gohr===== Uses =====
6596d5287SAndreas Gohr
7596d5287SAndreas GohrThe indexing mechanism is meant to make information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future.
8596d5287SAndreas Gohr
9596d5287SAndreas Gohr===== Collections =====
10596d5287SAndreas Gohr
11596d5287SAndreas GohrA collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a usecase covered by a collection.
12596d5287SAndreas Gohr
13596d5287SAndreas GohrPlease note: because index has a specific meaning in our context (see below) you should avoid using that word, when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.
14596d5287SAndreas Gohr
15596d5287SAndreas GohrThere are basically two different complexities of collections
16596d5287SAndreas Gohr
17596d5287SAndreas Gohr  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
18*8ae94493SAndreas Gohr  * lookup collections - Basically the same as frequency collection, but each token appears only once per entity thus all frequencies are 1. Searches do not care for the frequency but are only interested if a token appears for the entity or not
19*8ae94493SAndreas Gohr  * direct collections - Here a 1:1 relation between the entity and a token exists. For example a page has exactly one title.
20596d5287SAndreas Gohr
21596d5287SAndreas GohrA collection works on four indexes:
22596d5287SAndreas Gohr
23596d5287SAndreas Gohr  * entity - This is the main entity that will be the result of a search. Eg. a page
24596d5287SAndreas Gohr  * token - This is the actual information strewn across the entities. Eg. words
25*8ae94493SAndreas Gohr    * can be split into several files (usally based on token length)
26596d5287SAndreas Gohr  * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page
27*8ae94493SAndreas Gohr    * if tokens are split into several files, this is too
28596d5287SAndreas Gohr  * reverse - This records which tokens are on a specific entities. This is mostly used for internal index cleanup. Eg. page-words
29596d5287SAndreas Gohr
30*8ae94493SAndreas GohrThe latter two index files do not exist for direct collections
31596d5287SAndreas Gohr
32596d5287SAndreas GohrCollections can be searched for one or more ''terms'' via a CollectionSearch class. In the most simple case a term is a single token. But it is also possible to use wildcards signifiers ''*'' at the start and end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity.
33596d5287SAndreas Gohr
34*8ae94493SAndreas Gohr==== List of Collections =====
35*8ae94493SAndreas Gohr
36*8ae94493SAndreas Gohr^ Name                   ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^ Uses split Tokens? ^
37*8ae94493SAndreas Gohr| FullText               | page   | w*                    | i*                    | pageword              | yes                |
38*8ae94493SAndreas Gohr| MetaRelationMedia      | page   | relation_media_w      | relation_media_i      | relation_media_p      | no                 |
39*8ae94493SAndreas Gohr| MetaRelationReferences | page   | relation_references_w | relation_references_i | relation_references_p | no                 |
40*8ae94493SAndreas Gohr
41*8ae94493SAndreas Gohr
42*8ae94493SAndreas Gohr
43*8ae94493SAndreas Gohr
44*8ae94493SAndreas Gohr==== Terms ====
45*8ae94493SAndreas Gohr
46*8ae94493SAndreas GohrA ''Term'' is a representation of a single search query component that can match one or more tokens in an index. The Term class is used by CollectionSearch implementations to handle wildcard searches and track which entities contain matching tokens.
47*8ae94493SAndreas Gohr
48*8ae94493SAndreas Gohr**Wildcard Support:**
49*8ae94493SAndreas Gohr
50*8ae94493SAndreas GohrTerms can include wildcards using the ''*'' character:
51*8ae94493SAndreas Gohr  * ''wiki'' - matches exactly "wiki"
52*8ae94493SAndreas Gohr  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
53*8ae94493SAndreas Gohr  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
54*8ae94493SAndreas Gohr  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")
55*8ae94493SAndreas Gohr
56*8ae94493SAndreas GohrThe Term class internally handles these wildcards by:
57*8ae94493SAndreas Gohr  * Storing the original term with wildcards
58*8ae94493SAndreas Gohr  * Extracting the base term (without wildcard characters)
59*8ae94493SAndreas Gohr  * Converting wildcards into a regular expression pattern for matching
60*8ae94493SAndreas Gohr  * Tracking which type of wildcard is used (none, start, end, or both)
61*8ae94493SAndreas Gohr
62*8ae94493SAndreas Gohr**Length-Based Organization:**
63*8ae94493SAndreas Gohr
64*8ae94493SAndreas GohrTerms organize their matching tokens by length. This is crucial for working with split indexes:
65*8ae94493SAndreas Gohr  * A term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks)
66*8ae94493SAndreas Gohr  * Each length group can be looked up in the corresponding suffixed token index
67*8ae94493SAndreas Gohr  * This allows efficient searching across split indexes without loading irrelevant files
68*8ae94493SAndreas Gohr
69*8ae94493SAndreas Gohr**Token and Frequency Tracking:**
70*8ae94493SAndreas Gohr
71*8ae94493SAndreas GohrDuring a search operation, Terms:
72*8ae94493SAndreas Gohr  1. Collect all token IDs that match the term pattern (organized by token length)
73*8ae94493SAndreas Gohr  2. Look up which entities contain those tokens
74*8ae94493SAndreas Gohr  3. Aggregate the frequencies across all matching tokens
75*8ae94493SAndreas Gohr  4. Map entity IDs to entity names for the final result
76*8ae94493SAndreas Gohr
77*8ae94493SAndreas GohrFor example, searching for ''wiki*'' might find:
78*8ae94493SAndreas Gohr  * Token "wiki" (ID 42) appears 5 times on page "start" (ID 10)
79*8ae94493SAndreas Gohr  * Token "wikitext" (ID 87) appears 3 times on page "start" (ID 10)
80*8ae94493SAndreas Gohr  * Term result: "start" matches with total frequency 8
81*8ae94493SAndreas Gohr
82*8ae94493SAndreas Gohr**Validation:**
83*8ae94493SAndreas Gohr
84*8ae94493SAndreas GohrTerms are validated on creation:
85*8ae94493SAndreas Gohr  * Minimum length requirements are enforced (except for numeric terms)
86*8ae94493SAndreas Gohr  * Terms that are too short throw a SearchException
87*8ae94493SAndreas Gohr  * The base term (without wildcards) must meet the minimum token length configured in the Tokenizer
88596d5287SAndreas Gohr
89596d5287SAndreas Gohr===== Indexes ======
90596d5287SAndreas Gohr
91596d5287SAndreas GohrIndexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies.
92596d5287SAndreas Gohr
93596d5287SAndreas GohrIndexes are row based. The line number is important information of the index. The lines are counted from zero and referred to as ''rid'' in the code.
94596d5287SAndreas Gohr
95596d5287SAndreas GohrIndex files can be accessed through two classes:
96596d5287SAndreas Gohr
97596d5287SAndreas Gohr  * \dokuwiki\Search\Index\FileIndex
98596d5287SAndreas Gohr  * \dokuwiki\Search\Index\MemoryIndex
99596d5287SAndreas Gohr
100596d5287SAndreas GohrBoth classes expose the same API, the only difference is their way of accessing the data.
101596d5287SAndreas Gohr
102596d5287SAndreas GohrA FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.
103596d5287SAndreas Gohr
104596d5287SAndreas GohrThe MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.
105596d5287SAndreas Gohr
106596d5287SAndreas GohrWhich method to use depends mostly on the size of the file.
107596d5287SAndreas Gohr
108596d5287SAndreas GohrUsually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.
109596d5287SAndreas Gohr
110596d5287SAndreas GohrWithin an index two kinds of data can be stored per row:
111596d5287SAndreas Gohr
112596d5287SAndreas Gohr  * A single value. Eg. an entity or a token
113596d5287SAndreas Gohr  * A list of tuples. Eg. a list of pageIDs and frequencies
114596d5287SAndreas Gohr
115596d5287SAndreas GohrThe former is straight forward, it's a simple ''rid -> value'' store.  The latter maps to ''rid -> [key -> value, ...]'' where key is usally the ''rid'' in another index.
116596d5287SAndreas Gohr
117*8ae94493SAndreas Gohr==== Index Types ====
118*8ae94493SAndreas Gohr
119*8ae94493SAndreas GohrA Collection consists of 4 (or 2 for direct collections) index types:
120*8ae94493SAndreas Gohr
121*8ae94493SAndreas GohrThe **entity** index lists the main entity the index will return as a result. entity.RID -> entity
122*8ae94493SAndreas Gohr
123*8ae94493SAndreas GohrThe **token** index contains the tokens used to search (eg. words). token.RID -> token
124*8ae94493SAndreas Gohr
125*8ae94493SAndreas GohrThe **frequency** index contains tuples of entity.RIDs and usage frequencies. token.RID -> entity.RID*frequency:...
126*8ae94493SAndreas Gohr
127*8ae94493SAndreas GohrThe **reverse** index contains tuples of token.RIDs and usage frequencies. entity.RID -> token.RID*frequency:...
128*8ae94493SAndreas Gohr
129*8ae94493SAndreas GohrDirect collections only use entity and token index files with entity.RID === token.RID
130*8ae94493SAndreas Gohr
131*8ae94493SAndreas Gohr
132*8ae94493SAndreas Gohr==== Index File Splitting ====
133*8ae94493SAndreas Gohr
134*8ae94493SAndreas GohrTo improve memory efficiency and access speed, a single token index can be split into multiple physical files using suffixes. This is particularly useful for indexes that would otherwise grow too large to handle efficiently.
135*8ae94493SAndreas Gohr
136*8ae94493SAndreas GohrWhen creating an index, you can specify a suffix parameter that gets appended to the base index name to create the actual filename. For example:
137*8ae94493SAndreas Gohr
138*8ae94493SAndreas Gohr  * Base name: ''w'' (for word tokens)
139*8ae94493SAndreas Gohr  * Suffix: ''3'' (for 3-letter words)
140*8ae94493SAndreas Gohr  * Resulting file: ''w3.idx''
141*8ae94493SAndreas Gohr
142*8ae94493SAndreas GohrA common use case is splitting token indexes by word length. In a fulltext collection:
143*8ae94493SAndreas Gohr
144*8ae94493SAndreas Gohr  * ''w3.idx'' - stores all 3-letter words
145*8ae94493SAndreas Gohr  * ''w4.idx'' - stores all 4-letter words
146*8ae94493SAndreas Gohr  * ''w5.idx'' - stores all 5-letter words
147*8ae94493SAndreas Gohr  * and so on...
148*8ae94493SAndreas Gohr
149*8ae94493SAndreas GohrWhen an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (eg. when a Term is using a wildcard).
150*8ae94493SAndreas Gohr
151*8ae94493SAndreas Gohr==== Tuple Data Format ====
152*8ae94493SAndreas Gohr
153*8ae94493SAndreas GohrTuple-based index rows store associations between keys (typically RIDs from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:
154*8ae94493SAndreas Gohr
155*8ae94493SAndreas Gohr<code>
156*8ae94493SAndreas Gohrkey*count:key*count:key*count
157*8ae94493SAndreas Gohr</code>
158*8ae94493SAndreas Gohr
159*8ae94493SAndreas GohrWhere:
160*8ae94493SAndreas Gohr  * ''key'' - Usually the RID from another index (e.g., a page ID)
161*8ae94493SAndreas Gohr  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
162*8ae94493SAndreas Gohr  * '':'' - Separates individual tuples
163*8ae94493SAndreas Gohr  * ''*'' - Separates the key from its count within a tuple
164*8ae94493SAndreas Gohr
165*8ae94493SAndreas Gohr**Example:** A frequency index row for a word might look like:
166*8ae94493SAndreas Gohr<code>
167*8ae94493SAndreas Gohr42*5:17*3:98*12
168*8ae94493SAndreas Gohr</code>
169*8ae94493SAndreas Gohr
170*8ae94493SAndreas GohrThis means:
171*8ae94493SAndreas Gohr  * Entity with RID 42 contains this word 5 times
172*8ae94493SAndreas Gohr  * Entity with RID 17 contains this word 3 times
173*8ae94493SAndreas Gohr  * Entity with RID 98 contains this word 12 times
174*8ae94493SAndreas Gohr
175*8ae94493SAndreas GohrFrequencies of 1 are not stored in the index. For example:
176*8ae94493SAndreas Gohr
177*8ae94493SAndreas Gohr<code>
178*8ae94493SAndreas Gohr42*5:17:98
179*8ae94493SAndreas Gohr</code>
180*8ae94493SAndreas Gohr
181*8ae94493SAndreas GohrIn the above case would be interpreted as
182*8ae94493SAndreas Gohr
183*8ae94493SAndreas Gohr  * Entity with RID 42 contains this word 5 times
184*8ae94493SAndreas Gohr  * Entity with RID 17 contains this word 1 times
185*8ae94493SAndreas Gohr  * Entity with RID 98 contains this word 1 times
186*8ae94493SAndreas Gohr
187*8ae94493SAndreas GohrThe ''TupleOps'' class provides utility methods for working with tuple records:
188*8ae94493SAndreas Gohr  * ''updateTuple()'' - Insert or update a specific key->count pair
189*8ae94493SAndreas Gohr  * ''parseTuples()'' - Parse a record into an array of key->count associations
190*8ae94493SAndreas Gohr  * ''aggregateTupleCounts()'' - Sum all counts in a record
191*8ae94493SAndreas Gohr
192596d5287SAndreas Gohr===== Locking =====
193596d5287SAndreas Gohr
194596d5287SAndreas GohrOnly one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.
195596d5287SAndreas Gohr
196596d5287SAndreas GohrIndexes can be read in write or readonly mode according to the acquired locks. However, managing locks has to be done outside the index. Usually within a collection.
197596d5287SAndreas Gohr
198596d5287SAndreas GohrThe ''Lock'' class is used to acquire the needed locks.
199