xref: /dokuwiki/inc/Search/concept.txt (revision f2bbffb509f37291c31c023167b8aeb383079d6d)
1596d5287SAndreas Gohr====== Search Indexing ======
2596d5287SAndreas Gohr
38ae94493SAndreas GohrThis document is meant to describe the various concepts behind the indexing mechanism. It's work in progress and will be moved to the wiki once the new index code is merged.
4596d5287SAndreas Gohr
5596d5287SAndreas Gohr===== Uses =====
6596d5287SAndreas Gohr
7596d5287SAndreas GohrThe indexing mechanism is meant to make information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future.
8596d5287SAndreas Gohr
9596d5287SAndreas Gohr===== Collections =====
10596d5287SAndreas Gohr
11596d5287SAndreas GohrA collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a usecase covered by a collection.
12596d5287SAndreas Gohr
13596d5287SAndreas GohrPlease note: because index has a specific meaning in our context (see below) you should avoid using that word, when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.
14596d5287SAndreas Gohr
15*f2bbffb5SAndreas GohrCollections have two independent properties:
16*f2bbffb5SAndreas Gohr
17*f2bbffb5SAndreas Gohr=== Collection Types ===
18*f2bbffb5SAndreas Gohr
19*f2bbffb5SAndreas GohrThe collection type determines how tokens relate to entities:
20596d5287SAndreas Gohr
21596d5287SAndreas Gohr  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
22*f2bbffb5SAndreas Gohr  * lookup collections - Basically the same as frequency collections, but each token appears only once per entity thus all frequencies are 1. Searches do not care for the frequency but are only interested if a token appears for the entity or not. Internally the same mechanisms are used; only the way tokens are processed on input differs (deduplication instead of counting).
238ae94493SAndreas Gohr  * direct collections - Here a 1:1 relation between the entity and a token exists. For example a page has exactly one title.
24596d5287SAndreas Gohr
25*f2bbffb5SAndreas Gohr=== Index File Splitting ===
26*f2bbffb5SAndreas Gohr
27*f2bbffb5SAndreas GohrIndependently of the collection type, a collection can use split or non-split token indexes. When split, the token and frequency indexes are distributed across multiple files based on token length. This also affects the reverse index format (see below). Any collection type can use either mode.
28*f2bbffb5SAndreas Gohr
29*f2bbffb5SAndreas Gohr=== Indexes ===
30*f2bbffb5SAndreas Gohr
31596d5287SAndreas GohrA collection works on four indexes:
32596d5287SAndreas Gohr
33596d5287SAndreas Gohr  * entity - This is the main entity that will be the result of a search. Eg. a page
34596d5287SAndreas Gohr  * token - This is the actual information strewn across the entities. Eg. words
35*f2bbffb5SAndreas Gohr    * can be split into several files based on token length (see Index File Splitting below)
36596d5287SAndreas Gohr  * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page
378ae94493SAndreas Gohr    * if tokens are split into several files, this is too
38*f2bbffb5SAndreas Gohr  * reverse - This records which tokens are assigned to a specific entity. This is used for updating: when an entity is re-indexed, the old reverse record provides the list of tokens to clean up. The format depends on whether the collection uses split indexes or not.
39596d5287SAndreas Gohr
408ae94493SAndreas GohrThe latter two index files do not exist for direct collections
41596d5287SAndreas Gohr
42596d5287SAndreas GohrCollections can be searched for one or more ''terms'' via a CollectionSearch class. In the most simple case a term is a single token. But it is also possible to use wildcards signifiers ''*'' at the start and end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity.
43596d5287SAndreas Gohr
448ae94493SAndreas Gohr==== List of Collections =====
458ae94493SAndreas Gohr
46*f2bbffb5SAndreas Gohr^ Name                   ^ Type      ^ Split? ^ Entity ^ Token                 ^ Frequency             ^ Reverse               ^
47*f2bbffb5SAndreas Gohr| FullText               | frequency | yes    | page   | w*                    | i*                    | pageword              |
48*f2bbffb5SAndreas Gohr| MetaRelationMedia      | lookup    | no     | page   | relation_media_w      | relation_media_i      | relation_media_p      |
49*f2bbffb5SAndreas Gohr| MetaRelationReferences | lookup    | no     | page   | relation_references_w | relation_references_i | relation_references_p |
508ae94493SAndreas Gohr
518ae94493SAndreas Gohr
528ae94493SAndreas Gohr
538ae94493SAndreas Gohr
548ae94493SAndreas Gohr==== Terms ====
558ae94493SAndreas Gohr
568ae94493SAndreas GohrA ''Term'' is a representation of a single search query component that can match one or more tokens in an index. The Term class is used by CollectionSearch implementations to handle wildcard searches and track which entities contain matching tokens.
578ae94493SAndreas Gohr
588ae94493SAndreas Gohr**Wildcard Support:**
598ae94493SAndreas Gohr
608ae94493SAndreas GohrTerms can include wildcards using the ''*'' character:
618ae94493SAndreas Gohr  * ''wiki'' - matches exactly "wiki"
628ae94493SAndreas Gohr  * ''wiki*'' - matches tokens starting with "wiki" (e.g., "wiki", "wikitext", "wikipedia")
638ae94493SAndreas Gohr  * ''*wiki'' - matches tokens ending with "wiki" (e.g., "wiki", "dokuwiki")
648ae94493SAndreas Gohr  * ''*wiki*'' - matches tokens containing "wiki" anywhere (e.g., "wiki", "dokuwiki", "wikitext")
658ae94493SAndreas Gohr
668ae94493SAndreas GohrThe Term class internally handles these wildcards by:
678ae94493SAndreas Gohr  * Storing the original term with wildcards
688ae94493SAndreas Gohr  * Extracting the base term (without wildcard characters)
698ae94493SAndreas Gohr  * Converting wildcards into a regular expression pattern for matching
708ae94493SAndreas Gohr  * Tracking which type of wildcard is used (none, start, end, or both)
718ae94493SAndreas Gohr
728ae94493SAndreas Gohr**Length-Based Organization:**
738ae94493SAndreas Gohr
748ae94493SAndreas GohrTerms organize their matching tokens by length. This is crucial for working with split indexes:
757f394dd6SAndreas Gohr  * A term like ''*wiki*'' might match 4-letter words (wiki), 8-letter words (dokuwiki), and 9-letter words (wikilinks) but never 3-letter words, because the base term "wiki" is 4 letters long.
768ae94493SAndreas Gohr  * Each length group can be looked up in the corresponding suffixed token index
778ae94493SAndreas Gohr  * This allows efficient searching across split indexes without loading irrelevant files
788ae94493SAndreas Gohr
798ae94493SAndreas Gohr**Token and Frequency Tracking:**
808ae94493SAndreas Gohr
818ae94493SAndreas GohrDuring a search operation, Terms:
828ae94493SAndreas Gohr  1. Collect all token IDs that match the term pattern (organized by token length)
838ae94493SAndreas Gohr  2. Look up which entities contain those tokens
848ae94493SAndreas Gohr  3. Aggregate the frequencies across all matching tokens
858ae94493SAndreas Gohr  4. Map entity IDs to entity names for the final result
868ae94493SAndreas Gohr
878ae94493SAndreas GohrFor example, searching for ''wiki*'' might find:
888ae94493SAndreas Gohr  * Token "wiki" (ID 42) appears 5 times on page "start" (ID 10)
898ae94493SAndreas Gohr  * Token "wikitext" (ID 87) appears 3 times on page "start" (ID 10)
908ae94493SAndreas Gohr  * Term result: "start" matches with total frequency 8
918ae94493SAndreas Gohr
928ae94493SAndreas Gohr**Validation:**
938ae94493SAndreas Gohr
948ae94493SAndreas GohrTerms are validated on creation:
958ae94493SAndreas Gohr  * Minimum length requirements are enforced (except for numeric terms)
968ae94493SAndreas Gohr  * Terms that are too short throw a SearchException
978ae94493SAndreas Gohr  * The base term (without wildcards) must meet the minimum token length configured in the Tokenizer
98596d5287SAndreas Gohr
99596d5287SAndreas Gohr===== Indexes ======
100596d5287SAndreas Gohr
101596d5287SAndreas GohrIndexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies.
102596d5287SAndreas Gohr
103596d5287SAndreas GohrIndexes are row based. The line number is important information of the index. The lines are counted from zero and referred to as ''rid'' in the code.
104596d5287SAndreas Gohr
105596d5287SAndreas GohrIndex files can be accessed through two classes:
106596d5287SAndreas Gohr
107596d5287SAndreas Gohr  * \dokuwiki\Search\Index\FileIndex
108596d5287SAndreas Gohr  * \dokuwiki\Search\Index\MemoryIndex
109596d5287SAndreas Gohr
110596d5287SAndreas GohrBoth classes expose the same API, the only difference is their way of accessing the data.
111596d5287SAndreas Gohr
112596d5287SAndreas GohrA FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.
113596d5287SAndreas Gohr
114596d5287SAndreas GohrThe MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.
115596d5287SAndreas Gohr
116596d5287SAndreas GohrWhich method to use depends mostly on the size of the file.
117596d5287SAndreas Gohr
118596d5287SAndreas GohrUsually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.
119596d5287SAndreas Gohr
120596d5287SAndreas GohrWithin an index two kinds of data can be stored per row:
121596d5287SAndreas Gohr
122596d5287SAndreas Gohr  * A single value. Eg. an entity or a token
123596d5287SAndreas Gohr  * A list of tuples. Eg. a list of pageIDs and frequencies
124596d5287SAndreas Gohr
125596d5287SAndreas GohrThe former is straight forward, it's a simple ''rid -> value'' store.  The latter maps to ''rid -> [key -> value, ...]'' where key is usally the ''rid'' in another index.
126596d5287SAndreas Gohr
1278ae94493SAndreas Gohr==== Index Types ====
1288ae94493SAndreas Gohr
1298ae94493SAndreas GohrA Collection consists of 4 (or 2 for direct collections) index types:
1308ae94493SAndreas Gohr
1318ae94493SAndreas GohrThe **entity** index lists the main entity the index will return as a result. entity.RID -> entity
1328ae94493SAndreas Gohr
1338ae94493SAndreas GohrThe **token** index contains the tokens used to search (eg. words). token.RID -> token
1348ae94493SAndreas Gohr
1358ae94493SAndreas GohrThe **frequency** index contains tuples of entity.RIDs and usage frequencies. token.RID -> entity.RID*frequency:...
1368ae94493SAndreas Gohr
137*f2bbffb5SAndreas GohrThe **reverse** index records which tokens are assigned to each entity. Its format depends on whether the collection uses split indexes:
138*f2bbffb5SAndreas Gohr
139*f2bbffb5SAndreas Gohr  * **Split collections**: Each entry is a ''tokenLength*tokenId'' pair because the token length is needed to locate the correct split index file. Format: ''tokenLength*tokenId:tokenLength*tokenId:...''
140*f2bbffb5SAndreas Gohr  * **Non-split collections**: Only the token ID is needed since all tokens live in a single file. Format: ''tokenId:tokenId:...''
141*f2bbffb5SAndreas Gohr
142*f2bbffb5SAndreas GohrThe reverse index does not store frequencies. It is used internally when updating an entity: the old reverse record provides the set of previously assigned tokens so they can be cleaned up (see the addEntity description in the Collection Types section above).
1438ae94493SAndreas Gohr
1448ae94493SAndreas GohrDirect collections only use entity and token index files with entity.RID === token.RID
1458ae94493SAndreas Gohr
1468ae94493SAndreas Gohr
1478ae94493SAndreas Gohr==== Index File Splitting ====
1488ae94493SAndreas Gohr
149*f2bbffb5SAndreas GohrTo improve memory efficiency and access speed, a collection can split its token and frequency indexes into multiple physical files using suffixes based on token length. This is configured per collection and is independent of the collection type (frequency or lookup).
1508ae94493SAndreas Gohr
151*f2bbffb5SAndreas GohrWhen splitting is enabled, both the token index and the frequency index are distributed across files. A suffix parameter is appended to the base index name to create the actual filename. For example:
1528ae94493SAndreas Gohr
1538ae94493SAndreas Gohr  * Base name: ''w'' (for word tokens)
154*f2bbffb5SAndreas Gohr  * Suffix: ''3'' (for 3-letter tokens)
1558ae94493SAndreas Gohr  * Resulting file: ''w3.idx''
1568ae94493SAndreas Gohr
157*f2bbffb5SAndreas GohrIn a fulltext collection with splitting enabled:
1588ae94493SAndreas Gohr
159*f2bbffb5SAndreas Gohr  * ''w3.idx'' / ''i3.idx'' - stores all 3-letter tokens and their frequencies
160*f2bbffb5SAndreas Gohr  * ''w4.idx'' / ''i4.idx'' - stores all 4-letter tokens and their frequencies
161*f2bbffb5SAndreas Gohr  * ''w5.idx'' / ''i5.idx'' - stores all 5-letter tokens and their frequencies
1628ae94493SAndreas Gohr  * and so on...
1638ae94493SAndreas Gohr
164*f2bbffb5SAndreas GohrWhen splitting is disabled, a single file is used for each index (eg. ''relation_media_w.idx'').
165*f2bbffb5SAndreas Gohr
1668ae94493SAndreas GohrWhen an index uses suffixes, the ''max()'' method can be used to find the highest numeric suffix currently in use. This is useful for operations that need to iterate over all splits of an index (eg. when a Term is using a wildcard).
1678ae94493SAndreas Gohr
1688ae94493SAndreas Gohr==== Tuple Data Format ====
1698ae94493SAndreas Gohr
1708ae94493SAndreas GohrTuple-based index rows store associations between keys (typically RIDs from another index) and numeric values (typically frequency counts). The internal format uses a compact string representation:
1718ae94493SAndreas Gohr
1728ae94493SAndreas Gohr<code>
1738ae94493SAndreas Gohrkey*count:key*count:key*count
1748ae94493SAndreas Gohr</code>
1758ae94493SAndreas Gohr
1768ae94493SAndreas GohrWhere:
1778ae94493SAndreas Gohr  * ''key'' - Usually the RID from another index (e.g., a page ID)
1788ae94493SAndreas Gohr  * ''count'' - A numeric value (e.g., how many times a word appears on that page)
1798ae94493SAndreas Gohr  * '':'' - Separates individual tuples
1808ae94493SAndreas Gohr  * ''*'' - Separates the key from its count within a tuple
1818ae94493SAndreas Gohr
1828ae94493SAndreas Gohr**Example:** A frequency index row for a word might look like:
1838ae94493SAndreas Gohr<code>
1848ae94493SAndreas Gohr42*5:17*3:98*12
1858ae94493SAndreas Gohr</code>
1868ae94493SAndreas Gohr
1878ae94493SAndreas GohrThis means:
1888ae94493SAndreas Gohr  * Entity with RID 42 contains this word 5 times
1898ae94493SAndreas Gohr  * Entity with RID 17 contains this word 3 times
1908ae94493SAndreas Gohr  * Entity with RID 98 contains this word 12 times
1918ae94493SAndreas Gohr
1928ae94493SAndreas GohrFrequencies of 1 are not stored in the index. For example:
1938ae94493SAndreas Gohr
1948ae94493SAndreas Gohr<code>
1958ae94493SAndreas Gohr42*5:17:98
1968ae94493SAndreas Gohr</code>
1978ae94493SAndreas Gohr
1988ae94493SAndreas GohrIn the above case would be interpreted as
1998ae94493SAndreas Gohr
2008ae94493SAndreas Gohr  * Entity with RID 42 contains this word 5 times
2018ae94493SAndreas Gohr  * Entity with RID 17 contains this word 1 times
2028ae94493SAndreas Gohr  * Entity with RID 98 contains this word 1 times
2038ae94493SAndreas Gohr
2048ae94493SAndreas GohrThe ''TupleOps'' class provides utility methods for working with tuple records:
2058ae94493SAndreas Gohr  * ''updateTuple()'' - Insert or update a specific key->count pair
2068ae94493SAndreas Gohr  * ''parseTuples()'' - Parse a record into an array of key->count associations
2078ae94493SAndreas Gohr  * ''aggregateTupleCounts()'' - Sum all counts in a record
2088ae94493SAndreas Gohr
209596d5287SAndreas Gohr===== Locking =====
210596d5287SAndreas Gohr
211596d5287SAndreas GohrOnly one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.
212596d5287SAndreas Gohr
213596d5287SAndreas GohrIndexes can be read in write or readonly mode according to the acquired locks. However, managing locks has to be done outside the index. Usually within a collection.
214596d5287SAndreas Gohr
215596d5287SAndreas GohrThe ''Lock'' class is used to acquire the needed locks.
216