xref: /dokuwiki/inc/Search/concept.txt (revision 596d5287d7a816d606ef4153ef9e0f4704bf8f73)
1*596d5287SAndreas Gohr====== Search Indexing ======
2*596d5287SAndreas Gohr
3*596d5287SAndreas GohrThis document is meant to describe the various concepts behind the indexing mechanism. It's work in progress an will be moved to the wiki once the new index code is merged.
4*596d5287SAndreas Gohr
5*596d5287SAndreas Gohr===== Uses =====
6*596d5287SAndreas Gohr
7*596d5287SAndreas GohrThe indexing mechanism is meant to make information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future.
8*596d5287SAndreas Gohr
9*596d5287SAndreas Gohr===== Collections =====
10*596d5287SAndreas Gohr
11*596d5287SAndreas GohrA collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a usecase covered by a collection.
12*596d5287SAndreas Gohr
13*596d5287SAndreas GohrPlease note: because index has a specific meaning in our context (see below) you should avoid using that word, when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.
14*596d5287SAndreas Gohr
15*596d5287SAndreas GohrThere are basically two different complexities of collections
16*596d5287SAndreas Gohr
17*596d5287SAndreas Gohr  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
18*596d5287SAndreas Gohr  * lookup collections - A token only appears once per entity. Search results are interested in looking up entities only. This would be used for the meta data search for example. FIXME not implemented yet
19*596d5287SAndreas Gohr
20*596d5287SAndreas GohrA collection works on four indexes:
21*596d5287SAndreas Gohr
22*596d5287SAndreas Gohr  * entity - This is the main entity that will be the result of a search. Eg. a page
23*596d5287SAndreas Gohr  * token - This is the actual information strewn across the entities. Eg. words
24*596d5287SAndreas Gohr  * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page
25*596d5287SAndreas Gohr  * reverse - This records which tokens are on a specific entities. This is mostly used for internal index cleanup. Eg. page-words
26*596d5287SAndreas Gohr
27*596d5287SAndreas GohrFIXME the above may not be true for all collections
28*596d5287SAndreas Gohr
29*596d5287SAndreas GohrCollections can be searched for one or more ''terms'' via a CollectionSearch class. In the most simple case a term is a single token. But it is also possible to use wildcards signifiers ''*'' at the start and end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity.
30*596d5287SAndreas Gohr
31*596d5287SAndreas GohrFIXME above will most likely change once we have non-frequency collections
32*596d5287SAndreas Gohr
33*596d5287SAndreas Gohr===== Indexes ======
34*596d5287SAndreas Gohr
35*596d5287SAndreas GohrIndexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies.
36*596d5287SAndreas Gohr
37*596d5287SAndreas GohrIndexes are row based. The line number is important information of the index. The lines are counted from zero and referred to as ''rid'' in the code.
38*596d5287SAndreas Gohr
39*596d5287SAndreas GohrIndex files can be accessed through two classes:
40*596d5287SAndreas Gohr
41*596d5287SAndreas Gohr  * \dokuwiki\Search\Index\FileIndex
42*596d5287SAndreas Gohr  * \dokuwiki\Search\Index\MemoryIndex
43*596d5287SAndreas Gohr
44*596d5287SAndreas GohrBoth classes expose the same API, the only difference is their way of accessing the data.
45*596d5287SAndreas Gohr
46*596d5287SAndreas GohrA FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.
47*596d5287SAndreas Gohr
48*596d5287SAndreas GohrThe MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.
49*596d5287SAndreas Gohr
50*596d5287SAndreas GohrWhich method to use depends mostly on the size of the file.
51*596d5287SAndreas Gohr
52*596d5287SAndreas GohrUsually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.
53*596d5287SAndreas Gohr
54*596d5287SAndreas GohrWithin an index two kinds of data can be stored per row:
55*596d5287SAndreas Gohr
56*596d5287SAndreas Gohr  * A single value. Eg. an entity or a token
57*596d5287SAndreas Gohr  * A list of tuples. Eg. a list of pageIDs and frequencies
58*596d5287SAndreas Gohr
59*596d5287SAndreas GohrThe former is straight forward, it's a simple ''rid -> value'' store.  The latter maps to ''rid -> [key -> value, ...]'' where key is usally the ''rid'' in another index.
60*596d5287SAndreas Gohr
61*596d5287SAndreas Gohr===== Locking =====
62*596d5287SAndreas Gohr
63*596d5287SAndreas GohrOnly one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.
64*596d5287SAndreas Gohr
65*596d5287SAndreas GohrIndexes can be read in write or readonly mode according to the acquired locks. However, managing locks has to be done outside the index. Usually within a collection.
66*596d5287SAndreas Gohr
67*596d5287SAndreas GohrThe ''Lock'' class is used to acquire the needed locks.
68