1*596d5287SAndreas Gohr====== Search Indexing ====== 2*596d5287SAndreas Gohr 3*596d5287SAndreas GohrThis document is meant to describe the various concepts behind the indexing mechanism. It's work in progress an will be moved to the wiki once the new index code is merged. 4*596d5287SAndreas Gohr 5*596d5287SAndreas Gohr===== Uses ===== 6*596d5287SAndreas Gohr 7*596d5287SAndreas GohrThe indexing mechanism is meant to make information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future. 8*596d5287SAndreas Gohr 9*596d5287SAndreas Gohr===== Collections ===== 10*596d5287SAndreas Gohr 11*596d5287SAndreas GohrA collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a usecase covered by a collection. 12*596d5287SAndreas Gohr 13*596d5287SAndreas GohrPlease note: because index has a specific meaning in our context (see below) you should avoid using that word, when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection. 14*596d5287SAndreas Gohr 15*596d5287SAndreas GohrThere are basically two different complexities of collections 16*596d5287SAndreas Gohr 17*596d5287SAndreas Gohr * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case. 18*596d5287SAndreas Gohr * lookup collections - A token only appears once per entity. Search results are interested in looking up entities only. This would be used for the meta data search for example. FIXME not implemented yet 19*596d5287SAndreas Gohr 20*596d5287SAndreas GohrA collection works on four indexes: 21*596d5287SAndreas Gohr 22*596d5287SAndreas Gohr * entity - This is the main entity that will be the result of a search. Eg. a page 23*596d5287SAndreas Gohr * token - This is the actual information strewn across the entities. Eg. words 24*596d5287SAndreas Gohr * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page 25*596d5287SAndreas Gohr * reverse - This records which tokens are on a specific entities. This is mostly used for internal index cleanup. Eg. page-words 26*596d5287SAndreas Gohr 27*596d5287SAndreas GohrFIXME the above may not be true for all collections 28*596d5287SAndreas Gohr 29*596d5287SAndreas GohrCollections can be searched for one or more ''terms'' via a CollectionSearch class. In the most simple case a term is a single token. But it is also possible to use wildcards signifiers ''*'' at the start and end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity. 30*596d5287SAndreas Gohr 31*596d5287SAndreas GohrFIXME above will most likely change once we have non-frequency collections 32*596d5287SAndreas Gohr 33*596d5287SAndreas Gohr===== Indexes ====== 34*596d5287SAndreas Gohr 35*596d5287SAndreas GohrIndexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies. 36*596d5287SAndreas Gohr 37*596d5287SAndreas GohrIndexes are row based. The line number is important information of the index. The lines are counted from zero and referred to as ''rid'' in the code. 38*596d5287SAndreas Gohr 39*596d5287SAndreas GohrIndex files can be accessed through two classes: 40*596d5287SAndreas Gohr 41*596d5287SAndreas Gohr * \dokuwiki\Search\Index\FileIndex 42*596d5287SAndreas Gohr * \dokuwiki\Search\Index\MemoryIndex 43*596d5287SAndreas Gohr 44*596d5287SAndreas GohrBoth classes expose the same API, the only difference is their way of accessing the data. 45*596d5287SAndreas Gohr 46*596d5287SAndreas GohrA FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index. 47*596d5287SAndreas Gohr 48*596d5287SAndreas GohrThe MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory. 49*596d5287SAndreas Gohr 50*596d5287SAndreas GohrWhich method to use depends mostly on the size of the file. 51*596d5287SAndreas Gohr 52*596d5287SAndreas GohrUsually indexes are not accessed directly but through a collection. That collection will manage which type of access to use. 53*596d5287SAndreas Gohr 54*596d5287SAndreas GohrWithin an index two kinds of data can be stored per row: 55*596d5287SAndreas Gohr 56*596d5287SAndreas Gohr * A single value. Eg. an entity or a token 57*596d5287SAndreas Gohr * A list of tuples. Eg. a list of pageIDs and frequencies 58*596d5287SAndreas Gohr 59*596d5287SAndreas GohrThe former is straight forward, it's a simple ''rid -> value'' store. The latter maps to ''rid -> [key -> value, ...]'' where key is usally the ''rid'' in another index. 60*596d5287SAndreas Gohr 61*596d5287SAndreas Gohr===== Locking ===== 62*596d5287SAndreas Gohr 63*596d5287SAndreas GohrOnly one process may write to an index at any time. To ensure this, a locking mechanism has to be employed. 64*596d5287SAndreas Gohr 65*596d5287SAndreas GohrIndexes can be read in write or readonly mode according to the acquired locks. However, managing locks has to be done outside the index. Usually within a collection. 66*596d5287SAndreas Gohr 67*596d5287SAndreas GohrThe ''Lock'' class is used to acquire the needed locks. 68