inc/Search/concept.txt

*596d5287SAndreas Gohr====== Search Indexing ======
*596d5287SAndreas Gohr
*596d5287SAndreas GohrThis document is meant to describe the various concepts behind the indexing mechanism. It's work in progress an will be moved to the wiki once the new index code is merged.
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr===== Uses =====
*596d5287SAndreas Gohr
*596d5287SAndreas GohrThe indexing mechanism is meant to make information that is normally distributed over several locations (eg. words on pages) available through a central, faster mechanism. The primary goal is to cover fulltext search, but it is also used for other things like page meta data and possibly more in the future.
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr===== Collections =====
*596d5287SAndreas Gohr
*596d5287SAndreas GohrA collection describes how data is aggregated into multiple indexes to make it accessible for a specific use case. Eg. fulltext search for page contents is a usecase covered by a collection.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrPlease note: because index has a specific meaning in our context (see below) you should avoid using that word, when you're actually talking about a collection. There is no "fulltext index" - that functionality is only achieved by using multiple indexes in a collection.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrThere are basically two different complexities of collections
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr  * frequency collections - The same token can appear multiple times in the same entity and searches are usually interested in the number of times it appears. This is the words on pages use case.
*596d5287SAndreas Gohr  * lookup collections - A token only appears once per entity. Search results are interested in looking up entities only. This would be used for the meta data search for example. FIXME not implemented yet
*596d5287SAndreas Gohr
*596d5287SAndreas GohrA collection works on four indexes:
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr  * entity - This is the main entity that will be the result of a search. Eg. a page
*596d5287SAndreas Gohr  * token - This is the actual information strewn across the entities. Eg. words
*596d5287SAndreas Gohr  * frequency - This maps tokens to entities and records their frequency. Eg. freq(words)->page
*596d5287SAndreas Gohr  * reverse - This records which tokens are on a specific entities. This is mostly used for internal index cleanup. Eg. page-words
*596d5287SAndreas Gohr
*596d5287SAndreas GohrFIXME the above may not be true for all collections
*596d5287SAndreas Gohr
*596d5287SAndreas GohrCollections can be searched for one or more ''terms'' via a CollectionSearch class. In the most simple case a term is a single token. But it is also possible to use wildcards signifiers ''*'' at the start and end of a term. In that case, the term refers to a list of tokens. A CollectionSearch will return the frequency with which each term occurs on each entity.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrFIXME above will most likely change once we have non-frequency collections
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr===== Indexes ======
*596d5287SAndreas Gohr
*596d5287SAndreas GohrIndexes refer to individual index files that store one kind of information. E.g. a list of all page names or a list of page-word frequencies.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrIndexes are row based. The line number is important information of the index. The lines are counted from zero and referred to as ''rid'' in the code.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrIndex files can be accessed through two classes:
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr  * \dokuwiki\Search\Index\FileIndex
*596d5287SAndreas Gohr  * \dokuwiki\Search\Index\MemoryIndex
*596d5287SAndreas Gohr
*596d5287SAndreas GohrBoth classes expose the same API, the only difference is their way of accessing the data.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrA FileIndex will read through the index file line-by-line without ever loading the full file into memory. Each modification will directly write back to the index.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrThe MemoryIndex loads the whole file into an internal array. Changes are only written back when explicitly calling the ''save()'' method. A memory index is faster but requires more memory.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrWhich method to use depends mostly on the size of the file.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrUsually indexes are not accessed directly but through a collection. That collection will manage which type of access to use.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrWithin an index two kinds of data can be stored per row:
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr  * A single value. Eg. an entity or a token
*596d5287SAndreas Gohr  * A list of tuples. Eg. a list of pageIDs and frequencies
*596d5287SAndreas Gohr
*596d5287SAndreas GohrThe former is straight forward, it's a simple ''rid -> value'' store.  The latter maps to ''rid -> [key -> value, ...]'' where key is usally the ''rid'' in another index.
*596d5287SAndreas Gohr
*596d5287SAndreas Gohr===== Locking =====
*596d5287SAndreas Gohr
*596d5287SAndreas GohrOnly one process may write to an index at any time. To ensure this, a locking mechanism has to be employed.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrIndexes can be read in write or readonly mode according to the acquired locks. However, managing locks has to be done outside the index. Usually within a collection.
*596d5287SAndreas Gohr
*596d5287SAndreas GohrThe ''Lock'' class is used to acquire the needed locks.