| 53307a6b | 09-May-2026 |
Andreas Gohr <andi@splitbrain.org> |
Delete inc/Search/concept.txt
The contents have been added to the wiki |
| 8788dbbd | 06-May-2026 |
splitbrain <86426+splitbrain@users.noreply.github.com> |
Rector and PHPCS fixes |
| 4f29a5b9 | 06-May-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: fix comment position
single line comment moved to the wrong line on reformatting |
| 06053dca | 10-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: remove write side effect from retrieveRow()
retrieveRow() padded the index file when the requested RID was beyond the current length. This was an optimization for subsequent changeRow() calls, but changeRow() already handles padding on its own. The side effect was also inconsistent with retrieveRows() which is a pure read.
Added a cross-index integration test verifying RID consistency across entity, token, frequency and reverse indexes when multiple entities share tokens.
|
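The read/write split described in 06053dca can be illustrated with a small language-neutral sketch. It is not the FileIndex code: `LineIndex`, `retrieve_row` and `change_row` are hypothetical names, and a Python list stands in for the index file.

```python
class LineIndex:
    """Hypothetical in-memory stand-in for a line-based index file."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def retrieve_row(self, rid):
        # Pure read: a RID beyond the current length yields an empty row
        # and leaves the stored rows untouched.
        return self.rows[rid] if 0 <= rid < len(self.rows) else ""

    def change_row(self, rid, value):
        # The write path does its own padding with empty rows.
        if rid >= len(self.rows):
            self.rows.extend([""] * (rid + 1 - len(self.rows)))
        self.rows[rid] = value


index = LineIndex(["alpha", "beta"])
missing = index.retrieve_row(5)       # read past the end
length_after_read = len(index.rows)   # unchanged by the read
index.change_row(4, "gamma")          # write pads rows 2 and 3
```

Keeping the padding in the write path only means a read can never change the file length as a side effect.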
| 5d034a75 | 08-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: increase index version |
| 9369b4a9 | 08-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: rector, phpcs, type hint fixes |
| db8be586 | 08-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: review fixes — auto-save MemoryIndex, cast TupleOps counts, style cleanups
- MemoryIndex: auto-save dirty data on unlock/destruction to prevent silent index corruption when indexes are used in tandem
- TupleOps::parseTuples(): cast exploded count strings to int
- FileIndex::retrieveRow(): document the write-on-read padding behavior
- Fix whitespace issues in ApiCore, common.php, Sitemap/Mapper
- Update concept.txt to reflect MemoryIndex auto-save behavior
|
| 2a22d4b9 | 08-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: document Tokenizer::isValidSearchTerm() in concept.txt |
| 1148921d | 08-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: unify CollectionSearch API and optimize search pipeline
- Remove separate lookup() API from CollectionSearch. All searches now use addTerm()/execute() with a single unified pipeline.
- Add matches() predicate to Term using efficient string functions (===, str_starts_with, str_ends_with, str_contains) instead of regex.
- Add caseInsensitive() support on CollectionSearch and Term for metadata/title searches where indexed values preserve case.
- Remove callback support from MetadataSearch::lookupKey(): the only real usage (case-insensitive substring) is replaced by caseInsensitive() + wildcards.
- Remove min-length validation from Term. Add Tokenizer::isValidSearchTerm() for callers that need it (FulltextSearch, Indexer::lookup).
- Optimize execute() from 4 group passes to 2: scan tokens + resolve frequencies in one pass per group, batch entity name resolution, then populate Terms.
- Store full match detail in Term: entity → token → frequency. New accessors getMatches(), getEntityTokens(), getEntityFrequencies() derive different views from this single data structure.
- Term is no longer used as a scratch pad by CollectionSearch. Index-internal data (token IDs, entity IDs) stays local to execute(). Terms receive only final resolved results.
- Use title from search results in MetadataSearch::pageLookupCallBack() instead of re-fetching via p_get_first_heading().
- Update concept.txt documentation.
|
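A regex-free match predicate of the kind 1148921d describes might look like the sketch below. It is an illustration in Python, not Term::matches() itself. The function name, the `case_insensitive` flag and the assumption that a leading or trailing `*` marks a wildcard are all hypothetical.

```python
def matches(pattern, token, case_insensitive=False):
    """Return True if token matches pattern; a leading/trailing * is a wildcard."""
    if case_insensitive:
        # casefold() is a stricter form of lower() for non-ASCII text
        pattern, token = pattern.casefold(), token.casefold()
    starts = pattern.startswith("*")
    ends = len(pattern) > 1 and pattern.endswith("*")
    begin = 1 if starts else 0
    stop = len(pattern) - 1 if ends else len(pattern)
    core = pattern[begin:stop]
    if starts and ends:
        return core in token          # *foo* -> substring
    if starts:
        return token.endswith(core)   # *foo  -> suffix
    if ends:
        return token.startswith(core) # foo*  -> prefix
    return token == core              # foo   -> exact
```

Each branch is a single string comparison, so no pattern has to be compiled or escaped per term.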
| b9d7a615 | 07-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: updated documentation
to be moved into the wiki later |
| 0b52f0de | 07-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: use FileIndex for title token index
PageTitleCollection accesses titles by RID (one line at a time), so loading the entire index into memory is wasteful. Override getTokenIndex() to return a FileIndex, matching the line-by-line access pattern used on master.
|
| e1272c08 | 07-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: add backward compatibility wrappers
Add deprecated wrappers for idx_* and ft_* functions that were removed when inc/indexer.php and inc/fulltext.php were replaced by the new Search classes. These wrappers delegate to the new architecture and ensure existing plugins continue to work.
Deprecated standalone functions: idx_get_indexer, idx_getIndex, idx_lookup, idx_listIndexLengths, idx_indexLengths, ft_pageSearch, ft_backlinks, ft_mediause, ft_pageLookup, ft_snippet, ft_pagesorter, ft_snippet_re_preprocess, ft_queryParser.
Deprecated methods on Indexer: lookupKey, getPages, addMetaKeys, renameMetaValue, getPID, lookup.
Also migrates remaining core callers (Ajax, FeedCreator, ApiCore) to use the new classes directly and fixes a UTF-8 case folding bug in MetadataSearch title lookups.
|
| 74a9499c | 07-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: remove legacy intermediate classes from PR #2943
Remove FulltextIndex, MetadataIndex, and the old AbstractIndex which were introduced as a stepping stone in #2943. All callers now use the Collection/Index architecture directly.
Also fix a bug in detail.php where mediause() was called with ignore_perms=true, leaking references from hidden/protected pages to unprivileged users. This bug existed on master as well.
Old test files replaced by their modernized equivalents in tests/Search/.
|
| 21fbd01b | 07-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: add integrity checking to Collection architecture
Add checkIntegrity() to AbstractCollection and DirectCollection that verifies paired indexes have matching line counts (token==frequency, entity==reverse, entity==token for direct collections). Throws IndexIntegrityException on the first inconsistency found.
Add Countable interface to AbstractIndex with count() implementations in MemoryIndex and FileIndex. Add Indexer::checkIntegrity() and Indexer::isIndexEmpty() to orchestrate checks across all collections.
Update infoutils.php to use the new Indexer API instead of the old FulltextIndex/MetadataIndex classes.
Fix range(1, 0) bug in three places that produced [1, 0] instead of an empty array when split-by-length indexes were empty.
|
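The line-count comparison in 21fbd01b can be sketched as follows. This is a simplified Python illustration, not the checkIntegrity() implementation. `IndexIntegrityError`, `check_integrity` and the two pair lists are hypothetical names, and the line counts are passed in as a plain dictionary.

```python
class IndexIntegrityError(Exception):
    """Hypothetical stand-in for IndexIntegrityException."""


def check_integrity(counts, pairs):
    """counts maps index name -> line count; pairs lists names that must match."""
    for left, right in pairs:
        if counts[left] != counts[right]:
            # Stop at the first inconsistency found.
            raise IndexIntegrityError(
                f"{left} has {counts[left]} lines, {right} has {counts[right]}"
            )


# Pairings named in the commit message.
FREQUENCY_PAIRS = [("token", "frequency"), ("entity", "reverse")]
DIRECT_PAIRS = [("entity", "token")]
```

For example, `check_integrity({"entity": 4, "token": 3}, DIRECT_PAIRS)` would raise because a direct collection expects one token row per entity row.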
| 6734bb8c | 07-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: rewrite MetadataSearch to use Collection classes
Replace MetadataIndex usage in MetadataSearch with the new Collection/Index architecture. This completes the read-path migration so data written by the Collection-based Indexer is read back correctly using TupleOps tuple format.
Generalize FrequencyCollectionSearch into CollectionSearch that works with any AbstractCollection type (Frequency, Lookup, Direct) and handles both split-by-length and non-split index layouts transparently. DirectCollection participates via resolveTokenFrequencies() which maps token RID = entity RID.
Key changes:
- AbstractCollection gains isSplitByLength(), resolveTokenFrequencies(), getEntitiesWithData(), and groupToSuffix() with validation
- Index groups are now int (0 = non-split, positive = token length)
- CollectionSearch provides both addTerm()/execute() for fulltext and lookup() for metadata-style search (exact/wildcard/callback)
- MetadataSearch delegates entirely to collection APIs
- Shared filterPages() replaces duplicated page filtering logic
- All callers updated from MetadataIndex to MetadataSearch
- Tests moved to Search namespace with full coverage for new APIs
|
| ede46466 | 06-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: reorganize and expand test suite
Move all Search tests from _test/tests/inc/Search/ to _test/tests/Search/ to match the dokuwiki\test autoloader convention. Fix namespaces from tests\* to dokuwiki\test\* so all tests work in isolation.
Extract inline test helpers into separate autoloadable mock files: TestDirectCollection → MockDirectCollection, TestLookupCollection → MockLookupCollection, TestFrequencyCollection → MockFrequencyCollection.
Rename AbstractIndexTest → AbstractIndexTestCase to fix PHPUnit warning about abstract classes with Test suffix.
Replace dead xxxRealWord() with proper testWildcardSearch() verifying exact token matches and frequencies for all three wildcard types. Add testTokenizedPageSearch() using a dedicated test data file. Add testNoMatchReturnsEmptyFrequencies() which exposed a bug in Term where uninitialized $tokens/$frequencies caused crashes on zero-match terms.
Replace fulltext_query.test.php with modern QueryParserTest in the Search\Query namespace.
Add new test files:
- LockTest: acquire/release, reference counting, stale lock override, foreign lock rejection, releaseAll, independent locks
- NamespacePredicateTest: filter/exclude, sub-namespaces, partial prefix safety, empty sets, score preservation
- PageSetTest: intersect, unite, subtract, isEmpty
- QueryEvaluatorTest: word lookups, AND/OR/NOT, namespace filtering, combined queries, partial namespace prefix safety
Fix Term.php: initialize $tokens and $frequencies to [] instead of null.
|
| 0b1bbbbb | 06-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: rewrite FulltextSearch to use FrequencyCollectionSearch
Replace FulltextIndex->lookupWords() with FrequencyCollectionSearch which correctly handles the compact tuple format written by the new Indexer.
Introduce QueryEvaluator with typed stack entries (PageSet, NamespacePredicate, NegatedEntry) for RPN query evaluation. NOT wraps its operand instead of computing a universe complement, so AND with a negated operand becomes efficient set subtraction. The full page index is only loaded for standalone negative or namespace-only queries.
Move QueryParser and QueryEvaluator into the new Search\Query namespace along with the stack entry types.
Simplify FulltextSearch to orchestration: parse query, look up words, evaluate, filter, sort. Replace FT_SNIPPET_NUMBER constant with maxSnippets property. Combine ACL/existence/time filtering into a single pass.
|
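The idea of wrapping a NOT operand rather than complementing it against all pages (0b1bbbbb) can be sketched in a few lines. This is an assumption-laden Python illustration, not the QueryEvaluator code. `Negated`, `op_not`, `op_and` and `evaluate` are hypothetical names, page sets are plain frozensets without scores, and OR and namespace predicates are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Negated:
    """Wraps an operand instead of computing 'all pages minus operand'."""
    pages: frozenset


def op_not(operand):
    # Double negation unwraps again.
    return operand.pages if isinstance(operand, Negated) else Negated(operand)


def op_and(left, right):
    if isinstance(left, Negated) and isinstance(right, Negated):
        # NOT a AND NOT b == NOT (a OR b); still no universe needed
        return Negated(left.pages | right.pages)
    if isinstance(left, Negated):
        return right - left.pages
    if isinstance(right, Negated):
        return left - right.pages     # AND with a negated operand: subtraction
    return left & right


def evaluate(rpn, lookup):
    """Evaluate an RPN token list; lookup maps a word to its page set."""
    stack = []
    for token in rpn:
        if token == "NOT":
            stack.append(op_not(stack.pop()))
        elif token == "AND":
            right, left = stack.pop(), stack.pop()
            stack.append(op_and(left, right))
        else:
            stack.append(frozenset(lookup.get(token, ())))
    return stack.pop()
```

If the final result is still a `Negated` entry, as for a standalone negative query, the caller has to resolve it against the full page list. That is the only case in this sketch where the full list would be needed.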
| 83b3accc | 06-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: rewrite Indexer to use Collection classes
Replace the intermediate #2943 classes (FulltextIndex, MetadataIndex) with the new Collection-based architecture. The Indexer is now a thin stateless orchestrator that delegates all index work to collections.
Key changes:
- Indexer no longer extends AbstractIndex; page name passed to methods
- addPage/deletePage/clear use PageTitleCollection, PageFulltextCollection, and PageMetaCollection
- New PageMetaCollection replaces separate ReferencesCollection and MediaCollection with a single class that handles arbitrary metadata keys dynamically
- Shared writable FileIndex('page') passed to all collections
- Logger callback replaces verbose parameter
- Methods return void instead of bool
- Index classes implement IteratorAggregate for clean data access
- Indexer tests consolidated into namespaced IndexerTest.php
- All callers updated to new stateless API
|
| 95b16223 | 05-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: accept pre-instantiated entity and token indexes in collections
Allow passing AbstractIndex objects for the entity and token parameters instead of string names. This enables sharing index instances between collections for efficiency.
|
| c66b5ec6 | 05-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: rewrite Lock as static registry with reference counting
Replace the instance-based Lock class with a static registry that tracks held locks per-process with reference counting. This solves three problems:
- Split indexes (w3, w4, ...) share a single lock name and now coordinate naturally via the registry
- Multiple callers can acquire the same lock without conflict
- Indexes enforce their own writability through lock()/unlock() methods on AbstractIndex
The Lock registry manages both the filesystem lock (mkdir) and the in-process tracking. The first acquire creates the directory, subsequent acquires increment the refcount. Release decrements, and only removes the directory when the count reaches zero.
Note: I am not sure if implementing this as a static object is a great idea or if we should pass an instance through the collection to the indexes...
|
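The acquire/release behavior described in c66b5ec6 can be approximated with the sketch below. It is a minimal Python illustration under stated assumptions, not the Lock class. `LockRegistry`, the `.lock` directory suffix and the method signatures are hypothetical, and stale lock handling is omitted.

```python
import os
import tempfile


class LockRegistry:
    """Hypothetical per-process registry: lock name -> reference count."""

    _held = {}

    @classmethod
    def acquire(cls, lock_dir, name):
        if cls._held.get(name, 0) == 0:
            try:
                # mkdir fails if the directory exists, i.e. someone else holds it
                os.mkdir(os.path.join(lock_dir, name + ".lock"))
            except FileExistsError:
                return False
        cls._held[name] = cls._held.get(name, 0) + 1
        return True

    @classmethod
    def release(cls, lock_dir, name):
        count = cls._held.get(name, 0)
        if count == 0:
            return False  # never release a lock this process does not hold
        if count == 1:
            os.rmdir(os.path.join(lock_dir, name + ".lock"))
            del cls._held[name]
        else:
            cls._held[name] = count - 1
        return True


with tempfile.TemporaryDirectory() as lock_dir:
    path = os.path.join(lock_dir, "w.lock")
    first = LockRegistry.acquire(lock_dir, "w")    # creates the directory
    second = LockRegistry.acquire(lock_dir, "w")   # only bumps the count
    LockRegistry.release(lock_dir, "w")
    still_locked = os.path.isdir(path)             # one holder remains
    LockRegistry.release(lock_dir, "w")
    unlocked = not os.path.isdir(path)             # count reached zero
```

Because the registry is keyed by lock name, several split indexes that share one name would simply increment the same counter.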
| 0a9fafed | 05-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: fix lock() releasing foreign locks on partial failure
Track successfully acquired locks in $lockedIndexes so that unlock() only releases locks this collection actually holds. Previously, a failed lock acquisition would call unlock() which released all index locks including ones never acquired, potentially releasing locks held by other processes.
|
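The fix in 0a9fafed amounts to rolling back only what was actually acquired. A hedged Python sketch of that pattern follows. The `Collection` class here, its constructor arguments and the `locked_indexes` list are hypothetical stand-ins, and the lock primitives are injected as plain callables.

```python
class Collection:
    """Hypothetical collection that locks several named indexes together."""

    def __init__(self, index_names, try_lock, release):
        self.index_names = index_names
        self.try_lock = try_lock      # callable(name) -> bool
        self.release = release        # callable(name) -> None
        self.locked_indexes = []      # only locks this object acquired

    def lock(self):
        for name in self.index_names:
            if not self.try_lock(name):
                self.unlock()         # rolls back only what we acquired
                return False
            self.locked_indexes.append(name)
        return True

    def unlock(self):
        while self.locked_indexes:
            self.release(self.locked_indexes.pop())


foreign = {"token"}                   # pretend another process holds this one
released = []
collection = Collection(
    ["entity", "token", "frequency"],
    lambda name: name not in foreign,
    released.append,
)
ok = collection.lock()
```

In this run the second acquisition fails, so only the first lock is released and the foreign one is never touched.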
| d92c078c | 05-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: add DirectCollection for 1:1 entity-token mappings
Introduce DirectCollection as a third collection type alongside FrequencyCollection and LookupCollection. Direct collections store exactly one token per entity at the entity's position in the token index (entity.RID === token.RID), with no frequency or reverse indexes.
AbstractCollection now accepts optional frequency/reverse index names (default to '') and skips locking empty index names.
Adds PageTitleCollection as the first concrete direct collection for the page -> title mapping.
|
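The invariant entity.RID === token.RID from d92c078c can be shown with two parallel lists. The sketch is illustrative Python only. `DirectMapping` and its methods are hypothetical and say nothing about how DirectCollection stores data on disk.

```python
class DirectMapping:
    """Hypothetical 1:1 mapping: the token sits at the entity's row ID."""

    def __init__(self):
        self.entities = []   # entity index: RID -> entity name
        self.tokens = []     # token index:  RID -> token (same RID)

    def set(self, entity, token):
        if entity in self.entities:
            rid = self.entities.index(entity)
        else:
            rid = len(self.entities)
            self.entities.append(entity)
            self.tokens.append("")    # keep both lists the same length
        self.tokens[rid] = token      # overwrite in place, no reverse index
        return rid

    def get(self, entity):
        return self.tokens[self.entities.index(entity)]


titles = DirectMapping()
titles.set("start", "Welcome")
titles.set("wiki:syntax", "Formatting Syntax")
titles.set("start", "Home")           # replaces the token at the same RID
```

With one token per entity there is nothing to count and nothing to look up in reverse, which is why no frequency or reverse index is needed.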
| f2bbffb5 | 05-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
SearchIndex: extract Collection base class hierarchy
Introduce AbstractCollection as the shared base for all index collections, with FrequencyCollection and LookupCollection as the two abstract subclasses differing only in how tokens are counted (frequency vs dedup).
Key design decisions:
- splitByLength is a constructor parameter on AbstractCollection controlling whether token/frequency indexes use length-based file splitting. This is independent of the collection type.
- The reverse index format is self-describing: entries with * have a group prefix (split), entries without don't (non-split). No branching needed in parse/format methods.
- addEntity, resolveTokens, updateIndexes, and reverse index handling all live in AbstractCollection. Subclasses only implement countTokens().
Concrete collections: PageFulltextCollection (frequency, split), MediaCollection and ReferencesCollection (lookup, non-split).
Renames FulltextCollection -> PageFulltextCollection and FulltextCollectionSearch -> FrequencyCollectionSearch.
|
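The single point of variation named in f2bbffb5, countTokens(), can be sketched as below. This is a Python illustration, not the project's classes. The method name `count_tokens` is hypothetical, and storing a count of 1 for every deduplicated token is an assumption about what "dedup" means.

```python
from abc import ABC, abstractmethod
from collections import Counter


class AbstractCollection(ABC):
    @abstractmethod
    def count_tokens(self, tokens):
        """Map each token to the count stored for one entity."""


class FrequencyCollection(AbstractCollection):
    def count_tokens(self, tokens):
        # Keep how often each token occurs.
        return dict(Counter(tokens))


class LookupCollection(AbstractCollection):
    def count_tokens(self, tokens):
        # Only presence matters: deduplicate and store 1 (assumed).
        return dict.fromkeys(tokens, 1)
```

For the input `["a", "b", "a"]` the frequency variant yields `{"a": 2, "b": 1}` and the lookup variant yields `{"a": 1, "b": 1}`. Everything else would live in the shared base class.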
| 8ae94493 | 30-Oct-2025 |
Andreas Gohr <gohr@cosmocode.de> |
update SearchIndex concept doc |
| fb5311ec | 30-Oct-2025 |
Andreas Gohr <gohr@cosmocode.de> |
SearchIndex: RID cache should not be static
A static var interferes when the same class is instantiated multiple times |