| #
2d02fff5 |
| 19-Jan-2026 |
Andreas Gohr <gohr@cosmocode.de> |
avoid deleting non-existent chunks. fixes #46
|
| #
ae2d01b1 |
| 06-Oct-2025 |
Andreas Gohr <andi@splitbrain.org> |
Merge branch 'sentencesplit' into partner
* sentencesplit:
  add tests for text splitting
  make overlap a class member for easier testing
  Agents: make clearer how to run tests
  added an AGENTS.md file for LLM based work
  split sentences by token, not bytes. handle UTF-8
  move text splitting into its own class
  Some enhancements on the subsentence splitting
  Squashed commit of the following:
|
| #
072e0099 |
| 06-Oct-2025 |
Andreas Gohr <gohr@cosmocode.de> |
move text splitting into its own class
|
| #
3daef465 |
| 06-Oct-2025 |
Andreas Gohr <gohr@cosmocode.de> |
Some enhancements on the subsentence splitting
When a sentence is longer than a chunk, it should be split forcefully into smaller parts. These parts should NOT be the size of a full chunk, since we still want some overlap with the preceding and following text. I chose to split into quarters of a chunk.
This also ensures that whitespace is kept for the split sentences, because they may be joined with follow-up texts.
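The quarter-chunk split described in this commit can be sketched roughly as follows. This is a Python illustration, not the plugin's PHP code: the function name `split_long_sentence` mirrors the `splitLongSentence()` mentioned elsewhere in this log, but its signature is an assumption, and whitespace-delimited words stand in for real tokenizer tokens.

```python
import re


def split_long_sentence(sentence: str, chunk_size: int) -> list[str]:
    """Split an over-long sentence into parts of at most a quarter chunk."""
    # Assumption: one whitespace-delimited word stands in for one token.
    max_tokens = max(1, chunk_size // 4)
    # Each piece keeps its trailing whitespace, so joining the parts
    # reproduces the original sentence exactly.
    pieces = re.findall(r"\S+\s*|\s+", sentence)
    parts = []
    for start in range(0, len(pieces), max_tokens):
        parts.append("".join(pieces[start:start + max_tokens]))
    return parts
```

Because the parts are much smaller than a chunk, several of them can still be combined with neighbouring text when chunks are assembled.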
|
| #
867b7752 |
| 06-Oct-2025 |
Henry <henry.krupp@gmail.com> |
Squashed commit of the following:
commit 4e0adf2a8d810e55db6d37ccc87c76d95ddcfd8d Author: Henry <henry.krupp@gmail.com> Date: Mon Feb 3 22:04:17 2025 +0100
Updated splitLongSentence()
commit 9883844f1db6df9e11051c4c7b034e68baaca0be Author: Henry <henry.krupp@gmail.com> Date: Mon Feb 3 22:03:25 2025 +0100
Updated splitLongSentence()
commit 6f737f6fe4da25fa438211d5c00605c2df9c81ba Author: Henry <henry.krupp@gmail.com> Date: Mon Feb 3 21:43:16 2025 +0100
array_unshift($sentences, ...$this->splitLongSentence($sentence, $tiktok));
commit 21966eab02f87f632e82ad0055f9bc2aadb92053 Author: Henry <henry.krupp@gmail.com> Date: Mon Feb 3 21:23:40 2025 +0100
Updated splitIntoChunks method
Push split sentences to the front of the queue with array_unshift($sentences, ...$this->splitLongSentence($sentence, $tiktok));
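The queue handling mentioned in the last commit message (pushing split sentences back to the front with `array_unshift`) might look like the sketch below. It is a Python illustration under stated assumptions, not the plugin's implementation: `count_tokens`, `split_long_sentence` and `split_into_chunks` are hypothetical helpers, words stand in for tokens, and chunk overlap is left out.

```python
import re
from collections import deque

PIECE = re.compile(r"\S+\s*|\s+")


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-delimited word counts as one token.
    return len(PIECE.findall(text))


def split_long_sentence(sentence: str, chunk_size: int) -> list[str]:
    # Quarter-chunk parts, each keeping its trailing whitespace.
    step = max(1, chunk_size // 4)
    pieces = PIECE.findall(sentence)
    return ["".join(pieces[i:i + step]) for i in range(0, len(pieces), step)]


def split_into_chunks(sentences: list[str], chunk_size: int) -> list[str]:
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    queue = deque(sentences)
    chunks, current, current_len = [], [], 0
    while queue:
        sentence = queue.popleft()
        n = count_tokens(sentence)
        if n > chunk_size:
            # Same effect as PHP's array_unshift($sentences, ...$parts):
            # the parts go to the front of the queue in their original order.
            queue.extendleft(reversed(split_long_sentence(sentence, chunk_size)))
            continue
        if current and current_len + n > chunk_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Putting the parts at the front rather than the back keeps the text in reading order, so the pieces of a long sentence end up in adjacent chunks.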
|
| #
9634d734 |
| 21-May-2025 |
Andreas Gohr <gohr@cosmocode.de> |
add option to always send full page context
|
| #
7be8078e |
| 15-Apr-2025 |
Andreas Gohr <andi@splitbrain.org> |
allow models to have a zero token limit
This allows for configuring completely unknown models. For these models no token limit is known and we simply do not apply any. Instead we trust that the model will either be large enough to handle our input or at least throw useful error messages.
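A minimal sketch of the "zero means no limit" convention, in Python rather than the plugin's PHP. The function name `fits_token_limit` and its parameters are hypothetical; only the meaning of a zero limit comes from the commit message.

```python
def fits_token_limit(token_count: int, max_tokens: int) -> bool:
    """Return True if the input may be sent to the model."""
    # A limit of 0 marks an unknown model: no limit is applied and
    # the model itself is trusted to cope or to report an error.
    if max_tokens == 0:
        return True
    return token_count <= max_tokens
```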
|
| #
ed47fd87 |
| 27-Mar-2025 |
Andreas Gohr <andi@splitbrain.org> |
new UI with option to chat about the current page
|
| #
aa6bbe75 |
| 12-Mar-2025 |
Andreas Gohr <andi@splitbrain.org> |
added "similar" endpoint to the remote api
|
| #
c2f55081 |
| 22-Jul-2024 |
Andreas Gohr <andi@splitbrain.org> |
show used query when doing similarity queries
|
| #
661701ee |
| 25-Jun-2024 |
Andreas Gohr <andi@splitbrain.org> |
Use custom renderer when creating embeddings
Rendering makes plugin output available and handles includes. It might also help with #15. The renderer uses markdown-like output since all LLMs seem to be very familiar with its syntax. This might help them understand the document structure better. This also adds a breadcrumb trail at the top of each chunk, which might help with contextualization as well.
|
| #
303d0c59 |
| 17-Jun-2024 |
Andreas Gohr <andi@splitbrain.org> |
gracefully handle render errors
Plugins may act up during text rendering; this should not abort the whole indexing. Instead we fall back to the page source.
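The fallback described here could be sketched as below. This is a Python illustration only: `text_for_indexing`, `render` and `read_source` are hypothetical names, and the real plugin is written in PHP.

```python
from typing import Callable


def text_for_indexing(
    page_id: str,
    render: Callable[[str], str],
    read_source: Callable[[str], str],
) -> str:
    """Prefer rendered text, but never let a render error stop indexing."""
    try:
        return render(page_id)
    except Exception:
        # A misbehaving plugin must not abort the whole run,
        # so the raw page source is used instead.
        return read_source(page_id)
```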
|
| #
8c08cb3f |
| 27-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
auto style fixes
|
| #
ab1f8dde |
| 26-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
emit the INDEXER_PAGE_ADD event
This allows plugins that add data to the fulltext index to add the same data to the embeddings. This improves embedding searches with struct data for example.
|
| #
720bb43f |
| 25-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
make threshold configurable
|
| #
2071dced |
| 21-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
automatic stylefixes
|
| #
5f71c9bb |
| 21-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
small adjustments
|
| #
c2b7a1f7 |
| 21-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
various refactoring and introduction of a simulate command
The new command makes it easier to run the same chat questions against multiple models and compare the results in a spreadsheet
|
| #
ecb0a423 |
| 19-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
do not hardcode dimensions in qdrant storage
|
| #
e3640be8 |
| 19-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
clean up of the config options
Emojis are used to make the different options easier to distinguish
|
| #
34a1c478 |
| 19-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
more refactoring on chat and embed model support
* differentiate between input and output tokens
* make use of much larger input contexts
|
| #
294a9eaf |
| 18-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
Use interfaces for Chat and Embedding classes
This way it's easier to have a base OpenAI class. This also moves much of the statistics and http handling into the base class making model implementations even leaner
|
| #
6a18e0f4 |
| 14-Mar-2024 |
Andreas Gohr <andi@splitbrain.org> |
First start on refactoring the class hierarchy
This splits embedding models from chat completion models.
|
| #
d5c102b3 |
| 29-Jan-2024 |
Andreas Gohr <andi@splitbrain.org> |
Regular expressions to limit the indexed pages. Implements #5
Both regular expressions (when set) need to apply at the same time. That is, a page MUST match the matchRegex and MUST NOT match the skipRegex to be indexed.
The regular expressions are applied when running the `embed` command line command. Pages no longer adhering to a changed regex setup will be removed from the vector store.
For the sqlite storage it is recommended to re-cluster the index when the regexes are changed by running the `maintenance` command.
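The combined rule for the two expressions can be illustrated with a short Python sketch. The function `should_index` and its parameter names are assumptions; only the rule itself (match the one, do not match the other, and ignore unset expressions) is taken from the commit message.

```python
import re


def should_index(page_id: str, match_regex: str = "", skip_regex: str = "") -> bool:
    """Decide whether a page belongs in the vector store."""
    # An empty expression counts as "not set" and is ignored.
    if match_regex and not re.search(match_regex, page_id):
        return False
    if skip_regex and re.search(skip_regex, page_id):
        return False
    return True
```

For example, with `match_regex=r"^wiki:"` and `skip_regex=r"sandbox"`, the page `wiki:syntax` is kept while `wiki:sandbox` and `playground:test` are dropped.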
|
| #
30b9cbc7 |
| 08-Nov-2023 |
splitbrain <splitbrain@users.noreply.github.com> |
Automatic code style fixes
|