| b1c59bed | 23-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
add GfmCode / GfmFile for fenced code blocks
GfmCode (backticks) emits the `code` handler instruction; GfmFile (tildes) emits `file`. Column-0 fences only, no length pairing between opener and closer, and unclosed fences stay literal — matching DokuWiki's `<code>` tag convention. The info string accepts DW's full attribute vocabulary (language, filename, [options]) through a new shared `Helpers::parseCodeAttributes` that `Code` also uses, with `html` aliased to `html4strict` and `-` meaning "no language".
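A minimal Python sketch of the info-string parsing described above. It is not the PHP `Helpers::parseCodeAttributes` itself: the function name, the return shape, and the `key="value"` option grammar are assumptions for illustration. Only the `html` alias and the `-` convention come from the commit message.

```python
import re

# Only the html -> html4strict alias is stated in the commit message.
LANGUAGE_ALIASES = {"html": "html4strict"}

def parse_code_attributes(info):
    """Split an info string into (language, filename, options)."""
    options = {}
    # Assumed option grammar: a trailing [key="value" ...] block.
    bracket = re.search(r"\[(.*)\]\s*$", info)
    if bracket:
        for key, value in re.findall(r'(\w+)="([^"]*)"', bracket.group(1)):
            options[key] = value
        info = info[:bracket.start()]
    parts = info.split()
    language = parts[0] if parts else None
    if language == "-":
        language = None  # "-" means "no language"
    language = LANGUAGE_ALIASES.get(language, language)
    filename = parts[1] if len(parts) > 1 else None
    return language, filename, options
```

Under these assumptions, `html index.html [enable_line_numbers="true"]` yields the language `html4strict`, the filename `index.html`, and one option.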
Preformatted's indent threshold is now preference-gated: 2 spaces in DW-preferred settings, 4 spaces in MD-preferred, matching GFM's indented code block rule. A single tab is a trigger in both.
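The preference gate can be pictured with a small Python sketch. The function name and the boolean flag are hypothetical, and the real mode works through lexer patterns rather than a per-line check.

```python
def is_preformatted_line(line, markdown_preferred):
    """Return True if the line's indent would trigger a preformatted block."""
    # A single leading tab triggers in both preference settings.
    if line.startswith("\t"):
        return True
    # 2 spaces in DW-preferred settings, 4 in MD-preferred settings.
    threshold = 4 if markdown_preferred else 2
    indent = len(line) - len(line.lstrip(" "))
    return indent >= threshold
```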
|
| 3440a8c0 | 22-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add GfmMedia and extend GfmLink with image-as-label form
- New GfmMedia parses `![alt](url)` with the full DokuWiki media-parameter vocabulary in the URL slot (?100x200, ?right, ?nolink, ?recache, …). Adds `?left`/`?right`/`?center` align keywords shared with DW `{{…}}`, which gives pure-Markdown users a way to align inline images.
- GfmLink now also matches `[![alt](img)](target)`, the GFM equivalent of `[[target|{{img}}]]`. Detection is post-entry, mirroring Internallink's `^{{…}}$` check; one mode covers the whole family.
- LinkDispatch trait replaced by Helpers::classifyLink and Helpers::parseMediaParameters: two pure static methods, shared by DW and GFM counterparts.
- Entry patterns for GfmLink / GfmMedia simplified (permissive URL slot, handle-time parsing), following DW's Internallink style.
- GfmSpecTest drives a test-only SpecCompatRenderer that emits bare <img> / <a> instead of DW's wiki-wrapped HTML, recovering 13 spec tests that previously failed or were skipped only because of renderer shape.
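As a rough illustration of handle-time parameter parsing, the Python sketch below splits a media target into a source and its parameters. It is not `Helpers::parseMediaParameters`: the `&` separator, the result keys, and the defaults are assumptions, and only the parameter names mentioned in this commit are recognized.

```python
import re

def parse_media_parameters(target):
    """Split 'src?param&param' into a dict of recognized media settings."""
    src, _, query = target.partition("?")
    result = {"src": src, "width": None, "height": None,
              "align": None, "nolink": False, "recache": False}
    for token in filter(None, query.split("&")):
        size = re.fullmatch(r"(\d+)(?:x(\d+))?", token)
        if size:
            # 100x200 sets both dimensions; a bare 100 sets only the width.
            result["width"] = int(size.group(1))
            result["height"] = int(size.group(2)) if size.group(2) else None
        elif token in ("left", "right", "center"):
            result["align"] = token
        elif token in ("nolink", "recache"):
            result[token] = True
    return result
```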
|
| e89aeebd | 22-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add GfmLink for GFM inline links `[text](url)`
Extracts the URL-classification ladder from Internallink into a LinkDispatch trait so both modes route identically across all six DokuWiki link flavors (internal, external, interwiki, email, windowsshare, local anchor). GfmLink parses the `[text](url)` form with optional `"title"` / `'title'` and hands the URL to the trait. The GFM title attribute is discarded — DokuWiki link instructions have no slot for it.
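A classification ladder of this kind could look like the Python sketch below. The order of the checks, the regular expressions, and the returned labels are all assumptions made for illustration; the commit message states only that six flavors are distinguished.

```python
import re

def classify_link(target):
    """Return a label for one of the six link flavors (assumed ladder)."""
    if target.startswith("#"):
        return "locallink"
    if re.match(r"^[A-Za-z0-9.]+>", target):
        return "interwikilink"
    if target.startswith("\\\\"):
        return "windowssharelink"
    if re.match(r"^[A-Za-z][A-Za-z0-9+.-]*://", target):
        return "externallink"
    if re.fullmatch(r"[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+", target):
        return "emaillink"
    # Anything that matched no earlier rung is treated as a wiki page.
    return "internallink"
```

Because the function is pure and static, a DW mode and its GFM counterpart can both call it and route identically.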
|
| 8719732d | 22-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add GfmHeader for ATX headings (`# text` through `###### text`)
Opener must sit at column 0. GFM tolerates 0-3 spaces before the `#` but that collides with DokuWiki's 2-space-indent preformatted block, so the tolerance is dropped rather than plumbed across modes.
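The column-0 rule can be sketched in Python as follows. The pattern and function name are illustrative assumptions, and closing `#` sequences and empty headings are deliberately left out.

```python
import re

# One to six '#' at column 0, then whitespace, then the heading text.
ATX_HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")

def parse_atx_heading(line):
    """Return (level, text) for a column-0 ATX heading, else None."""
    m = ATX_HEADING.match(line)
    if m is None:
        return None
    return len(m.group(1)), m.group(2)
```

A line such as `  # text` returns `None` here, so a 2-space indent stays available to the preformatted block.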
Widen the XHTML renderer's section-node tracker from 5 slots to 6 so h6 doesn't hit "Undefined array key 5". Extend GfmSpecTest's HTML normalizer to strip DokuWiki's section-div wrappers, section-edit comments, and header id/class attributes so heading spec examples can validate semantic correctness.
|
| 8ed75a23 | 22-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add GfmBacktickSingle / GfmBacktickDouble for GFM inline code spans
Two new inline formatting modes covering GFM code spans in their n=1 and n=2 forms:
GfmBacktickSingle `text` → <code>text</code>
GfmBacktickDouble ``text`` → <code>text</code>
Both emit monospace_open and monospace_close around an unformatted() call (the same instruction shape as DokuWiki's two-single-quote pair wrapping a nowiki span), so renderers that distinguish verbatim text from plain cdata — metadata, indexer, non-XHTML backends — treat the body as literal.
GfmBacktickDouble extends GfmBacktickSingle to reuse handle() and the body-normalization helper; only the delimiter length and the body character class differ. Both share sort 165 and gate on Markdown being loaded.
Design notes:
* The lexer has no backreferences, so each length is its own mode. Length-boundary guards (?<!`)...(?!`) on every opener and closer ensure a run of two-or-more backticks is never read as an n=1 delimiter and a run of three-or-more is never read as n=2. The two modes never steal each other's input regardless of registration order — sort can't reach this kind of cross-position constraint.
* Edge-whitespace handling and newline normalization live in handle(), not in the regex. On DOKU_LEXER_UNMATCHED the body is normalized:
  1. CR/LF and LF become single spaces (GFM line-ending rule).
  2. If the body starts and ends with a space and is not entirely whitespace, one space is stripped from each end.
  That produces the right GFM output for the tricky cases without special-casing the entry pattern:
    ` ` → <code> </code> (all-whitespace, no strip)
    ` a` → <code> a</code> (asymmetric, no strip)
    ` `` ` → <code>``</code> (interior run-of-2 + strip)
    ``foo`bar`` → <code>foo`bar</code>
* Body character classes admit exactly the runs that cannot be valid closers for this mode's length: n=1 allows `[^`] | ``+`, n=2 allows `[^`] | `(?!`)`. That is what lets a single-backtick span contain a pair and a double-backtick span contain a lone backtick.
* allowedModes is empty — no other inline parsing runs inside a span.
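The design notes above can be tried out with a Python sketch. It uses two standalone regexes in place of the two lexer modes, so it illustrates the length-boundary guards, the body character classes, and the normalization rules. It does not reproduce the actual entry and exit patterns, and the function names are hypothetical.

```python
import re

# n=1: a lone backtick delimiter; the body may contain runs of two or more.
SINGLE = re.compile(r"(?<!`)`(?!`)((?:[^`]|``+)+?)(?<!`)`(?!`)")
# n=2: a double backtick delimiter; the body may contain lone backticks.
DOUBLE = re.compile(r"(?<!`)``(?!`)((?:[^`]|`(?!`))+?)(?<!`)``(?!`)")

def normalize_code_span(body):
    # Rule 1: line endings become single spaces.
    body = body.replace("\r\n", " ").replace("\n", " ")
    # Rule 2: strip one space from each end unless the body is all whitespace.
    if body.startswith(" ") and body.endswith(" ") and body.strip() != "":
        body = body[1:-1]
    return body

def first_code_span(text):
    """Return the normalized body of the leftmost code span, or None."""
    matches = [m for m in (SINGLE.search(text), DOUBLE.search(text)) if m]
    if not matches:
        return None
    leftmost = min(matches, key=lambda m: m.start())
    return normalize_code_span(leftmost.group(1))
```

Neither pattern accepts a run of three backticks as a delimiter, which mirrors the point that the two modes never steal each other's input.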
Deliberately not implemented, with skip.php entries explaining why:
351: code-span precedence over emphasis (*foo`*` expected to render as *foo<code>*</code>). Cross-positional: the single-pass lexer matches leftmost-first and cannot reject an earlier emphasis opener because a later backtick span would consume its closer. A proper fix would need a pre-scan pass; sort values only break ties at the same position.
353: the trailing " outside the code span gets converted to a curly quote by DokuWiki typography, diverging from spec HTML.
354: raw HTML tag pass-through; DokuWiki does not render raw HTML by default.
356: GFM angle-bracket autolink <http://…>: not implemented.
Per-mode unit tests cover basic matching, flanking via the length-boundary guards, interior-run support in the body, edge-space stripping, newline normalization, all-whitespace bodies, paragraph-boundary rejection, content-is-literal, and sort values. ModeRegistryTest's gating data provider picks up both modes.
Net effect on GfmSpecTest: twelve previously-red code-span examples now pass (339, 340, 341, 342, 344, 345, 346, 347, 349, 350, 357, 359: the simple pairs, edge-space, interior-run, newline-normalization, and mismatched-run cases). Four skipped. Three remain pending outside the code-span scope (emphasis interactions that need GfmLink once that lands).
|
| 2bb62bca | 20-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add GFM em-wrapping-strong modes for `***foo***` / `___foo___`
Two new inline formatting modes that render triple-delimiter runs as em wrapping strong:
GfmEmphasisStrong `***text***` → <em><strong>text</strong></em>
GfmEmphasisStrongUnderscore `___text___` → same (MD-preferred only)
Only the exact 3+3 symmetric case is handled. The other long-run and asymmetric variants (4+4, 5+5, `***foo**`, etc.) require CommonMark's stack-based delimiter-pairing algorithm with its flanking and multiple-of-3 rules, which is explicitly out of scope; those examples stay skipped in gfm-spec/skip.php.
Implementation notes:
* Patterns enforce exact 3+3 via `(?<!\*)` / `(?<!_)` lookbehinds (preventing entry at the second `*` of a `****...` run) and `(?!\*)` / `(?!_)` lookaheads after the closing triple (rejecting `***foo****` etc.). Combined with the existing non-whitespace adjacency lookaheads, all asymmetric cases cleanly fall through to other modes or stay literal.
* GfmEmphasisStrong overrides handle() to emit two instructions on entry (emphasis_open + strong_open) and two on exit (strong_close + emphasis_close). GfmEmphasisStrongUnderscore inherits that handler — only delimiters and word-boundary rules differ.
* Sort 65 — below Strong (70) and GfmEmphasis (80) so the em+strong modes win the lexer race for `***`/`___` runs. Underscore variant is MD-preferred-only, matching the existing gating of GfmEmphasisUnderscore and GfmStrongUnderscore.
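The exact 3+3 guards and the paired instructions can be sketched in Python as below. The regex is a simplified stand-in for the real entry and exit patterns, and carrying the body as a `cdata` tuple is an assumption made for illustration.

```python
import re

# Exactly three asterisks on each side: lookbehind/lookahead reject longer
# runs, and the [*\s] classes enforce non-whitespace adjacency.
TRIPLE_STAR = re.compile(r"(?<!\*)\*\*\*(?![*\s])(.+?)(?<![*\s])\*\*\*(?!\*)")

def em_strong_instructions(text):
    """Return the instruction list for the first ***text*** run, or None."""
    m = TRIPLE_STAR.search(text)
    if m is None:
        return None
    # Two instructions on entry, two on exit, nested em around strong.
    return ["emphasis_open", "strong_open",
            ("cdata", m.group(1)),
            "strong_close", "emphasis_close"]
```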
Per-mode unit tests cover basic matching, single-char bodies, whitespace flanking rejection, paragraph-boundary rejection, longer-run rejection, asymmetric rejection, multibyte intraword protection, and sort values. ModeRegistryTest's gating data provider picks up the two new rules.
Net effect on GfmSpecTest: example #476 (`***foo***`) now passes; 473/474/475/477 remain skipped as documented in skip.php.
|
| bcefb8ae | 20-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add GFM emphasis and underscore-delimited strong modes
Three new inline formatting modes for GitHub Flavored Markdown:
GfmEmphasis `*text*` → <em>
GfmEmphasisUnderscore `_text_` → <em> (MD-preferred only)
GfmStrongUnderscore `__text__` → <strong> (MD-preferred only)
All three emit the same handler instructions as DokuWiki's Emphasis / Strong, so existing renderers need no changes.
Design notes:
* Lexer mode names use snake_case (gfm_emphasis, gfm_emphasis_underscore, gfm_strong_underscore) to keep PascalCase readable at the class level. The asterisk variant emits `emphasis_open`/`emphasis_close` via the getInstructionName() hook, so DW's Emphasis (`//...//`) and GfmEmphasis (`*...*`) can coexist in mixed modes without a lexer state collision while still producing the same <em> output.
* Underscore variants gate on Markdown-preferred syntax (`markdown`, `md+dw`) because `__` otherwise means DW underline. GfmStrongUnderscore sorts at 70 (matching Strong) — below Underline at 90 — so when loaded it wins the lexer race for `__` runs. Underline is already gated out of MD-preferred modes in the previous commit.
* Entry patterns enforce the simplified CommonMark flanking rules already shared across DW inline modes (non-whitespace adjacency, no paragraph-boundary crossing) plus the word-boundary check for underscore variants using NO_WORD_BEFORE / NO_WORD_AFTER. The positive non-word-char enumeration makes them multibyte-safe without requiring the `u` flag: `für_etwas` and `пристаням_стремятся_` correctly stay literal.
Per-mode unit tests cover basic matching, single-char bodies, leading/trailing-whitespace rejection, empty-delimiter rejection, paragraph-boundary rejection, multibyte intraword protection, and sort values. ModeRegistryTest's gating data provider picks up the three new rules.
|
| 6b33ca93 | 20-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
add regex-primitive constants and getInstructionName() hook
Preparatory refactor for the upcoming GFM parser modes. No behaviour change for any existing mode: CONTENT_UNTIL_PARA still evaluates to the same regex (now factored through NOT_AT_PARA_BREAK), and getInstructionName() defaults to getModeName() so all current AbstractFormatting subclasses emit the same handler instructions as before.
AbstractMode gains four new shared regex constants:
NOT_AT_PARA_BREAK — zero-width assertion: current position is not the start of a paragraph break (blank line). Extracted from CONTENT_UNTIL_PARA for reuse in patterns that need a custom body char class.
NON_WORD_CHAR — char class: ASCII whitespace or ASCII punctuation except `_`. Multibyte-safe by construction: UTF-8 continuation bytes are >= 0x80 and thus fall outside every ASCII class, so checking positively that the surrounding context IS a non-word char correctly treats multibyte letters as word-like. No `u` flag required.
NO_WORD_BEFORE — zero-width: preceded by NON_WORD_CHAR or at start-of-input/line. For intraword-aware openers.
NO_WORD_AFTER — zero-width: followed by NON_WORD_CHAR or at end-of-input. Complement of NO_WORD_BEFORE.
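The multibyte-safety argument can be checked with a Python sketch that matches on UTF-8 bytes, standing in for PCRE without the `u` flag. The constant bodies below are reconstructions from the descriptions above, not the actual PHP constants, and the underscore pattern is a simplified illustration.

```python
import re

# ASCII whitespace or ASCII punctuation except '_' (byte-level class).
NON_WORD_CHAR = rb"[\s!-/:-@\[-^`{-~]"
NO_WORD_BEFORE = rb"(?:(?<=" + NON_WORD_CHAR + rb")|^)"
NO_WORD_AFTER = rb"(?=" + NON_WORD_CHAR + rb"|\Z)"

UNDERSCORE_EM = re.compile(
    NO_WORD_BEFORE + rb"_(?=\S)([^_]+?)(?<=\S)_" + NO_WORD_AFTER,
    re.M,
)

def underscore_emphasis(text):
    """Return the emphasized body as str, or None if nothing matches."""
    m = UNDERSCORE_EM.search(text.encode("utf-8"))
    return m.group(1).decode("utf-8") if m else None
```

The bytes of a multibyte letter are all 0x80 or above, so they never fall inside the ASCII class and the preceding letter counts as word-like.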
AbstractFormatting gains a getInstructionName() hook that defaults to getModeName(). Subclasses that want to emit handler instructions under a different name than their lexer mode name (so a Gfm mode can share DW's `emphasis_open`/`strong_open` instructions while registering its own lexer state) override this method.
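In Python terms the hook might look like the sketch below. The class and method names mirror the PHP ones in snake_case, while `open_instruction()` is a hypothetical helper that only shows where the hook is consulted.

```python
class AbstractFormatting:
    def get_mode_name(self):
        raise NotImplementedError

    def get_instruction_name(self):
        # Default: instructions are named after the lexer mode.
        return self.get_mode_name()

    def open_instruction(self):
        return self.get_instruction_name() + "_open"


class Emphasis(AbstractFormatting):
    def get_mode_name(self):
        return "emphasis"


class GfmEmphasis(AbstractFormatting):
    def get_mode_name(self):
        return "gfm_emphasis"  # separate lexer state

    def get_instruction_name(self):
        return "emphasis"  # shares DW's handler instructions
```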
|
| c3755410 | 20-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
require non-whitespace adjacency for inline formatting delimiters
An opening delimiter must now be followed by a non-whitespace character, and a closing delimiter must be preceded by one. Empty delimiter pairs (****, ____, '''', <sub></sub>, <sup></sup>, <del></del>) no longer match and stay literal.
Rationale: this matches Markdown's flanking-delimiter rules and eliminates accidental bolding of sequences like `** note**` at the start of a sentence. Well-formed uses (**bold**, //italic//, __underline__) are unchanged.
Affected modes: Strong, Emphasis, Underline, Monospace, Subscript, Superscript, Deleted.
BREAKING: content that was already malformed but previously rendered as formatted (e.g. `**foo bar **`) now stays literal.
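The adjacency rule for `**` can be illustrated with a short Python sketch. The pattern is a simplified stand-in for the Strong mode's entry and exit patterns, and the function name is hypothetical.

```python
import re

# Opener must be followed by non-whitespace; closer must be preceded by it.
STRONG = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")

def strong_body(text):
    """Return the bold body, or None when the text stays literal."""
    m = STRONG.search(text)
    return m.group(1) if m else None
```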
|
| 10fb3d65 | 20-Apr-2026 |
Andreas Gohr <gohr@cosmocode.de> |
prevent inline formatting from matching across paragraph boundaries
The Lexer compiles all patterns with the `s` (DOTALL) flag via ParallelRegex::getPerlMatchingFlags(), which makes `.` match newlines. Inline formatting modes use lookaheads like `\*\*(?=.*\*\*)` to verify a closing delimiter exists, so with DOTALL a lone `**` happily matched its "closer" many paragraphs later, swallowing blank lines into a single <strong> run.
Add CONTENT_UNTIL_PARA on AbstractMode — a regex snippet matching any character unless it would start a paragraph break (blank line, possibly with horizontal whitespace). Update all inline formatting entry patterns (Strong, Emphasis, Underline, Monospace, Subscript, Superscript, Deleted) to use it in their closing-delimiter lookaheads.
Emphasis also gets a real closing-`//` check; its previous lookahead just verified "content exists with a non-colon char" without requiring the closing delimiter at all.
Single newlines inside a delimiter pair still match (multi-line formatting); only blank lines end it.
BREAKING: You can no longer mark multiple paragraphs as bold or strike them out. On the other hand, this prevents accidentally breaking the page layout by missing a closing delimiter (as reported many times over the years), e.g. #1025 #3588 #1056
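The difference between the old and new lookaheads can be reproduced in Python with DOTALL enabled. The guarded snippet below is an assumed shape for CONTENT_UNTIL_PARA, reconstructed from the description rather than copied from the source.

```python
import re

# Old behaviour: with DOTALL, '.' crosses blank lines to find a closer.
NAIVE = re.compile(r"\*\*(?=.*\*\*)", re.S)

# Assumed shape: any character that does not start a paragraph break
# (a newline, optional horizontal whitespace, then another newline).
CONTENT_UNTIL_PARA = r"(?:(?!\n[ \t]*\n).)*"
GUARDED = re.compile(r"\*\*(?=" + CONTENT_UNTIL_PARA + r"\*\*)", re.S)

TEXT = "**open\n\nnext paragraph **bold**"

def first_opener(pattern, text):
    """Return the offset of the first accepted opener, or None."""
    m = pattern.search(text)
    return m.start() if m else None
```

With `TEXT`, the naive pattern accepts the stray `**` at offset 0, while the guarded one skips it and accepts the opener of `**bold**`.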
|
| 71096e46 | 18-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
move handler methods into ParserMode classes and rename Handler
Each ParserMode class now implements handle() from ModeInterface, containing the token handling logic that previously lived as individual methods on Doku_Handler.
The Handler class (formerly Doku_Handler) is the single dispatch point: Lexer passes tokens to Handler::handleToken() which routes to mode objects, plugins, or returns false. The Lexer only tokenizes and resolves mapHandler aliases.
Key changes:
- Add handle() to ModeInterface, implemented by all mode classes
- Move Doku_Handler to dokuwiki\Parsing\Handler namespace
- File extends Code (shared parsing via $type property)
- Quotes uses mapHandler() + Handler::getModeName() for sub-modes
- Media::parseMedia() replaces Doku_Handler_Parse_Media()
- Code::parseHighlightOptions() replaces parse_highlight_options()
- Per-parse state (footnote, doublequote) stays on Handler
- Deprecated wrappers kept for base/header/internallink/media
- Class alias and rector rules added for backward compatibility
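The dispatch order described above (mode objects, then plugins, then false) can be sketched in Python. The attribute names, the argument order, and the `StrongMode` stand-in are hypothetical; only the routing idea comes from the commit message.

```python
class StrongMode:
    """Stand-in for a ParserMode class; handle() holds the token logic."""

    def handle(self, match, state, pos, handler):
        handler.calls.append(("strong", state, match, pos))
        return True


class Handler:
    def __init__(self):
        self.modes = {}    # mode name -> mode object
        self.plugins = {}  # mode name -> plugin callable
        self.calls = []    # collected instructions

    def handle_token(self, name, match, state, pos):
        mode = self.modes.get(name)
        if mode is not None:
            return mode.handle(match, state, pos, self)
        plugin = self.plugins.get(name)
        if plugin is not None:
            return plugin(match, state, pos, self)
        # Nothing is registered for this token.
        return False
```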
|
| 7958e698 | 16-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
decouple hardcoded mode names in Eol and Preformatted
Eol.php hardcoded ['listblock', 'table'] as modes to skip, and Preformatted.php hardcoded [\*\-] as a negative lookahead for list markers. Both embed knowledge that belongs to the respective block modes, not to Eol/Preformatted. Adding a new block mode that handles its own EOL or uses different line start markers would require editing these unrelated files — a hidden coupling.
Listblock and Table now register themselves on ModeRegistry during preConnect(). Eol queries getBlockEolModes() and Preformatted queries getLineStartMarkers() to build its lookahead dynamically. Each mode owns its own data, and new block modes can participate without touching unrelated files.
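A Python sketch of the self-registration idea follows. The getter names mirror getBlockEolModes() and getLineStartMarkers(); the registration methods and the lookahead builder are hypothetical, since the commit message does not name them.

```python
import re

class ModeRegistry:
    def __init__(self):
        self._block_eol_modes = []
        self._line_start_markers = []

    def register_block_eol_mode(self, name):
        self._block_eol_modes.append(name)

    def register_line_start_marker(self, marker):
        self._line_start_markers.append(marker)

    def get_block_eol_modes(self):
        return list(self._block_eol_modes)

    def get_line_start_markers(self):
        return list(self._line_start_markers)


def preformatted_pattern(registry):
    """Build an indent pattern that skips lines owned by other block modes."""
    markers = "".join(re.escape(m) for m in registry.get_line_start_markers())
    lookahead = "(?![" + markers + "])" if markers else ""
    return re.compile(r"^  " + lookahead)
```

A new block mode would register its own marker during setup, and the Preformatted pattern would pick it up without any edit to Preformatted itself.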
|
| 1f443476 | 16-Apr-2026 |
Andreas Gohr <andi@splitbrain.org> |
split Formatting into individual classes per formatting type
Introduce AbstractFormatting as a base class and seven concrete classes (Strong, Emphasis, Underline, Monospace, Subscript, Superscript, Deleted) that each define their own patterns and sort order. Delete the old Formatting class and update tests to use the new classes directly. ModeRegistry now treats formatting modes as regular built-in modes.
|