GfmBacktickDouble.php - OpenGrok history log for /dokuwiki/inc/Parsing/ParserMode/GfmBacktickDouble.php

Revision	Date	Author	Comments

# 1c00c021

09-Jul-2026

Andreas Gohr <gohr@cosmocode.de>

fix(parser): validate inline formatting closers with a single memoized scan

The inline formatting modes only open a span when a valid closer exists
ahead. That check was a lookahead built on CONTENT_UNTIL_PARA, tested
character by character up to the next paragraph break and re-evaluated
from scratch for every opener candidate — openers times paragraph
length. With pcre.jit=0 a crafted 32KB page took 16s and an ordinary
34KB page with long paragraphs 37s; with the JIT on (the PHP default)
the per-character lookahead exhausted the JIT stack, the match silently
failed, and the formatting — or everything after it — rendered as plain
text.

The check also decided the wrong thing. It scanned raw text, so a closer
lookalike inside content the lexer consumes atomically — a nowiki or %%
span, a backtick code span, a link, a URL — counted as a real closer
even though the mode's exit pattern can never fire there. And it ignored
the enclosing span: an inner delimiter whose only closer lay past the
closer of the mode it sits in was entered anyway, so a stray delimiter
paired with one in a following sibling span and dragged the boundary
along — the `*` in ''glob/*.conf'' joined the `*` of the next ''...''
span and corrupted the paragraph; the same held for //, ** and __ inside
monospace, and an emphasis opened inside ((...)) ran past the footnote's
)) and the enclosing bold's **.

Each formatting mode now declares its closer through
Lexer::addCloserPattern(), mirroring addExitPattern(), and the lexer
answers "does a valid closer exist ahead" with one anchored possessive
scan per range instead of a lookahead per opener:

- The scan runs left to right from the opener, hopping over opaque spans
derived from already-registered patterns — a plain or special match is
consumed in one step, an entry into a verbatim mode (nowiki, the
backtick code spans) extends to that mode's first exit — so a closer
lookalike inside consumed content is never mistaken for a closer. Each
hop finds the earliest of boundary, closer, or opaque span in a single
leftmost search, keeping the check linear.
- An opener is rejected when the nearest enclosing mode that has a closer
of its own would close before the opener's own closer, so a delimiter
that can never close within its span stays literal. That ancestor is
found by walking the mode stack past modes that declare no closer
(plugins, footnotes); the nearest guarded ancestor suffices, as it was
itself validated against its own ancestor when it opened.
- Both verdicts are memoized and reset per parse() run: a proven closer
validates every earlier candidate, and a proven closer-free range
rejects every later candidate before the next boundary. With the lexer
consuming each opened span, the whole parse is linear in document size.
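
The hop-and-memoize scan described in the list above can be sketched roughly as follows. This is a minimal Python illustration, not DokuWiki's PHP code: the class name `CloserScan`, its fields, and the sample patterns are hypothetical, the opaque spans are supplied as one hand-written regex instead of being derived from registered lexer patterns, and the enclosing-ancestor rule is left out.

```python
import re


class CloserScan:
    """Answers "does a valid closer exist ahead?" for one formatting mode."""

    def __init__(self, text, closer, boundary, opaque):
        self.text = text
        # One alternation, so each hop finds the earliest of boundary,
        # closer, or opaque span in a single leftmost search.
        self.hop = re.compile(
            f"(?P<boundary>{boundary})|(?P<closer>{closer})|(?P<opaque>{opaque})"
        )
        self.hit = None    # (scan_start, closer_pos): a closer was proven
        self.miss = None   # (scan_start, boundary_pos): proven closer-free
        self.scans = 0     # number of real scans, for demonstration only

    def has_closer(self, pos):
        """pos is the offset just after an opener candidate.

        Assumption: the caller never asks about offsets inside an opaque
        span, because the lexer consumes those spans atomically.
        """
        if self.hit is not None and self.hit[0] <= pos <= self.hit[1]:
            return True    # the proven closer still lies ahead of pos
        if self.miss is not None and self.miss[0] <= pos <= self.miss[1]:
            return False   # pos sits in a range already proven closer-free
        self.scans += 1
        start = pos
        while True:
            m = self.hop.search(self.text, pos)
            if m is None or m.group("boundary") is not None:
                end = len(self.text) if m is None else m.start()
                self.miss = (start, end)
                return False
            if m.group("closer") is not None:
                self.hit = (start, m.start())
                return True
            pos = m.end()  # opaque span: hop over it in one step


# The ** inside the %%...%% span is only a lookalike, so the first
# paragraph has no closer; the second paragraph has a real one.
text = "**a %%x ** y%% b\n\n**c** d"
scan = CloserScan(text, closer=r"\*\*", boundary=r"\n\n", opaque=r"%%.*?%%")
first = scan.has_closer(2)      # after the opener at offset 0
repeat = scan.has_closer(15)    # same paragraph, answered from the memo
second = scan.has_closer(20)    # after the opener at offset 18
```

In this sketch a raw-text lookahead would have accepted the first opener because of the `**` inside the nowiki span; hopping over the opaque span is what makes the verdict negative, and the memo answers the repeated question without a second scan.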

Closer patterns match the closing delimiter itself with flanking context
in lookarounds — the convention exit patterns already follow — so closer
positions compare exactly across modes and a closer directly after an
inner opener is seen. AbstractFormatting derives the closer from the exit
pattern and registers it with the paragraph break as the boundary,
preserving the rule that formatting never spans paragraphs; a mode with
other needs can pass a different boundary or none.

Footnote declares its )) as a closer rather than guarding its (( entry
with a (?=.*)) lookahead, so the footnote becomes a boundary the scan
sees and formatting inside it no longer pairs across the )); its closer
takes no paragraph boundary, as footnotes are block-level. GfmEmphasis
gains a closer pattern so single * emphasis is validated the same way,
while its entry lookahead still enforces CommonMark nearest-delimiter
pairing. GfmEmphasis and GfmStrong span bodies cannot contain their
delimiter, so their in-pattern lookaheads stay linear on their own; the
GFM backtick span bodies get deterministic alternatives with possessive
quantifiers, removing their per-character backtracking.

CONTENT_UNTIL_PARA is removed: any entry pattern built with it recreates
the quadratic scan. ParallelRegex gains escapePattern() so embedded
closer fragments follow the lexer's bare-parenthesis convention, reports
PREG_JIT_STACKLIMIT_ERROR so a future JIT exhaustion surfaces instead of
silently truncating, and no longer rewrites its registered patterns in
place while compiling the compound regex.

The adversarial 32KB page drops to 0.1s, the 37-second benign page to
0.1s, and a 128KB variant stays under 0.6s.

# 8ed75a23

22-Apr-2026

Andreas Gohr <gohr@cosmocode.de>

add GfmBacktickSingle / GfmBacktickDouble for GFM inline code spans

Two new inline formatting modes covering GFM code spans in their n=1
and n=2 forms:

GfmBacktickSingle `text` → <code>text</code>
GfmBacktickDouble ``text`` → <code>text</code>

Both emit monospace_open and monospace_close around an unformatted()
call (the same instruction shape as DokuWiki's two-single-quote pair
wrapping a nowiki span), so renderers that distinguish verbatim text
from plain cdata — metadata, indexer, non-XHTML backends — treat the
body as literal.

GfmBacktickDouble extends GfmBacktickSingle to reuse handle() and the
body-normalization helper; only the delimiter length and the body
character class differ. Both share sort 165 and gate on Markdown
being loaded.

Design notes:

* The lexer has no backreferences, so each length is its own mode.
Length-boundary guards (?<!`)...(?!`) on every opener and closer
ensure a run of two-or-more backticks is never read as an n=1
delimiter and a run of three-or-more is never read as n=2. The two
modes never steal each other's input regardless of registration
order — sort can't reach this kind of cross-position constraint.
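
The effect of the length-boundary guards can be checked in isolation. The sketch below uses Python's `re` module purely for illustration (the lexer itself is PHP/PCRE); the names `SINGLE`, `DOUBLE`, and `delimiter_positions` are hypothetical, while the two patterns are the guard form quoted above applied to a one-backtick and a two-backtick delimiter.

```python
import re

TICK = "`"

# Guarded delimiters: a run only counts when it is exactly the right length.
SINGLE = re.compile(r"(?<!`)`(?!`)")
DOUBLE = re.compile(r"(?<!`)``(?!`)")


def delimiter_positions(pattern, text):
    """Return the start offsets where the guarded delimiter matches."""
    return [m.start() for m in pattern.finditer(text)]


# One span of each length, plus a run of three that neither mode may claim.
sample = "a `b` ``c`` " + TICK * 3 + "d" + TICK * 3

single_hits = delimiter_positions(SINGLE, sample)
double_hits = delimiter_positions(DOUBLE, sample)
```

Here `single_hits` holds only the offsets of the two lone backticks around `b`, and `double_hits` only the two pairs around `c`; the triple runs match neither pattern, which is why registration order does not matter.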

* Edge-whitespace handling and newline normalization live in handle(),
not in the regex. On DOKU_LEXER_UNMATCHED the body is normalized:
1. CR/LF and LF become single spaces (GFM line-ending rule).
2. If the body starts and ends with a space and is not entirely
whitespace, one space is stripped from each end.
That produces the right GFM output for the tricky cases without
special-casing the entry pattern:
` ` → <code> </code> (all-whitespace, no strip)
` a` → <code> a</code> (asymmetric, no strip)
` `` ` → <code>``</code> (interior run-of-2 + strip)
``foo`bar`` → <code>foo`bar</code>
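
The two normalization steps are small enough to sketch directly. The following Python function is an illustration of the rule as stated in this message, not the PHP helper itself; the name `normalize_code_span_body` is hypothetical, and "entirely whitespace" is read here as "nothing left after stripping whitespace".

```python
def normalize_code_span_body(body: str) -> str:
    """Apply the two steps described above to a raw code-span body."""
    # Step 1: CR/LF and LF each become a single space.
    body = body.replace("\r\n", " ").replace("\n", " ")
    # Step 2: strip one space from each end, but only when both ends have
    # one and the body is not entirely whitespace.
    if body.startswith(" ") and body.endswith(" ") and body.strip() != "":
        body = body[1:-1]
    return body


# Bodies of the tricky cases listed above (delimiters already removed).
all_space = normalize_code_span_body(" ")
asymmetric = normalize_code_span_body(" a")
interior_run = normalize_code_span_body(" `` ")
lone_tick = normalize_code_span_body("foo`bar")
wrapped = normalize_code_span_body("foo\nbar")
```

Only one space is removed per side, so a body padded with two spaces on each end keeps one of them.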

* Body character classes admit exactly the runs that cannot be valid
closers for this mode's length: n=1 allows `[^`] | ``+`, n=2 allows
`[^`] | `(?!`)`. That is what lets a single-backtick span contain
a pair and a double-backtick span contain a lone backtick.
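
To see how the body classes and the length guards fit together, the sketch below assembles each span into a single Python regex. This is an assumption-laden simplification: the real lexer registers entry and exit patterns separately in PHP, the names `SINGLE_SPAN`, `DOUBLE_SPAN`, and `span_body` are hypothetical, and the possessive quantifiers added in the later revision are not shown.

```python
import re

# n=1: body is any non-backtick, or a run of two or more backticks.
SINGLE_SPAN = re.compile(r"(?<!`)`(?!`)((?:[^`]|``+)+)(?<!`)`(?!`)")

# n=2: body is any non-backtick, or a lone backtick not followed by another.
DOUBLE_SPAN = re.compile(r"(?<!`)``(?!`)((?:[^`]|`(?!`))+)(?<!`)``(?!`)")


def span_body(pattern, text):
    """Return the raw body of the first span found, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


# A single-backtick span containing a pair of backticks.
pair_inside = span_body(SINGLE_SPAN, "` `` `")
# A double-backtick span containing a lone backtick.
lone_inside = span_body(DOUBLE_SPAN, "``foo`bar``")
# Neither pattern accepts the other mode's delimiters.
cross_single = span_body(SINGLE_SPAN, "``x``")
cross_double = span_body(DOUBLE_SPAN, "`x`")
```

Because each body class excludes exactly the run that would close its own mode, the greedy body stops at the first valid closer without needing a lazy quantifier. The captured bodies are still raw, so ` `` ` yields the body with its edge spaces until handle() normalizes it.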

* allowedModes is empty — no other inline parsing runs inside a span.

Deliberately not implemented, with skip.php entries explaining why:

351 — code-span precedence over emphasis (*foo`*` expected to render
as *foo<code>*</code>). Cross-positional: the single-pass
lexer matches leftmost-first and cannot reject an earlier
emphasis opener because a later backtick span would consume
its closer. A proper fix would need a pre-scan pass; sort
values only break ties at the same position.
353 — the trailing " outside the code span gets converted to a
curly quote by DokuWiki typography, diverging from spec HTML.
354 — raw HTML tag pass-through; DokuWiki does not render raw HTML
by default.
356 — GFM angle-bracket autolink <http://…>: not implemented.

Per-mode unit tests cover basic matching, flanking via the length-
boundary guards, interior-run support in the body, edge-space
stripping, newline normalization, all-whitespace bodies, paragraph-
boundary rejection, content-is-literal, and sort values.
ModeRegistryTest's gating data provider picks up both modes.

Net effect on GfmSpecTest: twelve previously-red code-span examples
now pass (339, 340, 341, 342, 344, 345, 346, 347, 349, 350, 357, 359
— the simple pairs, edge-space, interior-run, newline-normalization,
and mismatched-run cases). Four skipped. Three remain pending outside
the code-span scope (emphasis interactions that need GfmLink once
that lands).