GfmEmphasisTest.php - OpenGrok history log for /dokuwiki/_test/tests/Parsing/ParserMode/GfmEmphasisTest.php

Revision	Date	Author	Comments
# 1c00c021	09-Jul-2026	Andreas Gohr <gohr@cosmocode.de>	fix(parser): validate inline formatting closers with a single memoized scan
The inline formatting modes only open a span when a valid closer exists ahead. That check was a lookahead built on CONTENT_UNTIL_PARA, tested character by character up to the next paragraph break and re-evaluated from scratch for every opener candidate, so its cost grew as openers times paragraph length. With pcre.jit=0 a crafted 32KB page took 16s and an ordinary 34KB page with long paragraphs 37s; with the JIT on (the PHP default) the per-character lookahead exhausted the JIT stack, the match silently failed, and the formatting, or everything after it, rendered as plain text.
The check also decided the wrong thing. It scanned raw text, so a closer lookalike inside content the lexer consumes atomically (a nowiki or %% span, a backtick code span, a link, a URL) counted as a real closer even though the mode's exit pattern can never fire there. And it ignored the enclosing span: an inner delimiter whose only closer lay past the closer of the mode it sits in was entered anyway, so a stray delimiter paired with one in a following sibling span and dragged the boundary along. The `` in ''glob/.conf'' joined the `` of the next ''...'' span and corrupted the paragraph; the same held for //, * and __ inside monospace, and an emphasis opened inside ((...)) ran past the footnote's )) and the enclosing bold's *.
Each formatting mode now declares its closer through Lexer::addCloserPattern(), mirroring addExitPattern(), and the lexer answers "does a valid closer exist ahead" with one anchored possessive scan per range instead of a lookahead per opener:
- The scan runs left to right from the opener, hopping over opaque spans derived from already-registered patterns: a plain or special match is consumed in one step, and an entry into a verbatim mode (nowiki, the backtick code spans) extends to that mode's first exit. A closer lookalike inside consumed content is therefore never mistaken for a closer. Each hop finds the earliest of boundary, closer, or opaque span in a single leftmost search, keeping the check linear.
- An opener is rejected when the nearest enclosing mode that has a closer of its own would close before the opener's own closer, so a delimiter that can never close within its span stays literal. That ancestor is found by walking the mode stack past modes that declare no closer (plugins, footnotes); the nearest guarded ancestor suffices, as it was itself validated against its own when it opened.
- Both verdicts are memoized and reset per parse() run: a proven closer validates every earlier candidate, and a proven closer-free range rejects every later candidate before the next boundary. With the lexer consuming each opened span, the whole parse is linear in document size.
Closer patterns match the closing delimiter itself with flanking context in lookarounds (the convention exit patterns already follow), so closer positions compare exactly across modes and a closer directly after an inner opener is seen. AbstractFormatting derives the closer from the exit pattern and registers it with the paragraph break as the boundary, preserving the rule that formatting never spans paragraphs; a mode with other needs can pass a different boundary or none.
Footnote declares its )) as a closer rather than guarding its (( entry with a (?=.)) lookahead, so the footnote becomes a boundary the scan sees and formatting inside it no longer pairs across the )); its closer takes no paragraph boundary, as footnotes are block-level. GfmEmphasis gains a closer pattern so single * emphasis is validated the same way, while its entry lookahead still enforces CommonMark nearest-delimiter pairing. GfmEmphasis and GfmStrong span bodies cannot contain their delimiter, so their in-pattern lookaheads stay linear on their own; the GFM backtick span bodies get deterministic alternatives with possessive quantifiers, removing their per-character backtracking. CONTENT_UNTIL_PARA is removed: any entry pattern built with it recreates the quadratic scan. ParallelRegex gains escapePattern() so embedded closer fragments follow the lexer's bare-parenthesis convention, reports PREG_JIT_STACKLIMIT_ERROR so a future JIT exhaustion surfaces instead of silently truncating, and no longer rewrites its registered patterns in place while compiling the compound regex. The adversarial 32KB page drops to 0.1s, the 37-second benign page to 0.1s, and a 128KB variant stays under 0.6s.
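The scan and its two memoized verdicts described in this commit can be sketched roughly as follows. This is an illustration in Python, not DokuWiki's PHP lexer: the class name `CloserScan`, its constructor arguments, and the `scans` counter are all hypothetical, and the enclosing-ancestor check is left out entirely.

```python
import re


class CloserScan:
    """Answers "is there a valid closer ahead of this offset?" with memoized scans.

    Hypothetical sketch: the names and structure are assumptions, not the real API.
    """

    def __init__(self, text, closer, boundary, opaque):
        self.text = text
        # One alternation, so each hop is a single leftmost search.
        self.hop = re.compile(
            f"(?P<boundary>{boundary})|(?P<opaque>{opaque})|(?P<closer>{closer})"
        )
        self.valid = (0, 0)    # [scan start, closer offset): candidates here have a closer
        self.barren = (0, 0)   # [scan start, boundary offset): candidates here have none
        self.scans = 0         # fresh scans performed, to make the memoization visible

    def has_closer(self, pos):
        """pos is the offset just past an opener candidate."""
        if self.valid[0] <= pos < self.valid[1]:
            return True        # an earlier scan already proved a closer ahead
        if self.barren[0] <= pos < self.barren[1]:
            return False       # an earlier scan proved this range closer-free
        self.scans += 1
        i = pos
        while True:
            m = self.hop.search(self.text, i)
            if m is None or m.lastgroup == "boundary":
                end = len(self.text) if m is None else m.start()
                self.barren = (pos, end)
                return False
            if m.lastgroup == "closer":
                self.valid = (pos, m.start())
                return True
            # Opaque span: skip it whole so lookalikes inside are never seen.
            i = max(m.end(), m.start() + 1)
```

With a closer such as `(?<=\S)\*\*`, a paragraph break as the boundary, and `%%.*?%%` as the only opaque span, two candidates in the same paragraph share one scan, and a `**` inside a `%%...%%` span is not counted as a closer.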
# 47a02a10	04-Jun-2026	Andreas Gohr <gohr@cosmocode.de>	Parsing: make parse syntax a per-parse value, drop ModeInterface
The active parse's syntax flavour is a per-parse question, not process-global state: within a single request a plugin can render bundled DokuWiki-syntax text inside an otherwise-Markdown page. Yet ModeRegistry was a singleton that read $conf['syntax'] and the $PARSER_MODES global, and every mode reached it through ModeRegistry::getInstance(), so the flavour lived in shared mutable state that two parses in one request would fight over. Make the registry a short-lived value instead:
- ModeRegistry is constructed once per parse with an explicit $syntax and injected into Parser, Handler and every mode. getSyntax() / isDwPreferred() / isMdPreferred() consult $this->syntax; the DOKU_UNITTEST-gated mode-list cache hack is gone (each registry is fresh, nothing to invalidate).
- p_get_instructions() is now the single place in the pipeline where $conf['syntax'] is read; from there the flavour travels as a parameter. No code under inc/Parsing/ reads $conf['syntax'] directly anymore; the five syntax-reading modes (Preformatted, GfmHr, GfmEscape, Externallink, GfmQuote) route through $this->registry.
Keep the two concepts apart, as documented in the ModeRegistry and AbstractMode docblocks: the user's configured preference stays in $conf['syntax'] for UI code (toolbar, settings), while the active parse's syntax is a parameter carried by the registry. $PARSER_MODES is demoted to a deprecated, read-only mirror, published during loadPluginModes(): third-party syntax plugins (columnlist, alphalist2, phpwikify, skipentity) and the bundled info plugin read the global directly, often from their constructors, so the taxonomy must stay visible there.
No core code reads the mirror. Fold ModeInterface into AbstractMode while here: getSort()/handle() are abstract, the connect callbacks carry defaults, and the public $Lexer "FIXME should be done by setter" becomes setLexer()/getLexer() injected by Parser::addMode() alongside the registry. Nested-content resolution moves to the allowedCategories()/filterAllowedModes() hooks, resolved once when the registry is attached. Tests build their own parser/registry through ParserTestBase::setSyntax() instead of mutating $conf and calling the removed ModeRegistry::reset().
# 13a62f81	04-May-2026	Andreas Gohr <andi@splitbrain.org>	rename syntax flavors 'dokuwiki' / 'markdown' to 'dw' / 'md' Symmetry with the existing 'dw+md' / 'md+dw' setting values.
# 685560eb	28-Apr-2026	Andreas Gohr <andi@splitbrain.org>	add GfmListblock for GFM lists
GfmListblock captures an entire list block atomically with one addSpecialPattern match, then walks the captured text in handle() grouping lines into items. Each item's body is dedented to its content column and parsed by ModeRegistry::getSubParser() so block content (paragraphs, fenced code, blockquotes, plugin blocks) works inside items uniformly. Sub-parsed calls are wrapped in a Nest call before they reach the outer handler, matching the Footnote pattern: the main handler's Block rewriter treats nest as opaque and the renderer base class unwraps it transparently, so multi-paragraph items don't get double-wrapped in <p>.
Marker syntax: -, *, + (unordered) or 1-9 digits followed by . or ) (ordered). Indentation is a 2-space-multiple step starting at 0; depth = (indent / 2) + 1, odd indents round down, tabs become two spaces. The first ordered item's number drives the start attribute on <ol> via the listo_open $start parameter. GfmLists subclasses AbstractListsRewriter with the GFM marker parser; the state machine on the base class is shared with DW Lists.
GfmListblock loads only when $conf['syntax'] is markdown or md+dw. Under those settings the DW Listblock is suppressed because the two list models conflict: DW's mandatory 2-space indent rule vs GFM's zero-indent top-level rule, and -/*/+ markers shared. Plugins that relied on Listblock loading under md+dw will see it absent there. Sub-parser exclusion set: CATEGORY_BASEONLY (no Header inside list items) and gfm_listblock itself (defensive guard against re-entry on pathological inputs; nested lists are handled by the outer pattern, not by re-entry).
Tests cover marker variants, ordered start numbers, nested lists at two and three levels, inline formatting inside items, marker-character switches keeping one list, type switches splitting the list, fenced code inside items, multi-paragraph (loose) items, and two regressions on blank-line tolerance inside the captured block. SpecCompatRenderer learns to render the list call sequence, and spec.txt tests for digit/marker-width/lazy-continuation behavior that GfmListblock deliberately doesn't implement are documented in gfm-spec/skip.php with the per-bucket reasons (A-F). Drops two now-obsolete entries from skip.php (image escapes that land via earlier GfmLink/GfmMedia work) and inlines the Setext explanation that previously pointed at SPEC.md. Replaces the SPEC.md reference in GfmEmphasisTest with the inline reason.
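The indentation rule stated in this commit (depth = (indent / 2) + 1, odd indents round down, tabs count as two spaces) works out as in the small sketch below. It is written in Python for illustration only; `list_depth` is a hypothetical helper, not a function from GfmListblock.

```python
def list_depth(line: str) -> int:
    """Nesting depth of a list line, derived from its leading whitespace."""
    indent = 0
    for ch in line:
        if ch == " ":
            indent += 1
        elif ch == "\t":
            indent += 2  # a tab counts as two spaces
        else:
            break
    # Integer division rounds odd indents down to the previous 2-space step.
    return indent // 2 + 1
```

A zero-indent item is depth 1, two or three leading spaces give depth 2, and four give depth 3.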
# 0244be5c	21-Apr-2026	Andreas Gohr <gohr@cosmocode.de>	add GfmDeleted mode for GFM strikethrough (`~~text~~`)
Shares the deleted_open/deleted_close instructions with DW's <del> mode. Entry/exit anchors `(?<!~)` / `(?!~)` reject runs of three or more tildes so fenced-code markers remain untouched. Also trim redundant class-level docblocks on sibling Gfm test files.
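The effect of the `(?<!~)` / `(?!~)` anchors can be seen in isolation with a short Python sketch. `TILDE_PAIR` and `tilde_pairs` are hypothetical names, and the pattern is a simplified stand-in: the real entry and exit patterns carry more context than the two anchors shown here.

```python
import re

# Simplified stand-in: exactly two tildes, not part of a longer run.
TILDE_PAIR = re.compile(r"(?<!~)~~(?!~)")


def tilde_pairs(text: str) -> list[int]:
    """Offsets of every run of exactly two tildes in text."""
    return [m.start() for m in TILDE_PAIR.finditer(text)]
```

For `a ~~b~~ c` both delimiters are found, while a `~~~` fence line yields no match at any offset, because every candidate pair has a tilde either before or after it.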
# bcefb8ae	20-Apr-2026	Andreas Gohr <gohr@cosmocode.de>	add GFM emphasis and underscore-delimited strong modes
Three new inline formatting modes for GitHub Flavored Markdown:
GfmEmphasis `*text*` → <em>
GfmEmphasisUnderscore `_text_` → <em> (MD-preferred only)
GfmStrongUnderscore `__text__` → <strong> (MD-preferred only)
All three emit the same handler instructions as DokuWiki's Emphasis / Strong, so existing renderers need no changes. Design notes:
* Lexer mode names use snake_case (gfm_emphasis, gfm_emphasis_underscore, gfm_strong_underscore) to keep PascalCase readable at the class level. The asterisk variant emits `emphasis_open`/`emphasis_close` via the getInstructionName() hook, so DW's Emphasis (`//...//`) and GfmEmphasis (`*...*`) can coexist in mixed modes without a lexer state collision while still producing the same <em> output.
* Underscore variants gate on Markdown-preferred syntax (`markdown`, `md+dw`) because `__` otherwise means DW underline. GfmStrongUnderscore sorts at 70 (matching Strong), below Underline at 90, so when loaded it wins the lexer race for `__` runs. Underline is already gated out of MD-preferred modes in the previous commit.
* Entry patterns enforce the simplified CommonMark flanking rules already shared across DW inline modes (non-whitespace adjacency, no paragraph-boundary crossing) plus the word-boundary check for underscore variants using NO_WORD_BEFORE / NO_WORD_AFTER. The positive non-word-char enumeration makes them multibyte-safe without requiring the `u` flag: `für_etwas` and `пристаням_стремятся_` correctly stay literal.
Per-mode unit tests cover basic matching, single-char bodies, leading/trailing-whitespace rejection, empty-delimiter rejection, paragraph-boundary rejection, multibyte intraword protection, and sort values.
ModeRegistryTest's gating data provider picks up the three new rules.