<?xml version="1.0" encoding="utf-8"?>

<overlay xmlns="http://hoa-project.net/xyl/xylophone">
<yield id="chapter">

  <p><strong>Compilers</strong> allow us to <strong>analyze</strong> and
  <strong>manipulate textual</strong> data. There are numerous use cases.
  <code>Hoa\Compiler</code> offers several compilers, to be chosen based on
  your needs.</p>

  <h2 id="Table_of_contents">Table of contents</h2>

  <tableofcontents id="main-toc" />

  <h2 id="Introduction" for="main-toc">Introduction</h2>

  <blockquote cite="https://en.wikipedia.org/wiki/Nicolas_Boileau-Despr%C3%A9aux">What
  is conceived well is expressed clearly, and the right words come
  easily.</blockquote>
  <p>A <strong>language</strong> is a way to express or to
  <strong>formulate</strong> a <strong>solution</strong> to a
  <strong>problem</strong>. And there are a lot of problems. We read and
  write in several languages daily, and some of these languages are
  <strong>understood</strong> by <strong>machines</strong>. This operation is
  possible thanks to <strong>compilers</strong>.</p>
  <p>The theory of
  <a href="https://en.wikipedia.org/wiki/Formal_language">formal
  languages</a> covers, among other things, the <strong>automatic
  analysis</strong> of those languages through tools like
  <strong>automata</strong> or <strong>grammars</strong>. A detailed
  explanation would be needed to properly cover all these concepts; however,
  we are going to simplify as much as necessary to allow a correct usage of
  <code>Hoa\Compiler</code>.</p>

  <h3 id="Language_and_grammar" for="main-toc">Language and grammar</h3>

  <p>A <strong>language</strong> is a set of <strong>words</strong>. Each word
  is a <strong>sequence</strong> of <strong>symbols</strong> belonging to an
  <strong>alphabet</strong>. A symbol represents the smallest <strong>lexical
  unit</strong> of a language; it is atomic, and we will call it a
  <strong>lexeme</strong> (also often referred to as a token).
These sequences of
  lexemes representing words are built with <strong>rules</strong>. From a word
  and a root rule, we will try to <strong>derive</strong> its sub-rules. If a
  derivation exists, then the word is considered <strong>valid</strong>,
  otherwise it is considered <strong>invalid</strong>. We also speak about
  <strong>matching</strong> words. For example, consider the following
  rules:</p>
  <pre><code>   exp ::= exp + exp
         | number
number ::= digit number
         | digit
 digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9</code></pre>
  <p>The word we are going to match is <code>7 + 35</code>. The root rule
  is <code><em>exp</em></code>. If we derive (from left to right and
  from top to bottom), we can have
  <code><em>exp</em> + <em>exp</em></code> or <code><em>number</em></code> (the
  <strong>disjunction</strong>, i.e. the “or”, is represented by the
  “<code>|</code>” symbol):</p>
  <pre><code>exp + exp | number
→ exp + exp
→ ( exp + exp | number ) + exp
→ number + exp
→ ( digit number | digit ) + exp</code></pre>
  <p>We continue to derive until we have <strong>eliminated</strong> every
  rule, in order to keep only <strong>lexemes</strong>:</p>
  <pre><code>…
→ ( digit number | digit ) + exp
→ digit + exp
→ ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) + exp
→ 7 + exp
→ 7 + ( exp + exp | number )
→ 7 + number
→ 7 + ( digit number | digit )
→ 7 + digit number
→ 7 + ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) number
→ 7 + 3 number
→ 7 + 3 ( digit number | digit )
→ 7 + 3 digit
→ 7 + 3 ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
→ 7 + 3 5</code></pre>
  <p>A derivation exists to match the word <code>7 + 35</code>, thus it is a
  valid word for those rules.</p>
  <p>A set of rules is called a <strong>grammar</strong>. And so, a grammar
  represents a <strong>language</strong>!</p>
  <p>However, there are different kinds of grammars.
In 1956, the
  <a href="https://en.wikipedia.org/wiki/Chomsky_hierarchy">Chomsky
  hierarchy</a> was formulated, classifying grammars in four
  <strong>levels</strong>:</p>
  <ol>
    <li><strong>unrestricted</strong> grammars, matching so-called recursively
    enumerable languages (the languages recognized by Turing machines); rules
    have no restriction,</li>
    <li><strong>context-sensitive</strong> grammars, matching contextual
    languages,</li>
    <li><strong>context-free</strong> grammars, matching algebraic languages,
    recognized by pushdown automata,</li>
    <li><strong>regular</strong> grammars, matching regular languages.</li>
  </ol>
  <p>Each level includes the next level. <code>Hoa\Compiler</code> only treats
  languages defined by grammars of levels 3 and 4. To give a quick idea,
  regular grammars can be related to
  <a href="https://en.wikipedia.org/wiki/Regular_expression">regular
  expressions</a> (like <a href="http://pcre.org/">PCRE</a>), well known to
  developers. But regular grammars are not able to match a <strong>pair of
  symbols</strong> (like parentheses, braces or quotes), while algebraic
  languages allow that (thanks to the concept of a stack of lexemes).</p>

  <h3 id="Matching_words" for="main-toc">Matching words</h3>

  <div id="parsers" class="schema"></div>
  <script>
  Hoa.Document.onReady(function ( ) {

      var paper    = Hoa.Graph(Hoa.$('#parsers'), 800, 180);
      var grid     = paper.grid(0, 0, 800, 180, 5, 2);
      var word     = grid.push(paper.rect(0, 0, 140, 80, 3, 'word'), 0, 0);
      var sequence = grid.push(paper.rect(0, 0, 140, 80, 3, 'sequence'), 2, 0);
      var trace    = grid.push(paper.rect(0, 0, 140, 80, 3, 'result'), 4, 0);
      grid.push(paper.rect(0, 0, 140, 50, 3, 'abcdef'), 0, 1);
      grid.push(paper.rect(0, 0, 380, 50, 3, '[[a ⟼ …], [bc ⟼ …], [d ⟼ …], [ef ⟼ …]]'), 2, 1);
      grid.push(paper.rect(0, 0, 140, 50, 3, 'valid/invalid'), 4, 1);

      paper.link.between(word, sequence, 'lexical analyzer');
      paper.link.between(sequence, trace,
'syntactic analyzer');
  });
  </script>
  <p>In general, the compilation process begins with two
  <strong>analyses</strong>: <strong>lexical</strong> and
  <strong>syntactic</strong>. A lexical analysis <strong>splits</strong> a word
  into a <strong>sequence of lexemes</strong>. This sequence will thereafter be
  used by the syntactic analyzer in order to verify that the word
  <strong>belongs</strong> to the language.</p>
  <p>Depending on the grammar, the matching will not be done in the same way,
  but the principle stays the same: consuming lexemes one after the other in
  the sequence, and verifying that they allow <strong>going forward</strong>
  in the <strong>derivation</strong> of the grammar's rules.</p>
  <p>Syntactic analyzers are also classified into categories: LL, LR, LALR,
  etc. <code>Hoa\Compiler</code> only offers LL syntactic analyzers, which
  stands for Left-to-right Leftmost derivation, i.e. from the topmost rule to
  the deepest one, with rules derived from left to right. Here again, there
  are sub-categories, two of which are supported by <code>Hoa\Compiler</code>:
  LL(1) and LL(*). Generally speaking, we talk about LL(<em>k</em>) syntactic
  analyzers: if a lexeme does not allow a rule to be derived as expected, then
  the analyzer is allowed to <strong>go back</strong> <em>k</em> steps
  backward; we also speak about backtracking. Put another way, rules can be
  <strong>ambiguous</strong>: each time we derive a rule of the grammar, we
  may have several possible choices and the analyzer can be wrong; this is why
  it sometimes needs to backtrack. The <em>k</em> variable defines the
  ambiguity <strong>level</strong>. If a grammar can be analyzed by an LL(1)
  syntactic analyzer, it is known as <strong>unambiguous</strong>: each time a
  lexeme is used to derive our rules, there is only one choice. And if we have
  an LL(*) syntactic analyzer, it means that the <em>k</em> variable is
  <strong>undefined</strong>.
The following
  example illustrates an unambiguous grammar:</p>
  <pre><code>rule ::= a b c | d e f</code></pre>
  <p>And this example illustrates an ambiguous grammar:</p>
  <pre><code>rule1 ::= a rule2
rule2 ::= b rule3 | b rule4
rule3 ::= c d
rule4 ::= e f</code></pre>
  <p>Let's see what happens when trying to find a derivation for the word
  <code>abef</code> from the root rule <code>rule1</code>:</p>
  <pre><code>rule1
→ a rule2                    <em>   a bef ✔</em>
  → a (b rule3 | b rule4)    <em>   a bef</em>
  → a b rule3                <em>  ab ef ✔</em>
    → a b c d                <em> abe f ✖</em>
  ← a b rule3                <em>  ab ef ✖</em>
  ← a (b rule3 | b rule4)    <em>   a bef</em>
  → a b rule4                <em>  ab ef ✔</em>
    → a b e f                <em>abef ✔</em></code></pre>
  <p>The <code>rule2</code> rule is ambiguous, which can lead to a bad
  derivation and consequently to a backtrack.</p>

  <h2 id="LLk_compiler-compiler" for="main-toc">LL(<em>k</em>)
  compiler-compiler</h2>

  <p>Writing compilers is a <strong>laborious</strong> task. It is not always
  difficult, but most of the time it is repetitive and time-consuming. This is
  why there exist <strong>compilers of compilers</strong>, or in other words,
  compiler generators. Most of the time, these compiler-compilers use an
  <strong>intermediate</strong> language to write a grammar. The
  <code>Hoa\Compiler</code> library provides the
  <code>Hoa\Compiler\Llk\Llk</code> class, which allows the writing of
  compilers through a <strong>dedicated</strong> language.</p>

  <h3 id="PP_language" for="main-toc">PP language</h3>

  <p>The PP language, standing for <em>PHP Parser</em>, allows
  <strong>algebraic grammars</strong> to be expressed.
It is written in files having the
  <code>.pp</code> extension (please see the
  <a href="@central_resource:path=Library/Compiler/.Mime"><code>hoa://Library/Compiler/.Mime</code></a>
  file).</p>
  <p>A grammar is composed of <strong>lexemes</strong> and
  <strong>rules</strong>. The declaration of a lexeme has the following syntax:
  <code>%token <em>namespace_in</em>:<em>name</em> <em>value</em> ->
  <em>namespace_out</em></code>, where <code><em>name</em></code> represents
  the <strong>name</strong> of the lexeme, <code><em>value</em></code>
  represents its <strong>value</strong>, expressed in the
  <a href="http://pcre.org/">PCRE</a> format (take care not to match an empty
  value, in which case an exception will be thrown), and
  <code><em>namespace_in</em></code> and <code><em>namespace_out</em></code>
  represent the names of <strong>namespaces</strong> and are optional (the
  default name is <code>default</code>). For example, <code>number</code>,
  describing a number composed of digits from <code>0</code> to
  <code>9</code>:</p>
  <pre><code class="language-pp">%token number \d+</code></pre>
  <p>Namespaces represent disjoint <strong>subsets</strong> of lexemes, used
  to <strong>ease</strong> analyses. A <code>%skip</code> declaration is
  similar to <code>%token</code> except that it represents a lexeme to
  <strong>skip</strong>, i.e. not to consider. A common example of
  <code>%skip</code> lexemes are spaces:</p>
  <pre><code class="language-pp">%skip space \s</code></pre>
  <p>To explain rules, we will use the <code>Json.pp</code> grammar as an
  example, which is a slightly <strong>simplified</strong> version of the
  <a href="http://json.org/">JSON language</a> (please see
  <a href="https://tools.ietf.org/html/rfc4627">RFC 4627</a>).
The
  <strong>complete</strong> grammar is located in the
  <a href="@central_resource:path=Library/Json/Grammar.pp"><code>hoa://Library/Json/Grammar.pp</code></a>
  file. Thus:</p>
  <pre><code class="language-pp">%skip   space          \s
// Scalars.
%token  true           true
%token  false          false
%token  null           null
// Strings.
%token  quote_         "        -> string
%token  string:string  [^"]+
%token  string:_quote  "        -> default
// Objects.
%token  brace_         {
%token  _brace         }
// Arrays.
%token  bracket_       \[
%token  _bracket       \]
// Rest.
%token  colon          :
%token  comma          ,
%token  number         \d+

value:
    &lt;true> | &lt;false> | &lt;null> | string() | object() | array() | number()

string:
    ::quote_:: &lt;string> ::_quote::

number:
    &lt;number>

#object:
    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::

#pair:
    string() ::colon:: value()

#array:
    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::</code></pre>
  <p>Notice that we have two namespaces of lexemes: <code>default</code> and
  <code>string</code> (this avoids <strong>confusion</strong> between the
  <code>quote_</code> and <code>string:_quote</code> lexemes, which have the
  same representation). We also notice the <code>value</code> rule, which is a
  <strong>disjunction</strong> of several lexemes and rules.
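</p>
  <p>To make this more concrete, here is a sketch of how the lexical analyzer
  would split the word <code>{"foo": 42}</code> with these lexemes (the
  presentation here is informal; the real output format of the lexer is shown
  later):</p>
  <pre><code>{      brace_    in default
"      quote_    in default  (jumps to the string namespace)
foo    string    in string
"      _quote    in string   (jumps back to the default namespace)
:      colon     in default
42     number    in default
}      _brace    in default</code></pre>
  <p>The space between <code>:</code> and <code>42</code> is simply skipped,
  thanks to the <code>%skip space</code> declaration.</p>
  <p>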
The
  <strong>constructions</strong> of the PP language are the following:</p>
  <ul>
    <li><code><em>rule</em>()</code> to <strong>call</strong> a rule,</li>
    <li><code>&lt;<em>token</em>></code> and <code>::<em>token</em>::</code>
    to <strong>declare</strong> a lexeme (namespaces do not appear here),</li>
    <li><code>|</code> for a <strong>disjunction</strong> (a choice),</li>
    <li><code>(<em>…</em>)</code> for a group,</li>
    <li><code><em>e</em>?</code> to say that <code><em>e</em></code> is
    <strong>optional</strong>,</li>
    <li><code><em>e</em>+</code> to say that <code><em>e</em></code> can be
    present <strong>1 or more</strong> times,</li>
    <li><code><em>e</em>*</code> to say that <code><em>e</em></code> can be
    present <strong>0 or more</strong> times,</li>
    <li><code><em>e</em>{<em>x</em>,<em>y</em>}</code> to say that
    <code><em>e</em></code> can be present between <code><em>x</em></code> and
    <code><em>y</em></code> times,</li>
    <li><code>#<em>node</em></code> to create a <strong>node</strong>
    <code>node</code> in the resulting tree,</li>
    <li><code><em>token</em>[<em>i</em>]</code> to <strong>unify</strong>
    lexemes with each other.</li>
  </ul>
  <p>These constructions are few, but largely sufficient.</p>
  <p>Finally, the grammar of the PP language is written… with the PP language!
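</p>
  <p>As an illustration, here is a small hypothetical grammar (it belongs
  neither to <code>Json.pp</code> nor to the PP grammar itself) combining the
  optional and repetition constructions: an integer is an optional minus sign
  followed by one to ten digits:</p>
  <pre><code class="language-pp">%token minus -
%token digit \d

#integer:
    &lt;minus>? &lt;digit>{1,10}</code></pre>
  <p>Both lexemes are declared with the <code>&lt;<em>token</em>></code>
  syntax, so they will appear in the resulting tree, grouped under an
  <code>integer</code> node.</p>
  <p>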
You can find this grammar in the
  <a href="@central_resource:path=Library/Compiler/Llk/Llk.pp"><code>hoa://Library/Compiler/Llk/Llk.pp</code></a>
  file.</p>

  <h3 id="Compilation_process" for="main-toc">Compilation process</h3>

  <div id="overview" class="schema"></div>
  <script>
  Hoa.Document.onReady(function ( ) {

      var paper = Hoa.Graph(Hoa.$('#overview'), 800, 360);
      var flow  = paper.grid(0, 0, 800, 360, 1, 4);
      flow.push(paper.rect(0, 0, 200, 70, 3, 'lexical analyzer'));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'syntactic analyzer'));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'trace'));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'AST'));

      var annot = paper.grid(180, 0, 80, 360, 1, 4);
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '1'));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '2'));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '3'));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '4'));
  });
  </script>
  <p>The compilation process used by <code>Hoa\Compiler\Llk\Llk</code> is
  classical: it starts by <strong>lexically</strong> analyzing the textual
  data, the word, i.e. transforming our data into a sequence of lexemes. The
  declaration <strong>order</strong> of the lexemes is essential, because the
  lexical analyzer will try them one after the other. Next, the
  <strong>syntactic</strong> analyzer comes into play in order to
  <strong>match</strong> our data.</p>
  <p>If the syntactic analysis is successful, we obtain a
  <strong>trace</strong>. This trace can be transformed into an AST, which
  stands for Abstract Syntax Tree. This tree represents our textual data after
  the analysis. One advantage is that it can be visited (we will detail this
  part later), which allows us to add new <strong>constraints</strong> that
  cannot be expressed in the grammar, such as type verification.</p>
  <p>Let's manipulate <code>Hoa\Compiler\Llk\Llk</code> a little bit.
This class
  is a <strong>helper</strong> to easily read a grammar expressed in the PP
  format. Its static <code>load</code> method takes only one argument: an
  input stream that points to the grammar. It returns a
  <code>Hoa\Compiler\Llk\Parser</code> compiler ready to use. On this
  compiler, we will call the <code>Hoa\Compiler\Llk\Parser::parse</code>
  method to analyze JSON data. If the data is correct, we will get an AST,
  otherwise an exception will be thrown. Finally, we will use the
  <code>Hoa\Compiler\Visitor\Dump</code> visitor to print our AST:</p>
  <pre><code class="language-php">// 1. Load the grammar.
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 2. Parse data.
$ast = $compiler->parse('{"foo": true, "bar": [null, 42]}');

// 3. Dump the AST.
$dump = new Hoa\Compiler\Visitor\Dump();
echo $dump->visit($ast);

/**
 * Will output:
 *     > #object
 *     > > #pair
 *     > > > token(string:string, foo)
 *     > > > token(true, true)
 *     > > #pair
 *     > > > token(string:string, bar)
 *     > > > #array
 *     > > > > token(null, null)
 *     > > > > token(number, 42)
 */</code></pre>
  <p>When we write and test a grammar, we will repeat these three tasks very
  <strong>often</strong>. This is why the <code>hoa</code> script provides the
  <code>compiler:pp</code> command. This command proposes to analyze data
  based on a grammar and, if needed, to apply a visitor on the resulting AST.
Notice that the grammar can be read through a
  <a href="https://en.wikipedia.org/wiki/Pipeline_(Unix)">pipe</a>:</p>
  <pre><code class="language-shell">$ echo '[1, [1, [2, 3], 5], 8]' | hoa compiler:pp Json.pp 0 --visitor dump
> #array
> > token(number, 1)
> > #array
> > > token(number, 1)
> > > #array
> > > > token(number, 2)
> > > > token(number, 3)
> > > token(number, 5)
> > token(number, 8)</code></pre>
  <p>This is a useful way to <strong>quickly</strong> test data against a
  grammar. Do not hesitate to read the help of the <code>compiler:pp</code>
  command to get more details!</p>
  <p>The analysis starts with the <strong>root</strong> rule, but we can use
  <strong>another rule</strong> thanks to the second argument of the
  <code>Hoa\Compiler\Llk\Parser::parse</code> method:</p>
  <pre><code class="language-php">$compiler->parse('{"foo": true, "bar": [null, 42]}', 'object');</code></pre>
  <p>To use the root rule, we have to use <code>null</code>.</p>
  <p>And finally, to know whether data is valid or not without generating the
  AST, we can set the last argument of this method to <code>false</code>:</p>
  <pre><code class="language-php">$valid = $compiler->parse('{"foo": true, "bar": [null, 42]}', null, false);
var_dump($valid);

/**
 * Will output:
 *     bool(true)
 */</code></pre>

  <h4 id="Errors" for="main-toc">Errors</h4>

  <p>Compilation errors are represented by exceptions, thus:</p>
  <pre><code class="language-shell">$ echo '{"foo" true}' | hoa compiler:pp Json.pp 0 --visitor dump
Uncaught exception (Hoa\Compiler\Exception\UnexpectedToken):
Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "true" (true) at line 1
and column 8:
{"foo" true}
       ↑
in hoa://Library/Compiler/Llk/Parser.php at line 1</code></pre>
  <p>Several exceptions can be thrown according to the context:</p>
  <ul>
    <li>during the
<strong>lexical</strong> analysis,
    <code>Hoa\Compiler\Exception\UnrecognizedToken</code> when a lexeme is not
    recognized, i.e. when the textual data cannot be split into a sequence of
    lexemes, and <code>Hoa\Compiler\Exception\Lexer</code> when other more
    general errors happen, for example if a lexeme matches an empty
    value,</li>
    <li>during the <strong>syntactic</strong> analysis,
    <code>Hoa\Compiler\Exception\UnexpectedToken</code> when a lexeme is not
    expected, i.e. when it does not allow the grammar's rules to be
    derived.</li>
  </ul>
  <p>The parent exception is <code>Hoa\Compiler\Exception\Exception</code>.</p>

  <h4 id="Nodes" for="main-toc">Nodes</h4>

  <p>The compilation process very often ends with the
  <strong>production</strong> of an AST. It is important to control its
  <strong>form</strong>, its <strong>size</strong>, the data it
  <strong>holds</strong>, etc. This is why it is necessary to understand the
  <code>#<em>node</em></code> notation, which allows the creation of
  <strong>nodes</strong> in the AST. First, lexemes declared with the
  <code>&lt;<em>token</em>></code> syntax will appear in the tree, while the
  other lexemes, declared with the <code>::<em>token</em>::</code> syntax,
  will not appear. Indeed, in our last example, the <code>quote_</code>,
  <code>brace_</code>, <code>colon</code>, <code>comma</code>, etc. lexemes do
  not appear. Next, it is important to notice that the declaration of nodes
  can be <strong>overwritten</strong> within the <strong>same rule</strong>.
  Finally, a rule name can be preceded by <code>#</code>, just like the
  <code>#array</code> rule, which defines a <strong>default</strong> node,
  but it can be overwritten.
For example, if we replace the
  <code>array</code> rule by:</p>
  <pre><code class="language-pp">#array:
    ::bracket_:: value() ( ::comma:: value() #bigarray )* ::_bracket::</code></pre>
  <p>If the array contains only one value, the node will be named
  <code>#array</code>, otherwise it will be named <code>#bigarray</code>.
  Let's see:</p>
  <pre><code class="language-shell">$ echo '[42]' | hoa compiler:pp Json.pp 0 --visitor dump
> #array
> > token(number, 42)
$ echo '[4, 2]' | hoa compiler:pp Json.pp 0 --visitor dump
> #bigarray
> > token(number, 4)
> > token(number, 2)</code></pre>
  <p>Of course, a node can be created or not based on the
  <strong>derivation</strong> of a rule. The mechanism is normally pretty
  <strong>intuitive</strong>.</p>

  <h4 id="Namespaces" for="main-toc">Namespaces</h4>

  <p>Let's detail the behavior of the lexical analyzer regarding
  <strong>namespaces</strong> a little bit.</p>
  <p><strong>Lexemes</strong> are put in namespaces, which means that only the
  lexemes of the <strong>current</strong> namespace are used by the lexical
  analyzer. The default namespace is <code>default</code>. To declare the
  namespace of a lexeme, we have to use the <code>:</code> operator. When a
  lexeme is consumed, it is able to <strong>change</strong> the current
  namespace for the rest of the lexical analysis, thanks to the
  <code>-></code> operator. Thus, we will declare five lexemes:
  <code>foo</code> and <code>bar</code> in the <code>default</code> namespace,
  <code>baz</code> in <code>ns1</code>, and <code>qux</code> and
  <code>foo</code> in <code>ns2</code>.
The fact that <code>foo</code> is declared twice is not
  a problem, because the two declarations live in <strong>different</strong>
  namespaces:</p>
  <pre><code class="language-pp">%token      foo  fo+   -> ns1
%token      bar  ba?r+ -> ns2
%token  ns1:baz  ba?z+ -> default
%token  ns2:qux  qux+
%token  ns2:foo  FOO   -> ns1</code></pre>
  <p>Writing <code>default:bar</code> is strictly equivalent to
  <code>bar</code>. The <code>foo</code> lexeme leads into <code>ns1</code>,
  <code>bar</code> into <code>ns2</code>, <code>ns1:baz</code> into
  <code>default</code>, <code>ns2:qux</code> stays in <code>ns2</code> and
  <code>ns2:foo</code> leads into <code>ns1</code>. Let's observe the sequence
  of lexemes produced by the lexical analyzer for the following data:
  <code>fooooobzzbarrrquxFOObaz</code>:</p>
  <pre><code class="language-php">$pp = '%token foo fo+ -> ns1' . "\n" .
      '%token bar ba?r+ -> ns2' . "\n" .
      '%token ns1:baz ba?z+ -> default' . "\n" .
      '%token ns2:qux qux+' . "\n" .
      '%token ns2:foo FOO -> ns1';

// 1. Parse PP.
Hoa\Compiler\Llk\Llk::parsePP($pp, $tokens, $rawRules);

// 2. Run the lexical analyzer.
$lexer    = new Hoa\Compiler\Llk\Lexer();
$sequence = $lexer->lexMe('fooooobzzbarrrquxFOObaz', $tokens);

// 3. Pretty-print the result.
$format = '%' . (strlen((string) count($sequence)) + 1) . 's ' .
          '%-13s %-20s %-20s %6s' .
"\n"; 486$header = sprintf($format, '#', 'namespace', 'token name', 'token value', 'offset'); 487 488echo $header, str_repeat('-', strlen($header)), "\n"; 489 490foreach ($sequence as $i => $token) { 491 printf( 492 $format, 493 $i, 494 $token['namespace'], 495 $token['token'], 496 $token['value'], 497 $token['offset'] 498 ); 499} 500 501/** 502 * Will output: 503 * # namespace token name token value offset 504 * --------------------------------------------------------------------- 505 * 0 default foo fooooo 0 506 * 1 ns1 baz bzz 6 507 * 2 default bar barrr 9 508 * 3 ns2 qux qux 14 509 * 4 ns2 foo FOO 17 510 * 5 ns1 baz baz 20 511 * 6 default EOF EOF 23 512 */</code></pre> 513 <p>We read the lexemes, their namespace and their default value in the table. 514 The data has been successfully <strong>splitted</strong>, and we have jumped 515 from one namespace to the other. If this time we try with the 516 <code>foqux</code> data, we will have an error: <code>fo</code> is matched by 517 the <code>foo</code> lexeme in the <code>default</code> namespace, we then 518 jump to the <code>ns1</code> namespace, and here, no lexeme in this namespace 519 is able to recognize at least <code>q</code>. An error is thrown.</p> 520 <p>So far, we have seen how to jump from one namespace to another with the 521 <code>-></code> operator. No <strong>history</strong> about the used 522 namespaces is stored. However, in some rare cases, it is able for a lexeme to 523 be accessible through <strong>several</strong> namespaces and we would like 524 that a lexeme jumps back to the <strong>previous</strong> namespace. Put 525 another way, we would like a history of the used namespaces and being able to 526 navigate (backward only) in it. This is the role of the 527 <code>__shift__ * <em>n</em></code> keyword, in conjunction with the 528 <code>-></code> operator as usual. <code>__shift__</code> is equivalent to 529 say: jump back to the previous namespace. 
<code>__shift__</code> is equivalent
  to <code>__shift__ * 1</code>, and <code>__shift__ * <em>n</em></code>
  means: jump back <code><em>n</em></code> times to the previous
  namespace.</p>
  <p>When the <code>__shift__</code> keyword appears in the grammar, namespaces
  are managed like a <strong>stack</strong>, hence the <em>shift</em>
  vocabulary. We have to take care to shift namespaces regularly in order to
  avoid a huge stack.</p>
  <p>Let's take the example of the following grammar:</p>
  <pre><code class="language-pp">%token foo1 a -> ns1
%token foo2 x -> ns2
%token end  e

%token ns1:bar  b
%token ns1:baz  c -> ns3
%token ns1:tada t -> __shift__

%token ns2:bar  y
%token ns2:baz  z -> ns3
%token ns2:tada T -> __shift__

%token ns3:qux  = -> __shift__

#test:
    ( &lt;foo1> | &lt;foo2> ) &lt;bar> &lt;baz> &lt;qux> &lt;tada> &lt;end></code></pre>
  <p>This grammar recognizes two words: <code>abc=te</code> and
  <code>xyz=Te</code>. If the first lexeme <code>foo1</code> is matched, the
  lexical analyzer will jump to the <code>ns1</code> namespace. When it
  matches the <code>ns1:baz</code> lexeme, it will jump into <code>ns3</code>.
  Then, when it matches <code>ns3:qux</code>, it will jump back to the
  previous namespace, namely <code>ns1</code>. And so on. Now, if it does not
  match <code>foo1</code> but <code>foo2</code>, it will jump to the
  <code>ns2</code> namespace, then into <code>ns3</code> and then into
  <code>ns2</code> again.
The <code>__shift__</code> keywords in
  <code>ns1:tada</code> and <code>ns2:tada</code> are only used to shift
  namespaces, but no ambiguity exists: these namespaces are accessible from
  only one namespace, namely <code>default</code>.</p>
  <p>Now, let's save this grammar in the <code>NamespaceStack.pp</code> file
  and use the <code>-s</code>/<code>--token-sequence</code> option of the
  <code>hoa compiler:pp</code> command:</p>
  <pre><code class="language-shell">$ echo -n 'abc=te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
  #  namespace     token name           token value          offset
-------------------------------------------------------------------------------
  0  default       foo1                 a                         0
  1  ns1           bar                  b                         1
  2  ns1           baz                  c                         2
  3  ns3           qux                  =                         3
  4  ns1           tada                 t                         4
  5  default       end                  e                         5
  6  default       EOF                  EOF                       6

$ echo -n 'xyz=Te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
  #  namespace     token name           token value          offset
-------------------------------------------------------------------------------
  0  default       foo2                 x                         0
  1  ns2           bar                  y                         1
  2  ns2           baz                  z                         2
  3  ns3           qux                  =                         3
  4  ns2           tada                 T                         4
  5  default       end                  e                         5
  6  default       EOF                  EOF                       6</code></pre>
  <p>We see that the lexical analyzer has successfully jumped between all the
  namespaces, as expected. We had two ways to access the <code>ns3</code>
  namespace: either from <code>ns1</code> or from <code>ns2</code>. The
  analyzer has succeeded in creating a history of namespaces and navigating
  in it.</p>

  <h4 id="Unification" for="main-toc">Unification</h4>

  <p>A new feature brought by the PP language compared to other grammar
  languages is the capability to express a <strong>unification</strong> of
  lexemes.
Let's
  imagine the following grammar:</p>
  <pre><code class="language-pp">%token quote  '|"
%token string \w+

rule:
    ::quote:: &lt;string> ::quote::</code></pre>
  <p>The quotes surrounding the string can be of two kinds: single, with the
  “<code>'</code>” symbol, or double, with the “<code>"</code>” symbol. Thus,
  the <code>"foo"</code> and <code>'foo'</code> data are valid, but
  <strong>so are</strong> <code>"foo'</code> and <code>'foo"</code>!</p>
  <p>The unification of lexemes allows an additional
  <strong>constraint</strong> to be added on the <strong>value</strong> of
  lexemes at runtime. The syntax is the following:
  <code><em>token</em>[<em>i</em>]</code>. The value of
  <code><em>i</em></code> indicates which lexemes must have the same value.
  And finally, the unification is <strong>local</strong> to an instance of a
  rule, which means that there is no unification between lexemes of different
  rules, and that the unification is only applied within the
  <strong>current</strong> rule.
Thus, the example becomes:</p>
  <pre><code class="language-pp">rule:
    ::quote[0]:: &lt;string> ::quote[0]::</code></pre>
  <p>Which invalidates the <code>"foo'</code> and <code>'foo"</code> data.</p>
  <p>Unification finds numerous applications, such as matching the names of
  opening and closing tags in the <a href="http://w3.org/TR/xml11/">XML
  language</a>.</p>

  <h3 id="Abstract_Syntax_Tree" for="main-toc">Abstract Syntax Tree</h3>

  <div id="overview_ast" class="schema"></div>
  <script>
  Hoa.Document.onReady(function ( ) {

      var paper = Hoa.Graph(Hoa.$('#overview_ast'), 800, 360);
      var flow  = paper.grid(0, 0, 800, 360, 1, 4);
      flow.push(paper.rect(0, 0, 200, 70, 3, 'lexical analyzer').attr({opacity: .3}));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'syntactic analyzer').attr({opacity: .3}));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'trace').attr({opacity: .3}));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'AST'));

      var annot = paper.grid(180, 0, 80, 360, 1, 4);
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '1').attr({opacity: .3}));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '2').attr({opacity: .3}));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '3').attr({opacity: .3}));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '4'));
  });
  </script>
  <p>An <strong>abstract syntax tree</strong> represents textual data in a
  <strong>structural</strong> form. Each <strong>node</strong> of this tree is
  represented by the <code>Hoa\Compiler\Llk\TreeNode</code> class.
Among the most useful methods, we find:</p>
  <ul>
    <li><code>getId</code> to get the identifier of the node,</li>
    <li><code>getValueToken</code> to get the name of the lexeme,</li>
    <li><code>getValueValue</code> to get the value of the lexeme,</li>
    <li><code>isToken</code> to know whether the node represents a
    lexeme,</li>
    <li><code>getChild(<em>$i</em>)</code> to get the
    <code><em>$i</em></code>-th child of the node,</li>
    <li><code>getChildren</code> to get all the children,</li>
    <li><code>getChildrenNumber</code> to count the number of children,</li>
    <li><code>getParent</code> to get the parent node,</li>
    <li><code>getData</code> to get a reference to a data array,</li>
    <li><code>accept</code> to apply a visitor.</li>
  </ul>
  <p>Visitors are the simplest way to <strong>traverse</strong> an AST. As an
  example, let's write a <code>PrettyPrinter</code> visitor which rewrites
  JSON data with our own conventions (spacing, line breaks etc.). A visitor
  must implement the <code>Hoa\Visitor\Visit</code> interface and has only one
  method to implement: <code>visit</code>, which receives three arguments: the
  element to visit and two optional arguments (one passed by reference, the
  other by copy). Let's see:</p>
  <pre><code class="language-php">class PrettyPrinter implements Hoa\Visitor\Visit
{
    public function visit (
        Hoa\Visitor\Element $element,
        &amp;$handle = null,
        $eldnah  = null
    ) {
        static $i      = 0;
        static $indent = '    ';

        // One behaviour per node in the AST.
        switch ($element->getId()) {
            // Object: { … }.
            case '#object':
                echo '{', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 &lt; $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), '}';

                break;

            // Array: [ … ].
            case '#array':
                echo '[', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 &lt; $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), ']';

                break;

            // Pair: "…": ….
            case '#pair':
                echo
                    $element->getChild(0)->accept($this, $handle, $eldnah),
                    ': ',
                    $element->getChild(1)->accept($this, $handle, $eldnah);

                break;

            // Many tokens.
            case 'token':
                switch ($element->getValueToken()) {
                    // String: "…".
                    case 'string':
                        echo '"', $element->getValueValue(), '"';

                        break;

                    // Booleans.
                    case 'true':
                    case 'false':

                    // Null.
                    case 'null':

                    // Number.
                    case 'number':
                        echo $element->getValueValue();

                        break;
                }

                break;
        }
    }
}</code></pre>
  <p>Let's see an example:</p>
  <pre><code class="language-php">$compiler    = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast         = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$prettyPrint = new PrettyPrinter();
echo $prettyPrint->visit($ast);

/**
 * Will output:
 * {
 *     "foo": true,
 *     "bar": [
 *         null,
 *         42
 *     ]
 * }
 */</code></pre>
  <p>The <code>getData</code> method is very useful to <strong>store</strong>
  data that are likely to be <strong>reused</strong>, for example from one
  visitor to another. This method returns a <strong>reference</strong> to an
  array; thus:</p>
  <pre><code class="language-php">$data = $element->getData();

if (!isset($data['previousComputing'])) {
    throw new Exception('Need a previous computing.', 0);
}

$previous = $data['previousComputing'];</code></pre>
  <p>It is common to use one visitor per <strong>constraint</strong>:
  verification of symbols, verification of types etc. 
Some of them can set data for the next ones. The <code>getData</code> method
  does not impose any data structure; it only provides access to an array.
  Feel free to do whatever you want with it.</p>
  <p>Using <code>Hoa\Compiler\Llk\TreeNode</code> is very
  <strong>straightforward</strong> and we will use it most of the time
  together with a visitor.</p>

  <h3 id="Traces" for="main-toc">Traces</h3>

  <div id="overview_trace" class="schema"></div>
  <script>
  Hoa.Document.onReady(function ( ) {

      var paper = Hoa.Graph(Hoa.$('#overview_trace'), 800, 360);
      var flow  = paper.grid(0, 0, 800, 360, 1, 4);
      flow.push(paper.rect(0, 0, 200, 70, 3, 'lexical analyzer').attr({opacity: .3}));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'syntactic analyzer').attr({opacity: .3}));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'trace'));
      flow.push(paper.rect(0, 0, 200, 70, 3, 'AST').attr({opacity: .3}));

      var annot = paper.grid(180, 0, 80, 360, 1, 4);
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '1').attr({opacity: .3}));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '2').attr({opacity: .3}));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '3'));
      annot.push(paper.rect(0, 0, 45, 45, 22.5, '4').attr({opacity: .3}));
  });
  </script>
  <p>The LL(<em>k</em>) compiler provided by Hoa is clearly composed of
  <strong>layers</strong>. Each layer can be used with maximal modularity and
  extensibility. When the grammar has been translated into “machine rules” and
  the lexical and syntactic analyzers have validated some data, a
  <strong>trace</strong> is computed. 
This trace is a flat array containing only three
  <strong>kinds</strong> of objects:</p>
  <ul>
    <li><code>Hoa\Compiler\Llk\Rule\Entry</code> when the syntactic analyzer
    has opened a rule,</li>
    <li><code>Hoa\Compiler\Llk\Rule\Ekzit</code> when the syntactic analyzer
    has closed a rule,</li>
    <li><code>Hoa\Compiler\Llk\Rule\Token</code> when the syntactic analyzer
    has encountered a lexeme.</li>
  </ul>
  <p>We can get the trace with the
  <code>Hoa\Compiler\Llk\Parser::getTrace</code> method. To understand this
  trace, let's start with an example:</p>
  <pre><code class="language-php">$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$i        = 0;

foreach ($compiler->getTrace() as $element) {
    if ($element instanceof Hoa\Compiler\Llk\Rule\Entry) {
        echo str_repeat('> ', ++$i), 'enter ', $element->getRule(), "\n";
    } elseif ($element instanceof Hoa\Compiler\Llk\Rule\Token) {
        echo
            str_repeat(' ', $i + 1), 'token ', $element->getTokenName(),
            ', consumed ', $element->getValue(), "\n";
    } else {
        echo str_repeat('&lt; ', $i--), 'ekzit ', $element->getRule(), "\n";
    }
}

/**
 * Will output:
 * > enter value
 * > > enter object
 *    token brace_, consumed {
 * > > > enter pair
 * > > > > enter string
 *      token quote_, consumed "
 *      token string, consumed foo
 *      token _quote, consumed "
 * &lt; &lt; &lt; &lt; ekzit string
 *     token colon, consumed :
 * > > > > enter value
 *      token true, consumed true
 * &lt; &lt; &lt; &lt; ekzit value
 * &lt; &lt; &lt; ekzit pair
 * > > > enter 13
 * > > > > enter 12
 *      token comma, consumed ,
 * > > > > > enter pair
 * > > > > > > enter string
 *        token quote_, consumed "
 *        token string, consumed bar
 *        token _quote, consumed "
 * &lt; &lt; &lt; &lt; &lt; &lt; ekzit string
 *       token colon, consumed :
 * > > > > > > enter value
 * > > > > > > > enter array
 *         token bracket_, consumed [
 * > > > > > > > > enter value
 *          token null, consumed null
 * &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; ekzit value
 * > > > > > > > > enter 21
 * > > > > > > > > > enter 20
 *           token comma, consumed ,
 * > > > > > > > > > > enter value
 *            token number, consumed 42
 * &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; ekzit value
 * &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; ekzit 20
 * &lt; &lt; &lt; &lt; &lt; &lt; &lt; &lt; ekzit 21
 *         token _bracket, consumed ]
 * &lt; &lt; &lt; &lt; &lt; &lt; &lt; ekzit array
 * &lt; &lt; &lt; &lt; &lt; &lt; ekzit value
 * &lt; &lt; &lt; &lt; &lt; ekzit pair
 * &lt; &lt; &lt; &lt; ekzit 12
 * &lt; &lt; &lt; ekzit 13
 *    token _brace, consumed }
 * &lt; &lt; ekzit object
 * &lt; ekzit value
 */</code></pre>
  <p>This example reveals several things. First, the information given by the
  trace can be useful: if we are on a rule, we have its <strong>name</strong>
  (with the <code>getRule</code> method); if we are on a lexeme, we have its
  <strong>name</strong> (with the <code>getTokenName</code> method), its
  <strong>representation</strong> (expressed in the PCRE format, with the
  <code>getRepresentation</code> method), its <strong>value</strong> (with the
  <code>getValue</code> method), whether it is a lexeme to
  <strong>keep</strong> in the AST (with the <code>isKept</code> method) and
  its <strong>unification</strong> index if there is one (with the
  <code>getUnificationIndex</code> method). 
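</p>
  <p>For instance, the consumed values carried by the <code>Token</code>
  elements are enough to rebuild the raw input (up to the skipped lexemes).
  The following sketch uses a toy stand-in class instead of the real
  <code>Hoa\Compiler\Llk\Rule\Token</code> (only two of the accessors listed
  above are mimicked), so that it runs without <code>Hoa\Compiler</code>:</p>
  <pre><code class="language-php">class ToyToken
{
    protected $_name;
    protected $_value;

    public function __construct($name, $value)
    {
        $this->_name  = $name;
        $this->_value = $value;
    }

    public function getTokenName() { return $this->_name; }
    public function getValue()     { return $this->_value; }
}

// Concatenate the consumed values, ignoring the Entry/Ekzit elements
// (represented here by plain strings).
function rebuildInput(array $trace)
{
    $out = '';

    foreach ($trace as $element) {
        if ($element instanceof ToyToken) {
            $out .= $element->getValue();
        }
    }

    return $out;
}

$trace = ['enter object', new ToyToken('brace_', '{'), new ToyToken('_brace', '}')];
// rebuildInput($trace) returns '{}'.</code></pre>
  <p>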
Of course, all this information can be edited, which can influence the
  construction of the AST, which is the (optional) next step.</p>
  <p>Next, we notice that sometimes we enter a rule that exists in the
  grammar, like <code>object</code>, <code>pair</code>, <code>value</code>
  etc., and sometimes we enter a <strong>numerical</strong> rule, like
  <code>13</code>, <code>12</code>, <code>21</code>, <code>20</code> etc. When
  the grammar is interpreted in order to be translated into “machine rules”,
  each of these rules is <strong>linearized</strong> with respect to the PP
  operators:</p>
  <ul>
    <li><code>Hoa\Compiler\Llk\Rule\Choice</code> for a disjunction,</li>
    <li><code>Hoa\Compiler\Llk\Rule\Concatenation</code> for a
    concatenation,</li>
    <li><code>Hoa\Compiler\Llk\Rule\Repetition</code> for a repetition,</li>
    <li><code>Hoa\Compiler\Llk\Rule\Token</code> for a lexeme (as seen
    above).</li>
  </ul>
  <p>All the rules in this format are accessible through the
  <code>Hoa\Compiler\Llk\Parser::getRules</code> method, represented by an
  array. For a better understanding, let's print all the rules
  <strong>reachable</strong> from the root:</p>
  <pre><code class="language-php">$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 1. Get all rules.
$rules = $compiler->getRules();

// 2. Start with the root rule.
$stack = [$compiler->getRootRule() => true];

while (false !== current($stack)) {
    $rule = key($stack);
    next($stack);
    echo
        "\n", '"', $rule, '" is a ',
        strtolower(substr(get_class($rules[$rule]), 22));

    $subrules = $rules[$rule]->getContent();

    // 3a. Token.
    if (null === $subrules) {
        continue;
    }

    echo ' of rules: ';

    // 3b. Other rules.
    foreach ((array) $subrules as $subrule) {
        if (!array_key_exists($subrule, $stack)) {
            // 4. Push new rules to print.
            $stack[$subrule] = true;
        }

        echo $subrule, ' ';
    }
}

/**
 * Will output:
 * "value" is a choice of rules: 1 2 3 string object array number
 * "1" is a token
 * "2" is a token
 * "3" is a token
 * "string" is a concatenation of rules: 5 6 7
 * "object" is a concatenation of rules: 10 pair 13 14
 * "array" is a concatenation of rules: 18 value 21 22
 * "number" is a token
 * "5" is a token
 * "6" is a token
 * "7" is a token
 * "10" is a token
 * "pair" is a concatenation of rules: string 16 value
 * "13" is a repetition of rules: 12
 * "14" is a token
 * "18" is a token
 * "21" is a repetition of rules: 20
 * "22" is a token
 * "16" is a token
 * "12" is a concatenation of rules: 11 pair
 * "20" is a concatenation of rules: 19 value
 * "11" is a token
 * "19" is a token
 */</code></pre>
  <p>Reading the <code>object</code> rule, we know that it is a concatenation
  of the <code>10</code>, <code>pair</code>, <code>13</code> and
  <code>14</code> rules. <code>10</code> is a lexeme, <code>pair</code> is a
  concatenation of the <code>string</code>, <code>16</code> and
  <code>value</code> rules, and so on. The initial grammar is translated in
  order to reach its <strong>smallest</strong> form. This makes it
  <strong>easier</strong> and <strong>faster</strong> to
  <strong>reason</strong> about the rules. Indeed, computations on the grammar
  are not restricted to the AST: in the previous step, with the trace, we were
  already able to perform computations.</p>

  <h3 id="Generation" for="main-toc">Generation</h3>

  <p>A grammar can be used to achieve two goals: to <strong>validate</strong>
  data (if we see the grammar as an automaton) or to <strong>generate</strong>
  data. 
So far, we have seen how to validate data through several steps:
  <strong>initialization</strong> of our compiler, run of the lexical and
  syntactic <strong>analyzers</strong>, production of a
  <strong>trace</strong>, and transformation of this trace into an
  <strong>AST</strong> in order to <strong>visit</strong> that tree. But we
  can stop at the first step, the initialization of our compiler, and exploit
  the rules in order to generate data that are valid with respect to our
  grammar.</p>
  <p><code>Hoa\Compiler\Llk\Sampler</code> provides three data
  <strong>generation</strong> algorithms:</p>
  <ul>
    <li>random uniform generation,</li>
    <li>bounded exhaustive generation,</li>
    <li>grammar coverage-based generation.</li>
  </ul>
  <p>Why provide three algorithms? Because it would be illusory to think that
  a single algorithm can satisfy <strong>numerous</strong> usage contexts.
  Each one fits different needs; we will explain them below.</p>
  <p>Generating data based on a grammar requires <strong>two
  steps</strong>:</p>
  <ol>
    <li>generating values for all the <strong>lexemes</strong> in order to
    get raw data,</li>
    <li>generating a <strong>path</strong> in the rules of the grammar.</li>
  </ol>
  <p>A path is equivalent to a derivation; the vocabulary differs according
  to our goal: validation or generation.</p>

  <h4 id="Isotropic_generation_of_lexemes" for="main-toc">Isotropic generation
  of lexemes</h4>

  <p>Whatever algorithm is used to generate values for the lexemes, it has to
  be <strong>fast</strong>. We are going to use a process called
  <strong>isotropic</strong>: we start with a rule and move forward only
  through <strong>random</strong> and <strong>locally uniform</strong>
  (uniform for this choice only) choices. 
For instance, if we have a disjunction between three
  sub-rules, we randomly and uniformly choose between 1 and 3. If we have a
  concatenation, we just concatenate each part. And finally, a repetition is
  nothing more than a disjunction of concatenations: indeed,
  <code><em>e</em>{1,3}</code> is strictly equivalent to <code><em>e</em> |
  <em>ee</em> | <em>eee</em></code>. Let's see:</p>
  <pre><code>([ae]+|[x-z]!){1,3}              <em>repeat <em>[ae]+|[x-z]!</em> 2 times</em>
→ ([ae]+|[x-z]!)([ae]+|[x-z]!)   <em>choose between <em>[ae]+</em> and <em>[x-z]!</em></em>
→ ([ae]+)([ae]+|[x-z]!)          <em>repeat <em>[ae]</em> 2 times</em>
→ [ae][ae]([ae]+|[x-z]!)         <em>choose between <em>a</em> and <em>e</em></em>
→ e[ae]([ae]+|[x-z]!)            <em>choose again</em>
→ ea([ae]+|[x-z]!)               <em>choose between <em>[ae]+</em> and <em>[x-z]!</em></em>
→ ea([x-z]!)                     <em>choose between <em>x</em>, <em>y</em> and <em>z</em></em>
→ eay!</code></pre>
  <p>This generation is naive, but that is not important. What matters is the
  generation of paths in the rules, or put another way, the generation of
  <strong>sequences of lexemes</strong>.</p>

  <h4 id="Uniform_random_generation" for="main-toc">Random uniform
  generation</h4>

  <p>The first algorithm performs <strong>random</strong> and
  <strong>uniform</strong> generation. It is useful if we would like to
  generate sequences of lexemes of a <strong>given size <em>n</em></strong>
  with <strong>uniformity</strong>, not locally (as in the isotropic
  generation), but over the <strong>set</strong> of all possible sequences.
  Succinctly, the algorithm works in two steps: a
  <strong>pre-computation</strong> (once per size) followed by the
  <strong>generation</strong> itself. The pre-computation is an
  <strong>automatic</strong> step that computes the <strong>number</strong>
  of all possible sequences and sub-sequences of size <em>n</em>. 
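</p>
  <p>To illustrate the pre-computation, here is a self-contained sketch (on a
  toy grammar, not <code>Json.pp</code>, and not the actual
  <code>Hoa\Compiler</code> implementation) counting the sequences of lexemes
  of exact size <em>n</em> for the grammar
  <code>value = 'true' | 'false' | '[' value ']'</code>: a choice sums the
  counts of its sub-rules, and a concatenation distributes <em>n</em> among
  its parts:</p>
  <pre><code class="language-php">function countValue($n)
{
    if (1 > $n) {
        return 0;
    }

    // The two 1-lexeme alternatives: 'true' and 'false'.
    $count = (1 === $n) ? 2 : 0;

    // The '[' value ']' alternative: the brackets consume 2 lexemes,
    // the inner value consumes the remaining $n - 2.
    $count += countValue($n - 2);

    return $count;
}

// countValue(1) = 2, countValue(2) = 0, countValue(3) = 2, countValue(7) = 2.</code></pre>
  <p>Once these counts are known for every sub-rule and every size, drawing
  uniformly amounts to picking each alternative with a probability
  proportional to its count. 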
This step helps to compute <strong>cumulative distribution
  functions</strong> in order to <strong>guide</strong> the generation of the
  final sequences of lexemes.</p>
  <p>Let's generate 10 random data of size 7, i.e. composed of 7 lexemes. We
  use the <code>Hoa\Compiler\Llk\Sampler\Uniform</code> class, which takes
  our grammar as its first argument, a lexeme value generator as its second
  argument (here <code>Hoa\Regex\Visitor\Isotropic</code>) and finally a
  size:</p>
  <pre><code class="language-php">$sampler = new Hoa\Compiler\Llk\Sampler\Uniform(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

for ($i = 0; $i &lt; 10; ++$i) {
    echo $i, ' => ', $sampler->uniform(), "\n";
}

/**
 * Will output:
 * 0 => [ false , null , null ]
 * 1 => [ " l " , null ]
 * 2 => [ [ true ] , true ]
 * 3 => [ [ [ 4435 ] ] ]
 * 4 => [ [ [ 9366 ] ] ]
 * 5 => [ true , false , null ]
 * 6 => { " |h&lt;# " : false }
 * 7 => [ [ [ false ] ] ]
 * 8 => [ false , true , 7 ]
 * 9 => [ false , 5 , 79 ]
 */</code></pre>
  <p>We can redefine the size with the
  <code>Hoa\Compiler\Llk\Sampler\Uniform::setLength</code> method. Note that
  some repetition operators have no upper bound, like <code>+</code> or
  <code>*</code>; in this case, we set it to <em>n</em>.</p>

  <h4 id="Bounded_exhaustive_generation" for="main-toc">Bounded exhaustive
  generation</h4>

  <p>The second algorithm performs <strong>bounded exhaustive</strong>
  generation. It generates <strong>all</strong> the sequences of lexemes (for
  the exhaustive side) from size 1 <strong>to</strong> <em>n</em> (for the
  bounded side). 
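</p>
  <p>As a self-contained sketch (again on the toy grammar
  <code>value = 'true' | 'false' | '[' value ']'</code>, not the actual
  implementation), enumerating every sequence of size at most <em>n</em> can
  be done recursively:</p>
  <pre><code class="language-php">function enumerateValues($max)
{
    if (1 > $max) {
        return [];
    }

    // The two 1-lexeme sequences.
    $out = [['true'], ['false']];

    // Wrap every sequence that fits in $max - 2 lexemes with '[' and ']'.
    foreach (enumerateValues($max - 2) as $inner) {
        $out[] = array_merge(['['], $inner, [']']);
    }

    return $out;
}

// count(enumerateValues(3)) = 4 and count(enumerateValues(5)) = 6.</code></pre>
  <p>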
An interesting aspect of this algorithm is that
  exhaustiveness is a kind of <strong>proof</strong>! The algorithm behaves
  like an iterator and is very simple to use:</p>
  <pre><code class="language-php">$sampler = new Hoa\Compiler\Llk\Sampler\BoundedExhaustive(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
 * Will output:
 * 0 => true
 * 1 => false
 * 2 => null
 * 3 => " 8u2 "
 * 4 => { " ,M@aj{ " : true }
 * 5 => { " x`|V " : false }
 * 6 => { " NttB " : null }
 * 7 => { " eJWwA " : 0 }
 * 8 => [ true ]
 * 9 => [ true , true ]
 * 10 => [ true , true , true ]
 * 11 => [ true , true , false ]
 * 12 => [ true , true , null ]
 * 13 => [ true , true , 32 ]
 * 14 => [ true , false ]
 * 15 => [ true , false , true ]
 * 16 => [ true , false , false ]
 * 17 => [ true , false , null ]
 * 18 => [ true , false , 729 ]
 * 19 => [ true , null ]
 * 20 => [ true , null , true ]
 * 21 => [ true , null , false ]
 * …
 * 157 => [ 780 , 01559 , 32 ]
 * 158 => 344
 */</code></pre>
  <p>Like the previous algorithm, we can redefine the upper bound with the
  <code>Hoa\Compiler\Llk\Sampler\BoundedExhaustive::setLength</code> method,
  and the repetition operators with no upper bound are bounded to
  <em>n</em>.</p>

  <h4 id="Grammar_coverage-based_generation" for="main-toc">Grammar
  coverage-based generation</h4>

  <p>The last algorithm performs <strong>grammar coverage-based</strong>
  generation. 
This algorithm reduces the <strong>combinatorial
  explosion</strong> encountered with the previous algorithm, but its goal is
  different: it generates sequences of lexemes that
  “<strong>activate</strong>” all the <strong>branches</strong> of the rules
  of the grammar. A rule is said to be covered if and only if all its
  sub-rules are covered, and a lexeme is said to be covered if it has been
  used in a sequence. To ensure <strong>diversity</strong> in the produced
  sequences, choice points between uncovered sub-rules are resolved
  <strong>randomly</strong>. The algorithm also behaves like an iterator:</p>
  <pre><code class="language-php">$sampler = new Hoa\Compiler\Llk\Sampler\Coverage(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random())
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
 * Will output:
 * 0 => true
 * 1 => { " )o?bz " : null , " %3W) " : [ false , 130 , " 6 " ] }
 * 2 => [ { " ny " : true } ]
 * 3 => { " Ne;[3 " : [ true , true ] , " th: " : true , " C[8} " : true }
 */</code></pre>
  <p>To avoid the combinatorial explosion and to ensure the
  <strong>termination</strong> of the algorithm, we use the following
  <strong>heuristic</strong> on the <strong>repetition</strong> operators:
  <code>*</code> will repeat <code>0</code>, <code>1</code> and
  <code>2</code> times; <code>+</code> will repeat <code>1</code> and
  <code>2</code> times; and finally <code>{<em>x</em>,<em>y</em>}</code> will
  repeat <code><em>x</em></code>, <code><em>x</em> + 1</code>,
  <code><em>y</em> - 1</code> and <code><em>y</em></code> times. 
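</p>
  <p>The bounds exercised by this heuristic can be sketched with a small
  helper (a toy function for illustration; it is not part of the
  <code>Hoa\Compiler</code> API). An unbounded operator is clamped to 2, so
  that <code>*</code> yields 0, 1 and 2, and <code>+</code> yields 1
  and 2:</p>
  <pre><code class="language-php">function repetitionCounts($x, $y = null)
{
    // '*' is {0,∞} and '+' is {1,∞}: clamp the missing upper bound to 2.
    if (null === $y) {
        return range($x, 2);
    }

    // {x,y}: exercise both edges and their direct neighbours.
    $counts = array_unique([$x, $x + 1, $y - 1, $y]);
    sort($counts);

    return $counts;
}

// repetitionCounts(0)    = [0, 1, 2]    ('*').
// repetitionCounts(1)    = [1, 2]       ('+').
// repetitionCounts(2, 5) = [2, 3, 4, 5] ('{2,5}').
// repetitionCounts(1, 2) = [1, 2]       ('{1,2}').</code></pre>
  <p>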
This heuristic finds its origin in <strong>boundary</strong> testing.</p>
  <p>We notice in our example that 4 data are generated and are enough to
  entirely <strong>cover</strong> our grammar!</p>

  <h4 id="Comparison_between_algorithms" for="main-toc">Comparison between
  algorithms</h4>

  <p>Here are some <strong>clues</strong> about when to use one algorithm
  instead of another.</p>
  <dl>
    <dt>Random uniform generation:</dt>
    <dd><ul>
      <li data-item="+">fast for small data, high data diversity and fixed
      size,</li>
      <li data-item="-">the pre-computation step is exponential and therefore
      long, whereas the generation itself is very fast.</li>
    </ul></dd>
    <dt>Bounded exhaustive generation:</dt>
    <dd><ul>
      <li data-item="+">fast for small data, and exhaustiveness is
      efficient,</li>
      <li data-item="-">exponential number of data.</li>
    </ul></dd>
    <dt>Grammar coverage-based generation:</dt>
    <dd><ul>
      <li data-item="+">fast for average or big data, and diversity of
      data,</li>
      <li data-item="-">does not consider size.</li>
    </ul></dd>
  </dl>

  <h2 id="LL1_compiler-compiler" for="main-toc">LL(1) compiler-compiler</h2>

  <p>The documentation of the LL(1) compiler, through the
  <code>Hoa\Compiler\Ll1</code> class, is not written yet. Its goal differs
  from <code>Hoa\Compiler\Llk</code>: there is no grammar description
  language, automata are hard-coded with arrays, and this type of compiler is
  oriented towards high performance. That's why its behavior is
  <strong>monolithic</strong>.</p>

  <h2 id="Conclusion" for="main-toc">Conclusion</h2>

  <p><code>Hoa\Compiler</code> provides two
  <strong>compiler-compilers</strong>: <code>Hoa\Compiler\Llk</code> and
  <code>Hoa\Compiler\Ll1</code>, designed to <strong>validate</strong> data. 
The <strong>PP language</strong> makes it possible to write
  <strong>algebraic grammars</strong> in a <strong>simple</strong> and
  <strong>natural</strong> way. The LL(<em>k</em>) compiler-compiler is split
  into <strong>distinct processes</strong>, which encourages hackability.
  Finally, several algorithms make it possible to <strong>generate</strong>
  data based on a grammar, according to specific usages.</p>

</yield>
</overlay>