1<?xml version="1.0" encoding="utf-8"?>
2
3<overlay xmlns="http://hoa-project.net/xyl/xylophone">
4<yield id="chapter">
5
6  <p><strong>Compilers</strong> allow to <strong>analyze</strong> and
7  <strong>manipulate textual</strong> data. There is numerous usages.
8  <code>Hoa\Compiler</code> offers to manipulate several compilers based on
9  needs.</p>
10
11  <h2 id="Table_of_contents">Table of contents</h2>
12
13  <tableofcontents id="main-toc" />
14
15  <h2 id="Introduction" for="main-toc">Introduction</h2>
16
17  <blockquote cite="https://en.wikipedia.org/wiki/Nicolas_Boileau-Despr%C3%A9aux">What
18  is conceived well is expressed clearly, and the right words come
19  easily.</blockquote>
20  <p>A <strong>language</strong> is a way to express or to
21  <strong>formulate</strong> a <strong>solution</strong> to a
22  <strong>problem</strong>. And there exists a lot of problems. We read and
23  write in several languages daily, and some of these languages are
24  <strong>understood</strong> by <strong>machines</strong>. This operation is
25  possible thanks to <strong>compilers</strong>.</p>
26  <p><a href="https://en.wikipedia.org/wiki/Formal_language">Formal
27  languages</a> is a theory that includes the <strong>automatic
28  analysis</strong> of those languages through tools like
29  <strong>automata</strong> or <strong>grammars</strong>. It is necessary to
30  have a detailed explanation to understand well all these concepts. However, we
31  are going to popularise as much as needed to allow a correct usage of
32  <code>Hoa\Compiler</code>.</p>
33
34  <h3 id="Language_and_grammar" for="main-toc">Language and grammar</h3>
35
36  <p>A <strong>language</strong> is a set of <strong>words</strong>. Each word
37  is a <strong>sequence</strong> of <strong>symbols</strong> belonging to an
38  <strong>alphabet</strong>. A symbol represents the smallest <strong>lexical
39  unit</strong> of a language, it is atomic and we will call it a
40  <strong>lexeme</strong> (also often mentionned as a token). These sequences of
41  lexemes representing words are built with <strong>rules</strong>. From a word
42  and a root rule, we will try to <strong>derive</strong> its sub-rules. If a
43  derivation exists, then the word is considered as <strong>valid</strong>, else
44  it is considered as <strong>invalid</strong>. We also speak about words
45  <strong>matching</strong>. For example, if we consider the following
46  rules:</p>
47  <pre><code>    exp ::= exp + exp
48          | number
49 number ::= digit number
50          | digit
51digit   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9</code></pre>
52  <p>The word that we are going to match is <code>7 + 35</code>. The root rule
53  is <code><em>exp</em></code>. If we derive (from the left to the right and
54  from the top to the bottom), we can have
55  <code><em>exp</em> + <em>exp</em></code> or <code><em>number</em></code> (the
56  <strong>disjunction</strong>, i.e. the “or”, is represented by the
57  “<code>|</code>” symbol):</p>
58  <pre><code>exp + exp | number
59→ exp + exp
60→ ( exp + exp | number ) + exp
61→ number + exp
62→ ( digit number | digit ) + exp</code></pre>
63  <p>We continue to derive until we <strong>eliminate</strong> every rules in
64  order to have only <strong>lexemes</strong>:</p>
65  <pre><code>…
66→ ( digit number | digit ) + exp
67→ digit + exp
68→ ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) + exp
69→ 7 + exp
70→ 7 + ( exp + exp | number )
71→ 7 + number
72→ 7 + ( digit number | digit )
73→ 7 + digit number
74→ 7 + ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) number
75→ 7 + 3 number
76→ 7 + 3 ( digit number | digit )
77→ 7 + 3 digit
78→ 7 + 3 ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
79→ 7 + 3 5</code></pre>
80  <p>A derivation exists to match the word <code>7 + 35</code>, thus it is a
81  valid word for those rules.</p>
82  <p>A set of rules is called a <strong>grammar</strong>. And so, a grammar
83  represents a <strong>language</strong>!</p>
84  <p>However, these are different kinds of grammars. In 1956, the
85  <a href="https://en.wikipedia.org/wiki/Chomsky_hierarchy">Chomsky
86  hierarchy</a> has been formulated, classifying grammars in four
87  <strong>levels</strong>:</p>
88  <ol>
89    <li><strong>unrestricted</strong> grammars, matching langages known as
90    Turing languages, rules have no restriction,</li>
91    <li><strong>context-sensitive</strong> grammars, matching contextual
92    languages,</li>
93    <li><strong>context-free</strong> grammars, matching algebraic languages,
94    based on stacked automata,</li>
95    <li><strong>regular</strong> grammars, matching regular languages.</li>
96  </ol>
97  <p>Each level includes the next level. <code>Hoa\Compiler</code> only treats
98  languages defined by grammars of level 3 and 4. To give a quick idea, regular
99  grammars can be connected to
100  <a href="https://en.wikipedia.org/wiki/Regular_expression">regular
101  expressions</a> (like <a href="http://pcre.org/">PCRE</a>), well-known by
102  developers. But regular grammars are not able to match a <strong>pair of
103  symbols</strong> (like parentheses, braces or quotes), while algebraic
104  languages allow that (thanks to the concept of stack of lexemes).</p>
105
106  <h3 id="Matching_words" for="main-toc">Matching words</h3>
107
108  <div id="parsers" class="schema"></div>
109  <script>
110  Hoa.Document.onReady(function ( ) {
111
112      var paper    = Hoa.Graph(Hoa.$('#parsers'), 800, 180);
113      var grid     = paper.grid(0, 0, 800, 180, 5, 2);
114      var word     = grid.push(paper.rect(0, 0, 140, 80, 3, 'word'), 0, 0);
115      var sequence = grid.push(paper.rect(0, 0, 140, 80, 3, 'sequence'), 2, 0);
116      var trace    = grid.push(paper.rect(0, 0, 140, 80, 3, 'result'), 4, 0);
117      grid.push(paper.rect(0, 0, 140, 50, 3, 'abcdef'), 0, 1);
118      grid.push(paper.rect(0, 0, 380, 50, 3, '[[a ⟼ …], [bc ⟼ …], [d ⟼ …], [ef ⟼ …]]'), 2, 1);
119      grid.push(paper.rect(0, 0, 140, 50, 3, 'valid/invalid'), 4, 1);
120
121      paper.link.between(word, sequence, 'lexical analyzer');
122      paper.link.between(sequence, trace, 'syntactic analyzer');
123  });
124  </script>
125  <p>In general, the compilation process begins with two
126  <strong>analyzes</strong>: <strong>lexical</strong> and
127  <strong>syntactic</strong>. A lexical analysis <strong>splits</strong> a word
128  in a <strong>sequence of lexemes</strong>. This sequence will thereafter be
129  used by the syntactic analyzer in order to verify that the word
130  <strong>belongs</strong> to the language.</p>
131  <p>According to the grammar, the matching will not be done in the same way,
132  but the principle stays the same: using lexemes one after the other in the
133  sequence and verify that they allow <strong>going forward</strong> in the
134  <strong>derivation</strong> of grammar's rules.</p>
135  <p>Syntactic analyzes are also classified into categories: LL, LR, LALR etc.
136  <code>Hoa\Compiler</code> only offers LL syntactic analyzers, stands for
137  Left-to-right Leftmost derivation, i.e. from the topest rule to the deepest,
138  and rules are derived from the left to the right. Here, again, there is
139  sub-categories, whose two supported by <code>Hoa\Compiler</code>: LL(1) and
140  LL(*). Globally, we speak about LL(<em>k</em>) syntactic analyzer: if a lexeme
141  does not allow to derive a rule as expected, then the analyzer is allowed to
142  <strong>go back</strong> to <em>k</em> step backward; we also speak about
143  backtrack. Put another way, rules can be <strong>ambiguous</strong>: each time
144  we derive a rule of the grammar, we have several possible choices and the
145  analyzer can be wrong, this is why it needs to backtrack sometimes. The
146  <em>k</em> variable defines the ambiguity <strong>level</strong>. If a grammar
147  can be analyzed by a LL(1) syntactic analyzer, it is known as
148  <strong>unambiguous</strong>: each time a lexeme is used to derive our rules,
149  there is only one choice. And if we have a LL(*) syntactic analyzer, it means
150  that the <em>k</em> variable is <strong>undefined</strong>. The following
151  example illustrates an unambiguous grammar:</p>
152  <pre><code>rule ::= a b c | d e f</code></pre>
153  <p>And this example illustrates an ambiguous grammar:</p>
154  <pre><code>rule1 ::= a rule2
155rule2 ::= b rule3 | b rule4
156rule3 ::= c d
157rule4 ::= e f</code></pre>
158  <p>Let's see what happens when trying to find a derivation for the word
159  <code>abef</code> from the root rule <code>rule1</code>:</p>
160  <pre><code>rule1
161→ a rule2                   <em>   a  bef ✔</em>
162  → a (b rule3 | b rule4)   <em>   a  bef</em>
163    → a b rule3             <em>  ab  ef  ✔</em>
164      → a b c d             <em> abe  f   ✖</em>
165    ← a b rule3             <em>  ab  ef  ✖</em>
166  ← a (b rule3 | b rule4)   <em>   a  bef</em>
167    → a b rule4             <em>  ab  ef  ✔</em>
168      → a b e f             <em>abef      ✔</em></code></pre>
169  <p>The <code>rule2</code> rule is ambiguous, which can lead to a bad
170  derivation and consequently a backtrack.</p>
171
172  <h2 id="LLk_compiler-compiler" for="main-toc">LL(<em>k</em>)
173  compiler-compiler</h2>
174
175  <p>Writing compilers is a <strong>laborious</strong> task. It is not always
176  difficult but most of the time, it is repetitive and it takes time. This is
177  why there exists <strong>compilers of compilers</strong>, or in other words,
178  compilers generators. Most of the time, these compiler-compilers use an
179  <strong>intermediary</strong> language to write a grammar. The
180  <code>Hoa\Compiler</code> library provides the
181  <code>Hoa\Compiler\Llk\Llk</code> class that allows the writing of
182  compiler-compilers through a <strong>dedicated</strong> language.</p>
183
184  <h3 id="PP_language" for="main-toc">PP language</h3>
185
186  <p>The PP language, standing for <em>PHP Parser</em>, allows to express
187  <strong>algebraic grammars</strong>. It is written in files having the
188  <code>.pp</code> extension (please, see the
189  <a href="@central_resource:path=Library/Compiler/.Mime"><code>hoa://Library/Compiler/.Mime</code></a>
190  file).</p>
191  <p>A grammar is composed of <strong>lexemes</strong> and
192  <strong>rules</strong>. The declaration of a lexeme has the following syntax:
193  <code>%token <em>namespace_in</em>:<em>name</em> <em>value</em> ->
194  <em>namespace_out</em></code>, where <code><em>name</em></code> represents the
195  <strong>name</strong> of the lexeme, <code><em>value</em></code> represents
196  its <strong>value</strong>, expressed with the
197  <a href="http://pcre.org/">PCRE</a> format (take care to not match an empty
198  value, in which case an exception will be thrown), and
199  <code><em>namespace_in</em></code> and <code><em>namespace_out</em></code>
200  represent the names of <strong>namespaces</strong> and are optional (the
201  default name is <code>default</code>). For example <code>number</code>
202  describing a number composed of digits from <code>0</code> to
203  <code>9</code>:</p>
204  <pre><code class="language-pp">%token number \d+</code></pre>
205  <p>Namespaces represent disjointed <strong>subsets</strong> of lexemes, used
206  to <strong>ease</strong> analyzes. A <code>%skip</code> declaration is similar
207  to <code>%token</code> excepts that it represents a lexeme to
208  <strong>skip</strong>, it means to not consider. A common example of
209  <code>%skip</code> lexemes are spaces:</p>
210  <pre><code class="language-pp">%skip space \s</code></pre>
211  <p>To explain rules, we will use the <code>Json.pp</code> grammar as an
212  example, which is a softly <strong>simplified</strong> version of the
213  <a href="http://json.org/">JSON language</a> (please, see the
214  <a href="https://tools.ietf.org/html/rfc4627">RFC4627</a>). The
215  <strong>complete</strong> grammar is located in the
216  <a href="@central_resource:path=Library/Json/Grammar.pp"><code>hoa://Library/Json/Grammar.pp</code></a>
217  file. Thus:</p>
218  <pre><code class="language-pp">%skip   space          \s
219// Scalars.
220%token  true           true
221%token  false          false
222%token  null           null
223// Strings.
224%token  quote_         "        -> string
225%token  string:string  [^"]+
226%token  string:_quote  "        -> default
227// Objects.
228%token  brace_         {
229%token _brace          }
230// Arrays.
231%token  bracket_       \[
232%token _bracket        \]
233// Rest.
234%token  colon          :
235%token  comma          ,
236%token  number         \d+
237
238value:
239    &amp;lt;true> | &amp;lt;false> | &amp;lt;null> | string() | object() | array() | number()
240
241string:
242    ::quote_:: &amp;lt;string> ::_quote::
243
244number:
245    &amp;lt;number>
246
247#object:
248    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::
249
250#pair:
251    string() ::colon:: value()
252
253#array:
254    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::</code></pre>
255  <p>Notice that we have two namespaces of lexemes: <code>default</code> and
256  <code>string</code> (it allows to avoid <strong>confusion</strong> with the
257  <code>quote_</code> and <code>string::_quote</code> lexemes that have the same
258  representation). We also notice the <code>value</code> rule, which is a
259  <strong>disjunction</strong> of several lexemes and rules. The
260  <strong>constructions</strong> of the PP language are the following:</p>
261  <ul>
262    <li><code><em>rule</em>()</code> to <strong>call</strong> a rule,</li>
263    <li><code>&amp;lt;<em>token</em>></code> and <code>::<em>token</em>::</code>
264    to <strong>declare</strong> a lexeme (namespaces do not appear here),</li>
265    <li><code>|</code> for a <strong>disjunction</strong> (a choice),</li>
266    <li><code>(<em>…</em>)</code> for a group,</li>
267    <li><code><em>e</em>?</code> to say that <code><em>e</em></code> is
268    <strong>optional</strong>,</li>
269    <li><code><em>e</em>+</code> to say that <code><em>e</em></code> can be
270    present <strong>1 or more</strong> times,</li>
271    <li><code><em>e</em>*</code> to say that <code><em>e</em></code> can be
272    present <strong>0 or more</strong> times,</li>
273    <li><code><em>e</em>{<em>x</em>,<em>y</em>}</code> to say that
274    <code><em>e</em></code> can be present between <code><em>x</em></code> and
275    <code><em>y</em></code> times,</li>
276    <li><code>#<em>node</em></code> to create a <strong>node</strong>
277    <code>node</code> in the resulting tree,</li>
278    <li><code><em>token</em>[<em>i</em>]</code> to <strong>unify</strong>
279    lexemes between each others.</li>
280  </ul>
281  <p>Few constructions but largely enough.</p>
282  <p>Finally, the grammar of the PP language is written… with the PP language!
283  You can find it in the
284  <a href="@central_resource:path=Library/Compiler/Llk/Llk.pp"><code>hoa://Library/Compiler/Llk/Llk.pp</code></a>
285  file.</p>
286
287  <h3 id="Compilation_process" for="main-toc">Compilation process</h3>
288
289  <div id="overview" class="schema"></div>
290  <script>
291  Hoa.Document.onReady(function ( ) {
292
293      var paper = Hoa.Graph(Hoa.$('#overview'), 800, 360);
294      var flow  = paper.grid(0, 0, 800, 360, 1, 4);
295      flow.push(paper.rect(0, 0, 200, 70, 3, 'lexical analyzer'));
296      flow.push(paper.rect(0, 0, 200, 70, 3, 'syntactic analyzer'));
297      flow.push(paper.rect(0, 0, 200, 70, 3, 'trace'));
298      flow.push(paper.rect(0, 0, 200, 70, 3, 'AST'));
299
300      var annot = paper.grid(180, 0, 80, 360, 1, 4);
301      annot.push(paper.rect(0, 0, 45, 45, 22.5, '1'));
302      annot.push(paper.rect(0, 0, 45, 45, 22.5, '2'));
303      annot.push(paper.rect(0, 0, 45, 45, 22.5, '3'));
304      annot.push(paper.rect(0, 0, 45, 45, 22.5, '4'));
305  });
306  </script>
307  <p>The compilation process used by <code>Hoa\Compiler\Llk\Llk</code> is
308  classical: it starts by <strong>lexically</strong> analyzing the textual data,
309  the word, i.e. to transform our data in a sequence of lexemes. The declaration
310  <strong>order</strong> of the lexemes is primordial because the lexical
311  analyzer will use them one after the other. Next, the
312  <strong>syntactic</strong> analyzer comes into play in order to
313  <strong>match</strong> our data.</p>
314  <p>If the syntactic analysis is successful, we obtain a
315  <strong>trace</strong>.  This trace can be transformed into an AST, stands for
316  Abstract Syntax Tree.  This tree represents our textual data after the
317  analysis. One advantage is that it can visited (we will detail this part
318  later), which allows us to add new <strong>constraints</strong> which can be
319  not be expressed in the grammar, such as type verification.</p>
320  <p>Let's manipulate <code>Hoa\Compiler\Llk\Llk</code> a little bit. This class
321  is a <strong>helper</strong> to easily read a grammar expressed with the PP
322  format. This class takes only one argument, which is an input stream that
323  points to the grammar and returns a <code>Hoa\Compiler\Llk\Parser</code>
324  compiler ready to use. On this compiler, we will call the
325  <code>Hoa\Compiler\Llk\Parser::parse</code> to analyze a JSON data. If the
326  data is correct, we will get an AST, otherwise an exception will be thrown.
327  Finally, we will use the <code>Hoa\Compiler\Visitor\Dump</code> visitor to
328  print our AST:</p>
329  <pre><code class="language-php">// 1. Load grammar.
330$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
331
332// 2. Parse a data.
333$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');
334
335// 3. Dump the AST.
336$dump     = new Hoa\Compiler\Visitor\Dump();
337echo $dump->visit($ast);
338
339/**
340 * Will output:
341 *     >  #object
342 *     >  >  #pair
343 *     >  >  >  token(string:string, foo)
344 *     >  >  >  token(true, true)
345 *     >  >  #pair
346 *     >  >  >  token(string:string, bar)
347 *     >  >  >  #array
348 *     >  >  >  >  token(null, null)
349 *     >  >  >  >  token(number, 42)
350 */</code></pre>
351  <p>When we write and test a grammar, we will repeat these three tasks very
352  <strong>often</strong>. This is why the <code>hoa</code> script provides the
353  <code>compiler:pp</code> command. This command proposes to analyze a data
354  based on a grammar and to apply a visitor if needed on the resulting AST.
355  Notice that the grammar can be read through a
356  <a href="https://en.wikipedia.org/wiki/Pipeline_(Unix)">pipe</a>:</p>
357  <pre><code class="language-shell">$ echo '[1, [1, [2, 3], 5], 8]' | hoa compiler:pp Json.pp 0 --visitor dump
358>  #array
359>  >  token(number, 1)
360>  >  #array
361>  >  >  token(number, 1)
362>  >  >  #array
363>  >  >  >  token(number, 2)
364>  >  >  >  token(number, 3)
365>  >  >  token(number, 5)
366>  >  token(number, 8)</code></pre>
367  <p>This is a useful way to <strong>quickly</strong> test a data against a
368  grammar. Do not hesitate to read the help of the <code>compiler:pp</code>
369  command to get more details!</p>
370  <p>The analyzes start with the <strong>root</strong> rule but we can use
371  <strong>another rule</strong> thanks to the second argument of the
372  <code>Hoa\Compiler\Llk\Parser::parse</code> method:</p>
373  <pre><code class="language-php">$compiler->parse('{"foo": true, "bar": [null, 42]}', 'object');</code></pre>
374  <p>To use the root rule, we have to use <code>null</code>.</p>
375  <p>And finally, to not generate the AST but only know if a data is valid or
376  not, we can use the last argument of this method by using
377  <code>false</code>:</p>
378  <pre><code class="language-php">$valid = $compiler->parse('{"foo": true, "bar": [null, 42]}', null, false);
379var_dump($valid);
380
381/**
382 * Will output:
383 *     bool(true)
384 */</code></pre>
385
386  <h4 id="Errors" for="main-toc">Errors</h4>
387
388  <p>The compilation errors are represented by exceptions, thus:</p>
389  <pre><code class="language-shell">$ echo '{"foo" true}' | hoa compiler:pp Json.pp 0 --visitor dump
390Uncaught exception (Hoa\Compiler\Exception\UnexpectedToken):
391Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "true" (true) at line 1
392and column 8:
393{"foo" true}
394       ↑
395in hoa://Library/Compiler/Llk/Parser.php at line 1</code></pre>
396  <p>Several exceptions can be thrown according to the context:</p>
397  <ul>
398    <li>during the <strong>lexical</strong> analysis,
399    <code>Hoa\Compiler\Exception\UnrecognizedToken</code> when a lexeme is not
400    recognized, i.e. when the textual data can not be split into a sequence of
401    lexemes, and <code>Hoa\Compiler\Exception\Lexer</code> when other more
402    general errors happen, for example if a lexeme matches an empty value,</li>
403    <li>during the <strong>syntactic</strong> analysis,
404    <code>Hoa\Compiler\Exception\UnexpectedToken</code> when a lexeme is not
405    expected, i.e. it is not able to derive the rules of the grammar.</li>
406  </ul>
407  <p>The parent exception is <code>Hoa\Compiler\Exception\Exception</code>.</p>
408
409  <h4 id="Nodes" for="main-toc">Nodes</h4>
410
411  <p>The compilation process very often succeeds by the
412  <strong>production</strong> of an AST. It is important to control its
413  <strong>form</strong>, its <strong>size</strong>, the data it
414  <strong>holds</strong> etc. This is why it is necessary to understand the
415  <code>#<em>node</em></code> notation because it allows to create
416  <strong>nodes</strong> in the AST. First, lexemes declared with the
417  <code>&amp;lt;<em>token</em>></code> syntax will appear in the tree, while the
418  other lexemes, declared with the <code>::<em>token</em>::</code> syntax, will
419  not appear. Indeed, in our last example, the <code>quote_</code>,
420  <code>brace_</code>, <code>colon</code>, <code>comma</code> etc. lexemes do
421  not appear. Next, it is important to notice that the declaration of nodes can
422  be <strong>overwritten</strong> inside a <strong>same rule</strong>. Finally,
423  a rule name can be preceeded by <code>#</code>, just like the
424  <code>#array</code> rule, that allows to define a <strong>default</strong>
425  node, but it can be overwritten. For example, if we replace the
426  <code>array</code> rule by:</p>
427  <pre><code class="language-pp">#array:
428    ::bracket_:: value() ( ::comma:: value() #bigarray )* ::_bracket::</code></pre>
429  <p>If the array contains only one value, the node will be named
430  <code>#array</code>, else it will be named <code>#bigarray</code>. Let's
431  see:</p>
432  <pre><code class="language-shell">$ echo '[42]' | hoa compiler:pp Json.pp 0 --visitor dump
433>  #array
434>  >  token(number, 42)
435$ echo '[4, 2]' | hoa compiler:pp Json.pp 0 --visitor dump
436>  #bigarray
437>  >  token(number, 4)
438>  >  token(number, 2)</code></pre>
439  <p>Of course, a node can be created or not based on the
440  <strong>derivation</strong> of a rule. The mecanism is normally pretty much
441  <strong>intuitive</strong>.</p>
442
443  <h4 id="Namespaces" for="main-toc">Namespaces</h4>
444
445  <p>Let's detail a little bit the behavior of the lexical analyzer regarding
446  the <strong>namespaces</strong>.</p>
447  <p><strong>Lexemes</strong> are put in namespaces, it means that only the
448  lexemes of the <strong>current</strong> namespace are used by the lexical
449  analyzer. The default namespace is <code>default</code>. To declare the
450  namespace of a lexeme, we have to use the <code>:</code> operator. When a
451  lexeme is consumed, it is able to <strong>change</strong> the current
452  namespace for the rest of the lexical analysis, thanks to the <code>-></code>
453  operator. Thus, we will declare five lexemes: <code>foo</code> and
454  <code>bar</code> in the <code>default</code> namespace, <code>baz</code> in
455  <code>ns1</code> and <code>qux</code> and <code>foo</code> in
456  <code>ns2</code>. The fact that <code>foo</code> is declared two times is not
457  embarrassing because it is put in <strong>different</strong> namespaces:</p>
458  <pre><code class="language-pp">%token      foo   fo+     -> ns1
459%token      bar   ba?r+   -> ns2
460%token  ns1:baz   ba?z+   -> default
461%token  ns2:qux   qux+
462%token  ns2:foo   FOO     -> ns1</code></pre>
463  <p>Writing <code>default:bar</code> is strictly equivalent to
464  <code>bar</code>. The <code>foo</code> lexeme leads into <code>ns1</code>,
465  <code>bar</code> into <code>ns2</code>, <code>ns1:baz</code> into
466  <code>default</code>, <code>ns2:qux</code> stays in <code>ns2</code> and
467  <code>ns2:foo</code> leads into <code>ns1</code>. Let's observe the sequence
468  of lexemes produced by the lexical analyzer with the following data
469  <code>fooooobzzbarrrquxFOObaz</code>:</p>
470  <pre><code class="language-php">$pp = '%token      foo   fo+     -> ns1'     . "\n" .
471      '%token      bar   ba?r+   -> ns2'     . "\n" .
472      '%token  ns1:baz   ba?z+   -> default' . "\n" .
473      '%token  ns2:qux   qux+'               . "\n" .
474      '%token  ns2:foo   FOO     -> ns1';
475
476// 1. Parse PP.
477Hoa\Compiler\Llk\Llk::parsePP($pp, $tokens, $rawRules);
478
479// 2. Run the lexical analyzer.
480$lexer    = new Hoa\Compiler\Llk\Lexer();
481$sequence = $lexer->lexMe('fooooobzzbarrrquxFOObaz', $tokens);
482
483// 3. Pretty-print the result.
484$format   = '%' . (strlen((string) count($sequence)) + 1) . 's  ' .
485            '%-13s %-20s  %-20s  %6s' . "\n";
486$header   = sprintf($format, '#', 'namespace', 'token name', 'token value', 'offset');
487
488echo $header, str_repeat('-', strlen($header)), "\n";
489
490foreach ($sequence as $i => $token) {
491    printf(
492        $format,
493        $i,
494        $token['namespace'],
495        $token['token'],
496        $token['value'],
497        $token['offset']
498    );
499}
500
501/**
502 * Will output:
503 *      #  namespace     token name            token value           offset
504 *     ---------------------------------------------------------------------
505 *      0  default       foo                   fooooo                     0
506 *      1  ns1           baz                   bzz                        6
507 *      2  default       bar                   barrr                      9
508 *      3  ns2           qux                   qux                       14
509 *      4  ns2           foo                   FOO                       17
510 *      5  ns1           baz                   baz                       20
511 *      6  default       EOF                   EOF                       23
512 */</code></pre>
513  <p>We read the lexemes, their namespace and their default value in the table.
514  The data has been successfully <strong>splitted</strong>, and we have jumped
515  from one namespace to the other. If this time we try with the
516  <code>foqux</code> data, we will have an error: <code>fo</code> is matched by
517  the <code>foo</code> lexeme in the <code>default</code> namespace, we then
518  jump to the <code>ns1</code> namespace, and here, no lexeme in this namespace
519  is able to recognize at least <code>q</code>. An error is thrown.</p>
520  <p>So far, we have seen how to jump from one namespace to another with the
521  <code>-></code> operator. No <strong>history</strong> about the used
522  namespaces is stored. However, in some rare cases, it is able for a lexeme to
523  be accessible through <strong>several</strong> namespaces and we would like
524  that a lexeme jumps back to the <strong>previous</strong> namespace. Put
525  another way, we would like a history of the used namespaces and being able to
526  navigate (backward only) in it. This is the role of the
527  <code>__shift__ * <em>n</em></code> keyword, in conjunction with the
528  <code>-></code> operator as usual. <code>__shift__</code> is equivalent to
529  say: jump back to the previous namespace. <code>__shift__</code> is equivalent
530  to <code>__shift__ * 1</code>, and <code>__shift__ * <em>n</em></code> to say:
531  jump back <code><em>n</em></code> times to the previous namespace.</p>
532  <p>When the <code>__shift__</code> keyword appears in the grammar, namespaces
533  are managed like a <strong>stack</strong>, hence the <em>shift</em>
534  vocabulary. We have to take care to shift namespaces regularly in order to
535  avoid a huge stack.</p>
536  <p>Let's take the example of the following grammar:</p>
537  <pre><code class="language-pp">%token       foo1  a    -> ns1
538%token       foo2  x    -> ns2
539%token       end   e
540
541%token   ns1:bar   b
542%token   ns1:baz   c    -> ns3
543%token   ns1:tada  t    -> __shift__
544
545%token   ns2:bar   y
546%token   ns2:baz   z    -> ns3
547%token   ns2:tada  T    -> __shift__
548
549%token   ns3:qux   =   -> __shift__
550
551#test:
552    ( &amp;lt;foo1> | &amp;lt;foo2> ) &amp;lt;bar> &amp;lt;baz> &amp;lt;qux> &amp;lt;tada> &amp;lt;end></code></pre>
553  <p>This grammar recognizes two data: <code>abc=te</code> and
554  <code>xyz=Te</code>. If the first lexeme <code>foo1</code> is matched, the
555  syntactic analyzer will jump to the <code>ns1</code> namespace. When it will
556  match the <code>ns1:baz</code> namespace, it will jump into <code>ns3</code>.
557  Then, when it will match <code>ns3:qux</code>, it will jump back to the
558  previous namespace, being <code>ns1</code>. And so on. Now, if it does not
559  match <code>foo1</code> but <code>foo2</code>, it will jump to the
560  <code>ns2</code> namespace, and then in <code>ns3</code> and then in
561  <code>ns2</code> again. The <code>__shift__</code> keywords in
562  <code>ns1:tada</code> and <code>ns2:tada</code> are only used to shift
563  namespaces but no ambiguity exists: these namespaces are accessible by only
564  one namespace, namely <code>default</code>.</p>
565  <p>Now, let's save this grammar in the <code>NamespaceStack.pp</code> file and
566  use the <code>-s/--token-sequence</code> option of the <code>hoa
567  compiler:pp</code> command:</p>
568  <pre><code class="language-shell">$ echo -n 'abc=te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
569 #  namespace     token name            token value                     offset
570-------------------------------------------------------------------------------
571 0  default       foo1                  a                                    0
572 1  ns1           bar                   b                                    1
573 2  ns1           baz                   c                                    2
574 3  ns3           qux                   =                                    3
575 4  ns1           tada                  t                                    4
576 5  default       end                   e                                    5
577 6  default       EOF                   EOF                                  6
578
579$ echo -n 'xyz=Te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
580 #  namespace     token name            token value                     offset
581-------------------------------------------------------------------------------
582 0  default       foo2                  x                                    0
583 1  ns2           bar                   y                                    1
584 2  ns2           baz                   z                                    2
585 3  ns3           qux                   =                                    3
586 4  ns2           tada                  T                                    4
587 5  default       end                   e                                    5
588 6  default       EOF                   EOF                                  6</code></pre>
589  <p>We see that the lexical analyzer has successfully jumped between all
590  namespaces, as expected. We had two ways to access to the <code>ns3</code>
591  namespace: either from <code>ns1</code> or from <code>ns2</code>. The analyzer
592  has succeeded to create a history of namespaces and navigate in it.</p>
593
594  <h4 id="Unification" for="main-toc">Unification</h4>
595
596  <p>A new feature brought by the PP language regarding other grammar languages
597  is the capability to express a <strong>unification</strong> of lexemes. Let's
598  imagine the following grammar:</p>
599  <pre><code class="language-pp">%token  quote   '|"
600%token  string  \w+
601
602rule:
603    ::quote:: &amp;lt;string> ::quote::</code></pre>
604  <p>The quotes surrounding the string can be of two kinds: simple, with the
605  “<code>'</code>” symbol, or double, with the “<code>"</code>” symbol. Thus,
606  the <code>"foo"</code> and <code>'foo'</code> data are valid, but
607  <strong>also</strong> <code>"foo'</code> and <code>'foo"</code>!</p>
608  <p>The unification of lexemes allows to add an additionnal
609  <strong>constraint</strong> on the <strong>value</strong> of lexemes at
610  runtime. The syntax is the following: <code><em>token</em>[<em>i</em>]</code>.
611  The value of <code><em>i</em></code> indicates what lexemes will have the same
612  value. And finally, the unification is <strong>locale</strong> to an instance
613  of a rule, it means that there is no unification between lexemes of different
614  rules and the unification is only applied on the <strong>current</strong>
615  rule. Thus, the example becomes:</p>
616  <pre><code class="language-pp">rule:
617    ::quote[0]:: &amp;lt;string> ::quote[0]::</code></pre>
618  <p>Which invalidates the <code>"foo'</code> and <code>'foo"</code> data.</p>
619  <p>Unification finds numerous applications, such as the name of opening and
620  closing tags in the <a href="http://w3.org/TR/xml11/">XML language</a>.</p>
621
622  <h3 id="Abstract_Syntax_Tree" for="main-toc">Abstract Syntax Tree</h3>
623
624  <div id="overview_ast" class="schema"></div>
625  <script>
626  Hoa.Document.onReady(function ( ) {
627
628      var paper = Hoa.Graph(Hoa.$('#overview_ast'), 800, 360);
629      var flow  = paper.grid(0, 0, 800, 360, 1, 4);
630      flow.push(paper.rect(0, 0, 200, 70, 3, 'lexical analyzer').attr({opacity: .3}));
631      flow.push(paper.rect(0, 0, 200, 70, 3, 'syntactic analyzer').attr({opacity: .3}));
632      flow.push(paper.rect(0, 0, 200, 70, 3, 'trace').attr({opacity: .3}));
633      flow.push(paper.rect(0, 0, 200, 70, 3, 'AST'));
634
635      var annot = paper.grid(180, 0, 80, 360, 1, 4);
636      annot.push(paper.rect(0, 0, 45, 45, 22.5, '1').attr({opacity: .3}));
637      annot.push(paper.rect(0, 0, 45, 45, 22.5, '2').attr({opacity: .3}));
638      annot.push(paper.rect(0, 0, 45, 45, 22.5, '3').attr({opacity: .3}));
639      annot.push(paper.rect(0, 0, 45, 45, 22.5, '4'));
640  });
641  </script>
642  <p>An <strong>abstract syntax</strong> tree represents a textual data in a
643  <strong>structural</strong> form. Each <strong>node</strong> of this tree is
644  represented by the <code>Hoa\Compiler\Llk\TreeNode</code> class. Among the
645  most useful methods, we find:</p>
646  <ul>
647    <li><code>getId</code> to get the identifier of the node,</li>
648    <li><code>getValueToken</code> to get the name of the lexeme,</li>
649    <li><code>getValueValue</code> to get the value of the lexeme,</li>
650    <li><code>isToken</code> if the token represents a lexeme,</li>
651    <li><code>getChild(<em>$i</em>)</code> to get the
652    <code><em>$i</em></code>-th child of the node,</li>
653    <li><code>getChildren</code> to get all the children,</li>
654    <li><code>getChildrenNumber</code> to count the number of children,</li>
655    <li><code>getParent</code> to get the parent node,</li>
656    <li><code>getData</code> to get a reference to a data array,</li>
657    <li><code>accept</code> to apply the visitor.</li>
658  </ul>
659  <p>Visitors are the most simple way to <strong>cross</strong> an AST. As an
660  example, let's write the <code>PrettyPrinter</code> visitor which will rewrite
661  a JSON data with our own conventions (spacing, line breaking etc.). A visitor
662  must implement the <code>Hoa\Visitor\Visit</code> interface and will have only
663  one method to implement: <code>visit</code> with three arguments:
664  the element and two optional arguments (by reference and by copy). Let's
665  see:</p>
666  <pre><code class="language-php">class PrettyPrinter implements Hoa\Visitor\Visit
667{
668    public function visit (
669        Hoa\Visitor\Element $element,
670        &amp;amp;$handle = null,
671        $eldnah = null
672    ) {
673        static $i      = 0;
674        static $indent = '    ';
675
676        // One behaviour per node in the AST.
677        switch ($element->getId()) {
678            // Object: { … }.
679            case '#object':
680                echo '{', "\n";
681                ++$i;
682
683                foreach ($element->getChildren() as $e => $child) {
684                    if (0 &amp;lt; $e) {
685                        echo ',', "\n";
686                    }
687
688                    echo str_repeat($indent, $i);
689                    $child->accept($this, $handle, $eldnah);
690                }
691
692                echo "\n", str_repeat($indent, --$i), '}';
693
694                break;
695
696            // Array: [ … ].
697            case '#array':
698                echo '[', "\n";
699                ++$i;
700
701                foreach ($element->getChildren() as $e => $child) {
702                    if (0 &amp;lt; $e) {
703                        echo ',', "\n";
704                    }
705
706                    echo str_repeat($indent, $i);
707                    $child->accept($this, $handle, $eldnah);
708                }
709
710                echo "\n", str_repeat($indent, --$i), ']';
711
712                break;
713
714            // Pair: "…": ….
715            case '#pair':
716                echo
717                    $element->getChild(0)->accept($this, $handle, $eldnah),
718                    ': ',
719                    $element->getChild(1)->accept($this, $handle, $eldnah);
720
721                break;
722
723            // Many tokens.
724            case 'token':
725                switch ($element->getValueToken()) {
726                    // String: "…".
727                    case 'string':
728                        echo '"', $element->getValueValue(), '"';
729
730                        break;
731
732                    // Booleans.
733                    case 'true':
734                    case 'false':
735
736                    // Null.
737                    case 'null':
738
739                    // Number.
740                    case 'number':
741                        echo $element->getValueValue();
742
743                        break;
744                }
745
746                break;
747        }
748    }
749}</code></pre>
750  <p>We are going to see an example:</p>
751  <pre><code class="language-php">$compiler    = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
752$ast         = $compiler->parse('{"foo": true, "bar": [null, 42]}');
753$prettyPrint = new PrettyPrinter();
754echo $prettyPrint->visit($ast);
755
756/**
757 * Will output:
758 *     {
759 *         "foo": true,
760 *         "bar": [
761 *             null,
762 *             42
763 *         ]
764 *     }
765 */</code></pre>
766  <p>The <code>getData</code> method is very useful to <strong>store</strong>
767  data that are likely to be <strong>reused</strong>, for example from one
768  visitor to another. This method returns a <strong>reference</strong> to an
769  array; thus:</p>
770  <pre><code class="language-php">$data = $element->getData();
771
772if (!isset($data['previousComputing'])) {
773    throw new Exception('Need a previous computing.', 0);
774}
775
776$previous = $data['previousComputing'];</code></pre>
777  <p>It is common to use one visitor per <strong>constraint</strong>:
778  verification of symbols, verification of types etc. Some of them can set data
779  for the next ones. The <code>getData</code> method does not require any
780  data structure, it only provides an access to an array. Feel free to do what
781  you want with it.</p>
782  <p>Using the <code>Hoa\Compiler\Llk\TreeNode</code> is very
783  <strong>trivial</strong> and we will use it most of the time with a
784  visitor.</p>
785
786  <h3 id="Traces" for="main-toc">Traces</h3>
787
788  <div id="overview_trace" class="schema"></div>
789  <script>
790  Hoa.Document.onReady(function ( ) {
791
792      var paper = Hoa.Graph(Hoa.$('#overview_trace'), 800, 360);
793      var flow  = paper.grid(0, 0, 800, 360, 1, 4);
794      flow.push(paper.rect(0, 0, 200, 70, 3, 'lexical analyzer').attr({opacity: .3}));
795      flow.push(paper.rect(0, 0, 200, 70, 3, 'syntactic analyzer').attr({opacity: .3}));
796      flow.push(paper.rect(0, 0, 200, 70, 3, 'trace'));
797      flow.push(paper.rect(0, 0, 200, 70, 3, 'AST').attr({opacity: .3}));
798
799      var annot = paper.grid(180, 0, 80, 360, 1, 4);
800      annot.push(paper.rect(0, 0, 45, 45, 22.5, '1').attr({opacity: .3}));
801      annot.push(paper.rect(0, 0, 45, 45, 22.5, '2').attr({opacity: .3}));
802      annot.push(paper.rect(0, 0, 45, 45, 22.5, '3'));
803      annot.push(paper.rect(0, 0, 45, 45, 22.5, '4').attr({opacity: .3}));
804  });
805  </script>
806  <p>The LL(<em>k</em>) compiler provided by Hoa is clearly composed of
807  <strong>layers</strong>. Each layer can be used with a maximal modularity and
808  extensibility. When the grammar is translated into “machine rules” and that
809  the lexical and syntactic analyzers have validated a data, a
810  <strong>trace</strong> is computed. This trace is a flat array containing only
811  three kind of classes:</p>
812  <ul>
813    <li><code>Hoa\Compiler\Llk\Rule\Entry</code> when the syntactic analyzer has
814    opened a rule,</li>
815    <li><code>Hoa\Compiler\Llk\Rule\Ekzit</code> when the syntactic analyzer has
816    closed a rule,</li>
817    <li><code>Hoa\Compiler\Llk\Rule\Token</code> when the syntactic analyzer has
818    encountered a lexeme.</li>
819  </ul>
820  <p>We can get the trace with the
821  <code>Hoa\Compiler\Llk\Parser::getTrace</code> method. To understand this
822  trace, we will start with an example:</p>
823  <pre><code class="language-php">$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
824$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');
825$i        = 0;
826
827foreach ($compiler->getTrace() as $element) {
828    if ($element instanceof Hoa\Compiler\Llk\Rule\Entry) {
829        echo str_repeat('>   ', ++$i), 'enter ', $element->getRule(), "\n";
830    } elseif ($element instanceof Hoa\Compiler\Llk\Rule\Token) {
831        echo
832            str_repeat('    ', $i + 1), 'token ', $element->getTokenName(),
833            ', consumed ', $element->getValue(), "\n";
834    } else {
835        echo str_repeat('&amp;lt;   ', $i--), 'ekzit ', $element->getRule(), "\n";
836    }
837}
838
839/**
840 * Will output:
841 *     >   enter value
842 *     >   >   enter object
843 *                 token brace_, consumed {
844 *     >   >   >   enter pair
845 *     >   >   >   >   enter string
846 *                         token quote_, consumed "
847 *                         token string, consumed foo
848 *                         token _quote, consumed "
849 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit string
850 *                     token colon, consumed :
851 *     >   >   >   >   enter value
852 *                         token true, consumed true
853 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit value
854 *     &amp;lt;   &amp;lt;   &amp;lt;   ekzit pair
855 *     >   >   >   enter 13
856 *     >   >   >   >   enter 12
857 *                         token comma, consumed ,
858 *     >   >   >   >   >   enter pair
859 *     >   >   >   >   >   >   enter string
860 *                                 token quote_, consumed "
861 *                                 token string, consumed bar
862 *                                 token _quote, consumed "
863 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit string
864 *                             token colon, consumed :
865 *     >   >   >   >   >   >   enter value
866 *     >   >   >   >   >   >   >   enter array
867 *                                     token bracket_, consumed [
868 *     >   >   >   >   >   >   >   >   enter value
869 *                                         token null, consumed null
870 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit value
871 *     >   >   >   >   >   >   >   >   enter 21
872 *     >   >   >   >   >   >   >   >   >   enter 20
873 *                                             token comma, consumed ,
874 *     >   >   >   >   >   >   >   >   >   >   enter value
875 *                                                 token number, consumed 42
876 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit value
877 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit 20
878 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit 21
879 *                                     token _bracket, consumed ]
880 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit array
881 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit value
882 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit pair
883 *     &amp;lt;   &amp;lt;   &amp;lt;   &amp;lt;   ekzit 12
884 *     &amp;lt;   &amp;lt;   &amp;lt;   ekzit 13
885 *                 token _brace, consumed }
886 *     &amp;lt;   &amp;lt;   ekzit object
887 *     &amp;lt;   ekzit value
888 */</code></pre>
889  <p>This example reveals several things. First, the informations given by the
890  trace can be useful: if we are on a rule, we have its <strong>name</strong>
891  (with the <code>getRule</code> method), and if we are on a lexeme, we have its
892  <strong>name</strong> (with the <code>getTokenName</code> method), its
893  <strong>representation</strong> (expressed in the PCRE format, with the
894  <code>getRepresentation</code> method), its <strong>value</strong> (with the
895  <code>getValue</code> method), if it is a node to <strong>keep</strong> in the
896  AST (with the <code>isKept</code> method) and its <strong>unification</strong>
897  index if it exists (with the <code>getUnificationIndex</code> method). Of
898  course, all these informations can be edited, which can influence the
899  construction of the AST which is the (optional) next step.</p>
900  <p>Next, we notice that sometimes we enter in a rule that exists in the
901  grammar, like <code>object</code>, <code>pair</code>, <code>value</code> etc.,
902  and sometimes we enter in a <strong>numerical</strong> rule, like
903  <code>13</code>, <code>12</code>, <code>21</code>, <code>20</code> etc. When
904  the grammar is interpreted in order to be translated into “machine rules”,
905  each one of these rules is <strong>linearized</strong> regarding the PP
906  operators:</p>
907  <ul>
908    <li><code>Hoa\Compiler\Llk\Rule\Choice</code> for a disjunction</li>
909    <li><code>Hoa\Compiler\Llk\Rule\Concatenation</code> for a
910    concatenation,</li>
911    <li><code>Hoa\Compiler\Llk\Rule\Repetition</code> for a repetition,</li>
912    <li><code>Hoa\Compiler\Llk\Rule\Token</code> for a token (like seen
913    above).</li>
914  </ul>
915  <p>All these rules in this format are accessible through the
916  <code>Hoa\Compiler\Llk\Parser::getRules</code> methods, represented by an
917  array. We will print all the <strong>accessible</strong> rules from the root
918  for a better understanding:</p>
919  <pre><code class="language-php">$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
920
921// 1. Get all rules.
922$rules    = $compiler->getRules();
923
924// 2. Start with the root rule.
925$stack    = [$compiler->getRootRule() => true];
926
927while (false !== current($stack)) {
928    $rule = key($stack);
929    next($stack);
930    echo
931        "\n", '"', $rule, '" is a ',
932        strtolower(substr(get_class($rules[$rule]), 22));
933
934    $subrules = $rules[$rule]->getContent();
935
936    // 3a. Token.
937    if (null === $subrules) {
938        continue;
939    }
940
941    echo ' of rules: ';
942
943    // 3b. Other rules.
944    foreach ((array) $rules[$rule]->getContent() as $subrule) {
945        if (!array_key_exists($subrule, $stack)) {
946            // 4. Unshift new rules to print.
947            $stack[$subrule] = true;
948        }
949
950        echo $subrule, ' ';
951    }
952}
953
954/**
955 * Will output:
956 *     "value" is a choice of rules: 1 2 3 string object array number
957 *     "1" is a token
958 *     "2" is a token
959 *     "3" is a token
960 *     "string" is a concatenation of rules: 5 6 7
961 *     "object" is a concatenation of rules: 10 pair 13 14
962 *     "array" is a concatenation of rules: 18 value 21 22
963 *     "number" is a token
964 *     "5" is a token
965 *     "6" is a token
966 *     "7" is a token
967 *     "10" is a token
968 *     "pair" is a concatenation of rules: string 16 value
969 *     "13" is a repetition of rules: 12
970 *     "14" is a token
971 *     "18" is a token
972 *     "21" is a repetition of rules: 20
973 *     "22" is a token
974 *     "16" is a token
975 *     "12" is a concatenation of rules: 11 pair
976 *     "20" is a concatenation of rules: 19 value
977 *     "11" is a token
978 *     "19" is a token
979 */</code></pre>
980  <p>If we read the <code>object</code> rule, we know that it is a concatenation
981  of the <code>10</code>, <code>pair</code>, <code>13</code> and <code>14</code>
982  rules. <code>10</code> is a lexeme, <code>pair</code> is a concatenation of
983  the <code>string</code>, <code>16</code> and <code>value</code> rules, and so
984  on. The initial grammar is translated in order to be in its
985  <strong>tiniest</strong> form. It allows to <strong>reason</strong> on rules
986  in an <strong>easier</strong> and <strong>faster</strong> way. Indeed,
987  computations on the grammar are not restricted to AST. In the previous step,
988  with the trace, we are already able to apply computations.</p>
989
990  <h3 id="Generation" for="main-toc">Generation</h3>
991
992  <p>A grammar can be used to satisfy two goals: to <strong>validate</strong> a
993  data (if we see it as an automata) or to <strong>generate</strong> a data. So
994  far, we have seen how to validate a data through several steps:
995  <strong>initialization</strong> of our compiler, running of the lexical and
996  syntactic <strong>analyzers</strong>, producing a <strong>trace</strong>,
997  transformation of this trace into an <strong>AST</strong> in order to
998  <strong>visit</strong> that tree. But we can stop at the first step, the
999  initialization of our compiler, to exploit the rules in order to generate a
1000  data which will be valid regarding our grammar.</p>
1001  <p><code>Hoa\Compiler\Llk\Sampler</code> provides three data
1002  <strong>generation</strong> algorithms:</p>
1003  <ul>
1004    <li>random uniform generation,</li>
1005    <li>bounded exhaustive generation,</li>
1006    <li>grammar coverage-based generation.</li>
1007  </ul>
1008  <p>Why providing three algorithms? Because it is illusory to think that one
1009  algorithm can satisfy <strong>numerous</strong> context of usages. Each one
1010  fits different needs, we will explain them later.</p>
1011  <p>Generating a data based on a grammar requires <strong>two
1012  steps</strong>:</p>
1013  <ol>
1014    <li>generating values for all <strong>lexemes</strong> in order to get raw
1015    data,</li>
1016    <li>generating a <strong>path</strong> in the rules of the grammar.</li>
1017  </ol>
1018  <p>A path is equivalent to a derivation, the vocabulary is different according
1019  to our goal: validation or generation.</p>
1020
1021  <h4 id="Isotropic_generation_of_lexemes" for="main-toc">Isotropic generation
1022  of lexemes</h4>
1023
1024  <p>In order to generate values for lexemes, no matter what algorithm is used,
1025  it has to be <strong>fast</strong>. We are going to use a process called
1026  <strong>isotropic</strong>. We start with a rule and we go forward only by
1027  using <strong>random</strong> and <strong>locally uniform</strong> (only for
1028  this choice) choices. For instance, if we have a disjunction between three
1029  sub-rules, we will randomly and uniformly choose between 1 and 3. If we have a
1030  concatenation, we will just concatenate each parts. And finally, a repetition
1031  is nothing more than a disjunction of concatenation: indeed,
1032  <code><em>e</em>{1,3}</code> is strictly equivalent to <code><em>e</em> |
1033  <em>ee</em> | <em>eee</em></code>. Let's see:</p>
1034  <pre><code>([ae]+|[x-z]!){1,3}              <em>repeat <em>[ae]+|[x-z]!</em> 2 times</em>
1035→ ([ae]+|[x-z]!)([ae]+|[x-z]!)  <em>choose between <em>[ae]+</em> and <em>[x-z]!</em></em>
1036→ ([ae]+)([ae]+|[x-z]!)         <em>repeat <code>ae</code> 2 times</em>
1037→ [ae][ae]([ae]+|[x-z]!)        <em>choose between <em>a</em> and <em>e</em></em>
1038→ e[ae]([ae]+|[x-z]!)           <em>again</em>
1039→ ea([ae]+|[x-z]!)              <em>choose between <em>[ae]+</em> and <em>[x-z]!</em></em>
1040→ ea([x-z]!)                    <em>choose between <em>x</em>, <em>y</em> and <em>z</em></em>
1041→ eay!</code></pre>
1042  <p>This generation is naive but this is not important. What is important is
1043  the path generation in rules, or put another way, the generation of
1044  <strong>sequences of lexemes</strong>.</p>
1045
1046  <h4 id="Uniform_random_generation" for="main-toc">Random uniform
1047  generation</h4>
1048
1049  <p>The first algorithm is the <strong>random</strong> and
1050  <strong>uniform</strong> generation. This algorithm is useful if we would like
1051  to generate sequences of lexemes of a <strong>given size <em>n</em></strong>
1052  with a <strong>uniformity</strong>, not local (like in the isotropic
1053  generation), but on the <strong>set</strong> of all the possible sequences.
1054  Succinctly, the algorithm works in two steps: <strong>pre-computation</strong>
1055  (once per size) followed by a <strong>generation</strong>. The pre-computation
1056  is an <strong>automatic</strong> step and computes the <strong>number</strong>
1057  of all the possible sequences and sub-sequences of size <em>n</em>. This step
1058  helps to compute <strong>cumulative distribution functions</strong> in order
1059  to <strong>guide</strong> the final sequences of lexemes.</p>
1060  <p>We will generate 10 random data of size 7, it means composed of 7 lexemes.
1061  That's why we will use the <code>Hoa\Compiler\Llk\Sampler\Uniform</code> class
1062  that takes our grammar in first argument, the lexeme value generator in second
1063  argument (here <code>Hoa\Regex\Visitor\Isotropic</code>) and finally a
1064  size:</p>
1065  <pre><code class="language-php">$sampler = new Hoa\Compiler\Llk\Sampler\Uniform(
1066    // Grammar.
1067    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
1068    // Token sampler.
1069    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
1070    // Length.
1071    7
1072);
1073
1074for ($i = 0; $i &amp;lt; 10; ++$i) {
1075    echo $i, ' => ', $sampler->uniform(), "\n";
1076}
1077
1078/**
1079 * Will output:
1080 *     0 => [ false , null , null ]
1081 *     1 => [ " l " , null ]
1082 *     2 => [ [ true ] , true ]
1083 *     3 => [ [ [ 4435 ] ] ]
1084 *     4 => [ [ [ 9366 ] ] ]
1085 *     5 => [ true , false , null ]
1086 *     6 => { " |h&amp;lt;# " : false }
1087 *     7 => [ [ [ false ] ] ]
1088 *     8 => [ false , true , 7 ]
1089 *     9 => [ false , 5 , 79 ]
1090 */</code></pre>
1091  <p>We can redefine the size with the
1092  <code>Hoa\Compiler\Llk\Sampler\Uniform::setLength</code> method. We have
1093  noticed that some repetition operators have no upper bound, like
1094  <code>+</code> or <code>*</code>; in this case, we set it to <em>n</em>.</p>
1095
1096  <h4 id="Bounded_exhaustive_generation" for="main-toc">Bounded exhaustive
1097  generation</h4>
1098
1099  <p>The second algorithm is the <strong>bounded exhaustive</strong> generation.
1100  This algorithm generates <strong>all</strong> sequences of lexemes (for the
1101  exhaustive side) from size 1 <strong>to</strong> <em>n</em> (for the
1102  bounded side). An interesting aspect of this algorithm is that the
1103  exhaustivity is kind of a <strong>proof</strong>! The algorithm behaves like
1104  an iterator and is very simple to use:</p>
1105  <pre><code class="language-php">$sampler = new Hoa\Compiler\Llk\Sampler\BoundedExhaustive(
1106    // Grammar.
1107    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
1108    // Token sampler.
1109    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
1110    // Length.
1111    7
1112);
1113
1114foreach ($sampler as $i => $data) {
1115    echo $i, ' => ', $data, "\n";
1116}
1117
1118/**
1119 * Will output:
1120 *     0 => true
1121 *     1 => false
1122 *     2 => null
1123 *     3 => " 8u2 "
1124 *     4 => { " ,M@aj{ " : true }
1125 *     5 => { " x`|V " : false }
1126 *     6 => { " NttB " : null }
1127 *     7 => { " eJWwA " : 0 }
1128 *     8 => [ true ]
1129 *     9 => [ true , true ]
1130 *     10 => [ true , true , true ]
1131 *     11 => [ true , true , false ]
1132 *     12 => [ true , true , null ]
1133 *     13 => [ true , true , 32 ]
1134 *     14 => [ true , false ]
1135 *     15 => [ true , false , true ]
1136 *     16 => [ true , false , false ]
1137 *     17 => [ true , false , null ]
1138 *     18 => [ true , false , 729 ]
1139 *     19 => [ true , null ]
1140 *     20 => [ true , null , true ]
1141 *     21 => [ true , null , false ]
1142 *     …
1143 *     157 => [ 780 , 01559 , 32 ]
1144 *     158 => 344
1145 */</code></pre>
1146  <p>Like the previous algorithm, we can redefine the upper bound with the
1147  <code>Hoa\Compiler\Llk\Sampler\BoundedExhaustive::setLength</code> method. And
1148  the repetition operators with no upper bounds are bounded to <em>n</em>.</p>
1149
1150  <h4 id="Grammar_coverage-based_generation" for="main-toc">Grammar
1151  coverage-based generation</h4>
1152
1153  <p>The last algorithm is the <strong>grammar coverage-based</strong>
1154  generation. This algorithm reduces the <strong>combinatory explosion</strong>
1155  encountered with the previous algorithm but the goal is different: it will
1156  generate sequences of lexemes that will “<strong>activate</strong>” all the
1157  rules <strong>branches</strong> of the grammar. A rule is said covered if and
1158  only if all its sub-rules are covered, and a lexeme is said covered if it has
1159  been used in a sequence. To ensure a <strong>diversity</strong> in the
1160  produced sequences, choice points between uncovered sub-rules are resolved
1161  <strong>randomly</strong>. The algorithm also behaves like an iterator:</p>
1162  <pre><code class="language-php">$sampler = new Hoa\Compiler\Llk\Sampler\Coverage(
1163    // Grammar.
1164    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
1165    // Token sampler.
1166    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random())
1167);
1168
1169foreach ($sampler as $i => $data) {
1170    echo $i, ' => ', $data, "\n";
1171}
1172
1173/**
1174 * Will output:
1175 *     0 => true
1176 *     1 => { " )o?bz " : null , " %3W) " : [ false , 130 , " 6 " ] }
1177 *     2 => [ { " ny  " : true } ]
1178 *     3 => { " Ne;[3 " : [ true , true ] , " th: " : true , " C[8} " : true }
1179 */</code></pre>
1180  <p>To avoid the combinatory explosion and to ensure the
1181  <strong>termination</strong> of the algorithm, we will use the following
1182  <strong>heuristic</strong> on the <strong>repetition</strong> operators:
1183  <code>*</code> will repeat <code>0</code>, <code>1</code> and
1184  <code>2</code> times, <code>+</code> will repeat <code>1</code> and
1185  <code>2</code> times, and finally <code>{<em>x</em>,<em>y</em>}</code> will
1186  repeat <code><em>x</em></code>, <code><em>x</em> + 1</code>,
1187  <code><em>y</em> - 1</code> and <code><em>y</em></code> times. This heuristic
1188  finds its origin in the <strong>limit</strong> testing.</p>
1189  <p>We notice in our example that 4 data are generated and are enough to
1190  <strong>cover</strong> entirely our grammar!</p>
1191
1192  <h4 id="Comparison_between_algorithms" for="main-toc">Comparison between
1193  algorithms</h4>
1194
1195  <p>Here are some <strong>clues</strong> to know when to use one algorithm
1196  instead of another.</p>
1197  <dl>
1198    <dt>Random uniform generation:</dt>
1199    <dd><ul>
1200      <li data-item="+">fast for small data, high data diversity and fixed
1201      size,</li>
1202      <li data-item="-">the pre-computation step is exponential and therefore
1203      long while, next, the generation is very fast.</li>
1204    </ul></dd>
1205    <dt>Bounded exhaustive generation:</dt>
1206    <dd><ul>
1207      <li data-item="+">fast for small data and exhaustivity is efficient,</li>
1208      <li data-item="-">exponential number of data.</li>
1209    </ul></dd>
1210    <dt>Grammar coverage-based generation:</dt>
1211    <dd><ul>
1212      <li data-item="+">fast for average or big data, and diversity of
1213      data,</li>
1214      <li data-item="-">do not consider size.</li>
1215    </ul></dd>
1216  </dl>
1217
1218  <h2 id="LL1_compiler-compiler" for="main-toc">LL(1) compiler-compiler</h2>
1219
1220  <p>The documentation of the LL(1) compiler, through the
1221  <code>Hoa\Compiler\Ll1</code> class, is not written yet. The goal is different
1222  regarding <code>Hoa\Compiler\Llk</code>: there is no grammar description
1223  language, automata are hard-coded with arrays and this type of compiler is
1224  high performance oriented. That's why its behavior is
1225  <strong>monolitic</strong>.</p>
1226
1227  <h2 id="Conclusion" for="main-toc">Conclusion</h2>
1228
1229  <p><code>Hoa\Compiler</code> provides two <strong>compiler-compilers</strong>:
1230  <code>Hoa\Compiler\Llk</code> and <code>Hoa\Compiler\Ll1</code>, in order to
1231  <strong>validate</strong> data. The <strong>PP language</strong> allows to
1232  write <strong>algebraic grammars</strong> in a <strong>simple</strong> and
1233  <strong>natural</strong> way. The LL(<em>k</em>) compiler-compiler is split
1234  in <strong>distinct processus</strong>, which encourages hackability. Finally,
1235  several algorithms allows to <strong>generate</strong> data based on a grammar
1236  according to specific usages.</p>
1237
1238</yield>
1239</overlay>
1240