Compilers allow us to analyze and manipulate textual data, and their use cases are numerous. Hoa\Compiler provides several kinds of compilers, based on your needs.

Introduction

What is conceived well is expressed clearly, and the right words come easily.

A language is a way to express or formulate a solution to a problem, and there are a lot of problems. We read and write in several languages daily, and some of these languages are understood by machines. This is made possible by compilers.

Formal language theory includes the automatic analysis of those languages through tools like automata or grammars. A detailed explanation would be necessary to fully understand all these concepts, but we are going to simplify as much as needed to allow a correct usage of Hoa\Compiler.

Language and grammar

A language is a set of words. Each word is a sequence of symbols belonging to an alphabet. A symbol represents the smallest lexical unit of a language; it is atomic and we will call it a lexeme (also often referred to as a token). These sequences of lexemes, representing words, are built with rules. From a word and a root rule, we will try to derive its sub-rules. If a derivation exists, then the word is considered valid; otherwise it is considered invalid. We also speak about matching words. For example, consider the following rules:

    exp ::= exp + exp
          | number
 number ::= digit number
          | digit
  digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The word that we are going to match is 7 + 35. The root rule is exp. If we derive (from left to right and from top to bottom), we can have exp + exp or number (the disjunction, i.e. the “or”, is represented by the “|” symbol):

exp + exp | number
→ exp + exp
→ ( exp + exp | number ) + exp
→ number + exp
→ ( digit number | digit ) + exp

We continue to derive until we have eliminated every rule, so that only lexemes remain:

…
→ ( digit number | digit ) + exp
→ digit + exp
→ ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) + exp
→ 7 + exp
→ 7 + ( exp + exp | number )
→ 7 + number
→ 7 + ( digit number | digit )
→ 7 + digit number
→ 7 + ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) number
→ 7 + 3 number
→ 7 + 3 ( digit number | digit )
→ 7 + 3 digit
→ 7 + 3 ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
→ 7 + 3 5

A derivation exists to match the word 7 + 35, thus it is a valid word for those rules.

A set of rules is called a grammar. And so, a grammar represents a language!

However, there are different kinds of grammars. In 1956, the Chomsky hierarchy was formulated, classifying grammars into four levels:

  1. unrestricted grammars, matching languages known as Turing languages; rules have no restrictions,
  2. context-sensitive grammars, matching contextual languages,
  3. context-free grammars, matching algebraic languages, based on pushdown automata,
  4. regular grammars, matching regular languages.

Each level includes the next one. Hoa\Compiler only handles languages defined by grammars of levels 3 and 4. To give a quick idea, regular grammars correspond to regular expressions (like PCRE), well known to developers. But regular grammars are not able to match paired symbols (like parentheses, braces or quotes), whereas algebraic grammars allow that (thanks to the concept of a stack of lexemes).
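
For example, arbitrarily nested, balanced parentheses can be described by the following algebraic rule (in the same notation as above), while no regular expression can express this unbounded nesting:

group ::= ( group ) | number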

Matching words

In general, the compilation process begins with two analyses: lexical and syntactic. A lexical analysis splits a word into a sequence of lexemes. This sequence will thereafter be used by the syntactic analyzer in order to verify that the word belongs to the language.

Depending on the grammar, the matching will not be done in the same way, but the principle stays the same: consuming the lexemes one after the other in the sequence and verifying that they allow going forward in the derivation of the grammar's rules.

Syntactic analyses are also classified into categories: LL, LR, LALR etc. Hoa\Compiler only offers LL syntactic analyzers, where LL stands for Left-to-right Leftmost derivation, i.e. from the topmost rule to the deepest, with rules derived from left to right. Here again, there are sub-categories, two of which are supported by Hoa\Compiler: LL(1) and LL(*). Globally, we speak about LL(k) syntactic analyzers: if a lexeme does not allow a rule to be derived as expected, then the analyzer is allowed to go k steps backward; we also speak about backtracking. Put another way, rules can be ambiguous: each time we derive a rule of the grammar, there may be several possible choices and the analyzer can be wrong; this is why it sometimes needs to backtrack. The k variable defines the ambiguity level. If a grammar can be analyzed by an LL(1) syntactic analyzer, it is said to be unambiguous: each time a lexeme is used to derive our rules, there is only one choice. And if we have an LL(*) syntactic analyzer, it means that the k variable is unbounded. The following example illustrates an unambiguous grammar:

rule ::= a b c | d e f

And this example illustrates an ambiguous grammar:

rule1 ::= a rule2
rule2 ::= b rule3 | b rule4
rule3 ::= c d
rule4 ::= e f

Let's see what happens when trying to find a derivation for the word abef from the root rule rule1:

rule1
→ a rule2                      a  bef ✔
  → a (b rule3 | b rule4)      a  bef
    → a b rule3               ab  ef  ✔
      → a b c d              abe  f   ✖
    ← a b rule3               ab  ef  ✖
  ← a (b rule3 | b rule4)      a  bef
    → a b rule4               ab  ef  ✔
      → a b e f             abef      ✔

The rule2 rule is ambiguous, which can lead to a bad derivation and consequently a backtrack.

LL(k) compiler-compiler

Writing compilers is a laborious task. It is not always difficult, but most of the time it is repetitive and time-consuming. This is why there exist compilers of compilers, in other words compiler generators. Most of the time, these compiler-compilers use an intermediary language to write a grammar. The Hoa\Compiler library provides the Hoa\Compiler\Llk\Llk class, which allows the writing of compilers through a dedicated language.

PP language

The PP language, standing for PHP Parser, allows us to express algebraic grammars. It is written in files with the .pp extension (please see the hoa://Library/Compiler/.Mime file).

A grammar is composed of lexemes and rules. A lexeme is declared with the following syntax: %token namespace_in:name value -> namespace_out, where name represents the name of the lexeme, value represents its value, expressed in the PCRE format (take care not to match an empty value, in which case an exception will be thrown), and namespace_in and namespace_out represent names of namespaces and are optional (the default name is default). For example, number, describing a number composed of digits from 0 to 9:

%token number \d+

Namespaces represent disjoint subsets of lexemes, used to ease analyses. A %skip declaration is similar to %token except that it represents a lexeme to skip, i.e. not to consider. A common example of %skip lexemes are spaces:

%skip space \s

To explain rules, we will use the Json.pp grammar as an example, which is a slightly simplified version of the JSON language (please see RFC 4627). The complete grammar is located in the hoa://Library/Json/Grammar.pp file. Thus:

%skip   space          \s
// Scalars.
%token  true           true
%token  false          false
%token  null           null
// Strings.
%token  quote_         "        -> string
%token  string:string  [^"]+
%token  string:_quote  "        -> default
// Objects.
%token  brace_         {
%token _brace          }
// Arrays.
%token  bracket_       \[
%token _bracket        \]
// Rest.
%token  colon          :
%token  comma          ,
%token  number         \d+

value:
    <true> | <false> | <null> | string() | object() | array() | number()

string:
    ::quote_:: <string> ::_quote::

number:
    <number>

#object:
    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::

#pair:
    string() ::colon:: value()

#array:
    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::

Notice that we have two namespaces of lexemes: default and string (this avoids confusion between the quote_ and string:_quote lexemes, which have the same representation). We also notice the value rule, which is a disjunction of several lexemes and rules. The constructions of the PP language are the following:

  • rule() to call a rule,
  • <token> and ::token:: to declare a lexeme (namespaces do not appear here),
  • | for a disjunction (a choice),
  • () for a group,
  • e? to say that e is optional,
  • e+ to say that e can be present 1 or more times,
  • e* to say that e can be present 0 or more times,
  • e{x,y} to say that e can be present between x and y times,
  • #node to create a node named node in the resulting tree,
  • token[i] to unify lexemes with each other.

These constructions are few in number, but they are largely sufficient.
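
As an illustration only (this rule is not part of Json.pp), here is a hypothetical sketch combining several of these constructions: an optional minus sign, then between one and three comma-separated numbers, collected under a list node:

%token  minus   -
%token  comma   ,
%token  number  \d+

#list:
    <minus>? <number> ( ::comma:: <number> ){0,2}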

Finally, the grammar of the PP language is written… with the PP language! You can find it in the hoa://Library/Compiler/Llk/Llk.pp file.

Compilation process

The compilation process used by Hoa\Compiler\Llk\Llk is classical: it starts by lexically analyzing the textual data, the word, i.e. transforming our data into a sequence of lexemes. The declaration order of the lexemes is essential, because the lexical analyzer will try them one after the other. Next, the syntactic analyzer comes into play in order to match our data.

If the syntactic analysis is successful, we obtain a trace. This trace can be transformed into an AST, which stands for Abstract Syntax Tree. This tree represents our textual data after the analysis. One advantage is that it can be visited (we will detail this part later), which allows us to add new constraints that cannot be expressed in the grammar, such as type verification.

Let's manipulate Hoa\Compiler\Llk\Llk a little bit. This class is a helper to easily read a grammar expressed in the PP format. It takes a single argument, an input stream that points to the grammar, and returns a Hoa\Compiler\Llk\Parser compiler ready to use. On this compiler, we will call the Hoa\Compiler\Llk\Parser::parse method to analyze JSON data. If the data is correct, we will get an AST, otherwise an exception will be thrown. Finally, we will use the Hoa\Compiler\Visitor\Dump visitor to print our AST:

// 1. Load grammar.
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 2. Parse a data.
$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');

// 3. Dump the AST.
$dump     = new Hoa\Compiler\Visitor\Dump();
echo $dump->visit($ast);

/**
 * Will output:
 *     >  #object
 *     >  >  #pair
 *     >  >  >  token(string:string, foo)
 *     >  >  >  token(true, true)
 *     >  >  #pair
 *     >  >  >  token(string:string, bar)
 *     >  >  >  #array
 *     >  >  >  >  token(null, null)
 *     >  >  >  >  token(number, 42)
 */

When we write and test a grammar, we repeat these three tasks very often. This is why the hoa script provides the compiler:pp command. This command analyzes data based on a grammar and, if needed, applies a visitor on the resulting AST. Notice that the grammar can be read through a pipe:

$ echo '[1, [1, [2, 3], 5], 8]' | hoa compiler:pp Json.pp 0 --visitor dump
>  #array
>  >  token(number, 1)
>  >  #array
>  >  >  token(number, 1)
>  >  >  #array
>  >  >  >  token(number, 2)
>  >  >  >  token(number, 3)
>  >  >  token(number, 5)
>  >  token(number, 8)

This is a useful way to quickly test data against a grammar. Do not hesitate to read the help of the compiler:pp command for more details!

The analysis starts with the root rule, but we can use another rule thanks to the second argument of the Hoa\Compiler\Llk\Parser::parse method:

$compiler->parse('{"foo": true, "bar": [null, 42]}', 'object');

To use the root rule, we pass null.
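
For example, the following two calls are therefore equivalent (a small sketch):

$compiler->parse('{"foo": true}');        // Uses the root rule.
$compiler->parse('{"foo": true}', null);  // Same: null selects the root rule.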

And finally, to know whether data is valid without generating the AST, we can set the last argument of this method to false:

$valid = $compiler->parse('{"foo": true, "bar": [null, 42]}', null, false);
var_dump($valid);

/**
 * Will output:
 *     bool(true)
 */

Errors

The compilation errors are represented by exceptions, thus:

$ echo '{"foo" true}' | hoa compiler:pp Json.pp 0 --visitor dump
Uncaught exception (Hoa\Compiler\Exception\UnexpectedToken):
Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "true" (true) at line 1
and column 8:
{"foo" true}
       ↑
in hoa://Library/Compiler/Llk/Parser.php at line 1

Several exceptions can be thrown according to the context:

  • during the lexical analysis, Hoa\Compiler\Exception\UnrecognizedToken when a lexeme is not recognized, i.e. when the textual data cannot be split into a sequence of lexemes, and Hoa\Compiler\Exception\Lexer when other, more general errors happen, for example when a lexeme matches an empty value,
  • during the syntactic analysis, Hoa\Compiler\Exception\UnexpectedToken when a lexeme is not expected, i.e. when it cannot be used to derive the rules of the grammar.

The parent exception is Hoa\Compiler\Exception\Exception.
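
In PHP code, these exceptions can be caught like any others. A minimal sketch, reusing the $compiler from above:

try {
    $compiler->parse('{"foo" true}');
} catch (Hoa\Compiler\Exception\UnexpectedToken $e) {
    // Syntactic error: a lexeme was not expected at this position.
    echo 'Syntax error: ', $e->getMessage(), "\n";
} catch (Hoa\Compiler\Exception\Exception $e) {
    // Any other compilation error, lexical errors included.
    echo 'Compilation error: ', $e->getMessage(), "\n";
}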

Nodes

The compilation process very often ends with the production of an AST. It is important to control its form, its size, the data it holds etc. This is why it is necessary to understand the #node notation, which allows the creation of nodes in the AST. First, lexemes declared with the <token> syntax will appear in the tree, while the other lexemes, declared with the ::token:: syntax, will not. Indeed, in our last example, the quote_, brace_, colon, comma etc. lexemes do not appear. Next, it is important to notice that the declaration of nodes can be overwritten within the same rule. Finally, a rule name can be preceded by #, just like the #array rule, which defines a default node; it can however be overwritten. For example, if we replace the array rule by:

#array:
    ::bracket_:: value() ( ::comma:: value() #bigarray )* ::_bracket::

If the array contains a single value, the node will be named #array; otherwise it will be named #bigarray. Let's see:

$ echo '[42]' | hoa compiler:pp Json.pp 0 --visitor dump
>  #array
>  >  token(number, 42)
$ echo '[4, 2]' | hoa compiler:pp Json.pp 0 --visitor dump
>  #bigarray
>  >  token(number, 4)
>  >  token(number, 2)

Of course, a node can be created or not depending on the derivation of a rule. The mechanism is normally fairly intuitive.

Namespaces

Let's detail a little bit the behavior of the lexical analyzer with regard to namespaces.

Lexemes are put in namespaces, which means that only the lexemes of the current namespace are used by the lexical analyzer. The default namespace is default. To declare the namespace of a lexeme, we use the : operator. When a lexeme is consumed, it can change the current namespace for the rest of the lexical analysis, thanks to the -> operator. Thus, we will declare five lexemes: foo and bar in the default namespace, baz in ns1, and qux and foo in ns2. The fact that foo is declared twice is not a problem, because the two declarations live in different namespaces:

%token      foo   fo+     -> ns1
%token      bar   ba?r+   -> ns2
%token  ns1:baz   ba?z+   -> default
%token  ns2:qux   qux+
%token  ns2:foo   FOO     -> ns1

Writing default:bar is strictly equivalent to bar. The foo lexeme sends us into ns1, bar into ns2, ns1:baz back into default, ns2:qux stays in ns2 and ns2:foo sends us into ns1. Let's observe the sequence of lexemes produced by the lexical analyzer for the following data: fooooobzzbarrrquxFOObaz:

$pp = '%token      foo   fo+     -> ns1'     . "\n" .
      '%token      bar   ba?r+   -> ns2'     . "\n" .
      '%token  ns1:baz   ba?z+   -> default' . "\n" .
      '%token  ns2:qux   qux+'               . "\n" .
      '%token  ns2:foo   FOO     -> ns1';

// 1. Parse PP.
Hoa\Compiler\Llk\Llk::parsePP($pp, $tokens, $rawRules);

// 2. Run the lexical analyzer.
$lexer    = new Hoa\Compiler\Llk\Lexer();
$sequence = $lexer->lexMe('fooooobzzbarrrquxFOObaz', $tokens);

// 3. Pretty-print the result.
$format   = '%' . (strlen((string) count($sequence)) + 1) . 's  ' .
            '%-13s %-20s  %-20s  %6s' . "\n";
$header   = sprintf($format, '#', 'namespace', 'token name', 'token value', 'offset');

echo $header, str_repeat('-', strlen($header)), "\n";

foreach ($sequence as $i => $token) {
    printf(
        $format,
        $i,
        $token['namespace'],
        $token['token'],
        $token['value'],
        $token['offset']
    );
}

/**
 * Will output:
 *      #  namespace     token name            token value           offset
 *     ---------------------------------------------------------------------
 *      0  default       foo                   fooooo                     0
 *      1  ns1           baz                   bzz                        6
 *      2  default       bar                   barrr                      9
 *      3  ns2           qux                   qux                       14
 *      4  ns2           foo                   FOO                       17
 *      5  ns1           baz                   baz                       20
 *      6  default       EOF                   EOF                       23
 */

We read the lexemes, their namespace and their value in the table. The data has been successfully split, and we have jumped from one namespace to the other. If this time we try with the foqux data, we will get an error: fo is matched by the foo lexeme in the default namespace, we then jump to the ns1 namespace, and there, no lexeme is able to recognize even q. An error is thrown.

So far, we have seen how to jump from one namespace to another with the -> operator. No history of the used namespaces is stored. However, in some rare cases, a lexeme may be accessible through several namespaces and we would like a lexeme to jump back to the previous namespace. Put another way, we would like a history of the used namespaces and the ability to navigate in it (backward only). This is the role of the __shift__ * n keyword, used in conjunction with the -> operator as usual. __shift__ means: jump back to the previous namespace. __shift__ is equivalent to __shift__ * 1, and __shift__ * n means: jump back n times.

When the __shift__ keyword appears in the grammar, namespaces are managed like a stack, hence the shift vocabulary. Take care to shift namespaces regularly in order to avoid a huge stack.

Let's take the example of the following grammar:

%token       foo1  a    -> ns1
%token       foo2  x    -> ns2
%token       end   e

%token   ns1:bar   b
%token   ns1:baz   c    -> ns3
%token   ns1:tada  t    -> __shift__

%token   ns2:bar   y
%token   ns2:baz   z    -> ns3
%token   ns2:tada  T    -> __shift__

%token   ns3:qux   =   -> __shift__

#test:
    ( <foo1> | <foo2> ) <bar> <baz> <qux> <tada> <end>

This grammar recognizes two words: abc=te and xyz=Te. If the first lexeme foo1 is matched, the lexical analyzer jumps to the ns1 namespace. When it matches the ns1:baz lexeme, it jumps into ns3. Then, when it matches ns3:qux, it jumps back to the previous namespace, namely ns1. And so on. Now, if it does not match foo1 but foo2, it jumps to the ns2 namespace, then into ns3 and then into ns2 again. The __shift__ keywords in ns1:tada and ns2:tada are only used to shift namespaces, but no ambiguity exists: these namespaces are accessible from only one namespace, namely default.

Now, let's save this grammar in the NamespaceStack.pp file and use the -s/--token-sequence option of the hoa compiler:pp command:

$ echo -n 'abc=te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
 #  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
 0  default       foo1                  a                                    0
 1  ns1           bar                   b                                    1
 2  ns1           baz                   c                                    2
 3  ns3           qux                   =                                    3
 4  ns1           tada                  t                                    4
 5  default       end                   e                                    5
 6  default       EOF                   EOF                                  6

$ echo -n 'xyz=Te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
 #  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
 0  default       foo2                  x                                    0
 1  ns2           bar                   y                                    1
 2  ns2           baz                   z                                    2
 3  ns3           qux                   =                                    3
 4  ns2           tada                  T                                    4
 5  default       end                   e                                    5
 6  default       EOF                   EOF                                  6

We see that the lexical analyzer has successfully jumped between all namespaces, as expected. We had two ways to access the ns3 namespace: either from ns1 or from ns2. The analyzer has succeeded in creating a history of namespaces and navigating through it.

Unification

A new feature brought by the PP language, compared to other grammar languages, is the capability to express the unification of lexemes. Let's imagine the following grammar:

%token  quote   '|"
%token  string  \w+

rule:
    ::quote:: <string> ::quote::

The quotes surrounding the string can be of two kinds: single, with the “'” symbol, or double, with the “"” symbol. Thus, the "foo" and 'foo' data are valid, but so are "foo' and 'foo"!

The unification of lexemes adds an additional constraint on the value of lexemes at runtime. The syntax is the following: token[i]. The value of i indicates which lexemes must have the same value. And finally, the unification is local to an instance of a rule, meaning that there is no unification between lexemes of different rules and that the unification only applies within the current rule. Thus, the example becomes:

rule:
    ::quote[0]:: <string> ::quote[0]::

Which invalidates the "foo' and 'foo" data.

Unification has numerous applications, such as matching the names of opening and closing tags in XML.
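
As a sketch of that idea, here is a hypothetical, heavily simplified grammar for nested tags (a real XML grammar would need much more: text content, attributes, namespaces etc.), where the opening and closing names are unified:

%token  lt     <
%token  gt     >
%token  slash  /
%token  name   \w+

#tag:
    ::lt:: <name[0]> ::gt:: tag()* ::lt:: ::slash:: ::name[0]:: ::gt::

Since unification is local to each instance of the tag rule, <foo><bar></bar></foo> would be matched, while <foo></bar> would be rejected: the two name lexemes do not unify.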

Abstract Syntax Tree

An abstract syntax tree represents textual data in a structural form. Each node of this tree is represented by the Hoa\Compiler\Llk\TreeNode class. Among the most useful methods, we find the following (a short sketch follows the list):

  • getId to get the identifier of the node,
  • getValueToken to get the name of the lexeme,
  • getValueValue to get the value of the lexeme,
  • isToken to know whether the node represents a lexeme,
  • getChild($i) to get the $i-th child of the node,
  • getChildren to get all the children,
  • getChildrenNumber to count the number of children,
  • getParent to get the parent node,
  • getData to get a reference to a data array,
  • accept to apply the visitor.
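
As a quick illustration, here is a small sketch inspecting a freshly parsed AST with some of these methods (assuming the Json.pp compiler loaded earlier; the expected outputs are given as comments):

$ast = $compiler->parse('[1, 2]');

echo $ast->getId(), "\n";              // #array
echo $ast->getChildrenNumber(), "\n";  // 2

$first = $ast->getChild(0);
echo $first->isToken() ? 'lexeme' : 'node', "\n";  // lexeme
echo $first->getValueToken(), "\n";    // number
echo $first->getValueValue(), "\n";    // 1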

Visitors are the simplest way to traverse an AST. As an example, let's write the PrettyPrinter visitor, which will rewrite JSON data with our own conventions (spacing, line breaking etc.). A visitor must implement the Hoa\Visitor\Visit interface and has only one method to implement: visit, with three arguments: the element and two optional arguments (by reference and by copy respectively). Let's see:

class PrettyPrinter implements Hoa\Visitor\Visit
{
    public function visit (
        Hoa\Visitor\Element $element,
        &$handle = null,
        $eldnah = null
    ) {
        static $i      = 0;
        static $indent = '    ';

        // One behaviour per node in the AST.
        switch ($element->getId()) {
            // Object: { … }.
            case '#object':
                echo '{', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 < $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), '}';

                break;

            // Array: [ … ].
            case '#array':
                echo '[', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 < $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), ']';

                break;

            // Pair: "…": ….
            case '#pair':
                $element->getChild(0)->accept($this, $handle, $eldnah);
                echo ': ';
                $element->getChild(1)->accept($this, $handle, $eldnah);

                break;

            // Many tokens.
            case 'token':
                switch ($element->getValueToken()) {
                    // String: "…".
                    case 'string':
                        echo '"', $element->getValueValue(), '"';

                        break;

                    // Booleans.
                    case 'true':
                    case 'false':

                    // Null.
                    case 'null':

                    // Number.
                    case 'number':
                        echo $element->getValueValue();

                        break;
                }

                break;
        }
    }
}

Let's see an example:

$compiler    = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast         = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$prettyPrint = new PrettyPrinter();
echo $prettyPrint->visit($ast);

/**
 * Will output:
 *     {
 *         "foo": true,
 *         "bar": [
 *             null,
 *             42
 *         ]
 *     }
 */

The getData method is very useful to store data that is likely to be reused, for example from one visitor to another. This method returns a reference to an array; thus:

$data = $element->getData();

if (!isset($data['previousComputing'])) {
    throw new Exception('Need a previous computing.', 0);
}

$previous = $data['previousComputing'];

It is common to use one visitor per constraint: verification of symbols, verification of types etc. Some of them can set data for the next ones. The getData method does not require any data structure, it only provides an access to an array. Feel free to do what you want with it.
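
For instance, an earlier visitor could have stored that value as follows. A sketch only: note the assignment by reference, required to mutate the shared array, and the hypothetical $previousResult variable:

$data = &$element->getData();
$data['previousComputing'] = $previousResult;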

Using Hoa\Compiler\Llk\TreeNode is straightforward, and most of the time we will use it through a visitor.

Traces

The LL(k) compiler provided by Hoa is clearly composed of layers. Each layer can be used with maximal modularity and extensibility. When the grammar has been translated into “machine rules” and the lexical and syntactic analyzers have validated data, a trace is computed. This trace is a flat array containing only three kinds of classes:

  • Hoa\Compiler\Llk\Rule\Entry when the syntactic analyzer has opened a rule,
  • Hoa\Compiler\Llk\Rule\Ekzit when the syntactic analyzer has closed a rule,
  • Hoa\Compiler\Llk\Rule\Token when the syntactic analyzer has encountered a lexeme.

We can get the trace with the Hoa\Compiler\Llk\Parser::getTrace method. To understand this trace, we will start with an example:

$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$i        = 0;

foreach ($compiler->getTrace() as $element) {
    if ($element instanceof Hoa\Compiler\Llk\Rule\Entry) {
        echo str_repeat('>   ', ++$i), 'enter ', $element->getRule(), "\n";
    } elseif ($element instanceof Hoa\Compiler\Llk\Rule\Token) {
        echo
            str_repeat('    ', $i + 1), 'token ', $element->getTokenName(),
            ', consumed ', $element->getValue(), "\n";
    } else {
        echo str_repeat('<   ', $i--), 'ekzit ', $element->getRule(), "\n";
    }
}

/**
 * Will output:
 *     >   enter value
 *     >   >   enter object
 *                 token brace_, consumed {
 *     >   >   >   enter pair
 *     >   >   >   >   enter string
 *                         token quote_, consumed "
 *                         token string, consumed foo
 *                         token _quote, consumed "
 *     <   <   <   <   ekzit string
 *                     token colon, consumed :
 *     >   >   >   >   enter value
 *                         token true, consumed true
 *     <   <   <   <   ekzit value
 *     <   <   <   ekzit pair
 *     >   >   >   enter 13
 *     >   >   >   >   enter 12
 *                         token comma, consumed ,
 *     >   >   >   >   >   enter pair
 *     >   >   >   >   >   >   enter string
 *                                 token quote_, consumed "
 *                                 token string, consumed bar
 *                                 token _quote, consumed "
 *     <   <   <   <   <   <   ekzit string
 *                             token colon, consumed :
 *     >   >   >   >   >   >   enter value
 *     >   >   >   >   >   >   >   enter array
 *                                     token bracket_, consumed [
 *     >   >   >   >   >   >   >   >   enter value
 *                                         token null, consumed null
 *     <   <   <   <   <   <   <   <   ekzit value
 *     >   >   >   >   >   >   >   >   enter 21
 *     >   >   >   >   >   >   >   >   >   enter 20
 *                                             token comma, consumed ,
 *     >   >   >   >   >   >   >   >   >   >   enter value
 *                                                 token number, consumed 42
 *     <   <   <   <   <   <   <   <   <   <   ekzit value
 *     <   <   <   <   <   <   <   <   <   ekzit 20
 *     <   <   <   <   <   <   <   <   ekzit 21
 *                                     token _bracket, consumed ]
 *     <   <   <   <   <   <   <   ekzit array
 *     <   <   <   <   <   <   ekzit value
 *     <   <   <   <   <   ekzit pair
 *     <   <   <   <   ekzit 12
 *     <   <   <   ekzit 13
 *                 token _brace, consumed }
 *     <   <   ekzit object
 *     <   ekzit value
 */

This example reveals several things. First, the information given by the trace can be useful: if we are on a rule, we have its name (with the getRule method), and if we are on a lexeme, we have its name (with the getTokenName method), its representation (expressed in the PCRE format, with the getRepresentation method), its value (with the getValue method), whether it is a node to keep in the AST (with the isKept method) and its unification index if it exists (with the getUnificationIndex method). Of course, all this information can be edited, which can influence the construction of the AST, the (optional) next step.
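
For instance, still iterating over the same trace, we can inspect the lexemes a little further. A small sketch using the methods just listed:

foreach ($compiler->getTrace() as $element) {
    if ($element instanceof Hoa\Compiler\Llk\Rule\Token) {
        echo
            $element->getTokenName(), ' (',
            $element->getRepresentation(), ') is ',
            $element->isKept() ? 'kept' : 'skipped', "\n";
    }
}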

Next, we notice that sometimes we enter a rule that exists in the grammar, like object, pair, value etc., and sometimes we enter a numerical rule, like 13, 12, 21, 20 etc. When the grammar is interpreted in order to be translated into “machine rules”, each of these rules is linearized based on the PP operators:

  • Hoa\Compiler\Llk\Rule\Choice for a disjunction,
  • Hoa\Compiler\Llk\Rule\Concatenation for a concatenation,
  • Hoa\Compiler\Llk\Rule\Repetition for a repetition,
  • Hoa\Compiler\Llk\Rule\Token for a lexeme (as seen above).

All these rules in this format are accessible through the Hoa\Compiler\Llk\Parser::getRules method, represented by an array. We will print all the rules accessible from the root rule for a better understanding:

$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 1. Get all rules.
$rules    = $compiler->getRules();

// 2. Start with the root rule.
$stack    = [$compiler->getRootRule() => true];

while (false !== current($stack)) {
    $rule = key($stack);
    next($stack);
    echo
        "\n", '"', $rule, '" is a ',
        strtolower(substr(get_class($rules[$rule]), 22));

    $subrules = $rules[$rule]->getContent();

    // 3a. Token.
    if (null === $subrules) {
        continue;
    }

    echo ' of rules: ';

    // 3b. Other rules.
    foreach ((array) $rules[$rule]->getContent() as $subrule) {
        if (!array_key_exists($subrule, $stack)) {
            // 4. Enqueue new rules to be printed.
            $stack[$subrule] = true;
        }

        echo $subrule, ' ';
    }
}

/**
 * Will output:
 *     "value" is a choice of rules: 1 2 3 string object array number
 *     "1" is a token
 *     "2" is a token
 *     "3" is a token
 *     "string" is a concatenation of rules: 5 6 7
 *     "object" is a concatenation of rules: 10 pair 13 14
 *     "array" is a concatenation of rules: 18 value 21 22
 *     "number" is a token
 *     "5" is a token
 *     "6" is a token
 *     "7" is a token
 *     "10" is a token
 *     "pair" is a concatenation of rules: string 16 value
 *     "13" is a repetition of rules: 12
 *     "14" is a token
 *     "18" is a token
 *     "21" is a repetition of rules: 20
 *     "22" is a token
 *     "16" is a token
 *     "12" is a concatenation of rules: 11 pair
 *     "20" is a concatenation of rules: 19 value
 *     "11" is a token
 *     "19" is a token
 */

If we read the object rule, we know that it is a concatenation of the 10, pair, 13 and 14 rules. 10 is a lexeme, pair is a concatenation of the string, 16 and value rules, and so on. The initial grammar is translated so as to be in its smallest form. This allows us to reason about rules in an easier and faster way. Indeed, computations on the grammar are not restricted to the AST: in the previous step, with the trace, we were already able to apply computations.

Generation

A grammar can be used to satisfy two goals: to validate data (if we see it as an automaton) or to generate data. So far, we have seen how to validate data through several steps: initialization of our compiler, running the lexical and syntactic analyzers, producing a trace, transforming this trace into an AST in order to visit that tree. But we can stop at the first step, the initialization of our compiler, and exploit the rules in order to generate data that is valid with regard to our grammar.

Hoa\Compiler\Llk\Sampler provides three data generation algorithms:

  • random uniform generation,
  • bounded exhaustive generation,
  • grammar coverage-based generation.

Why provide three algorithms? Because it is illusory to think that a single algorithm can satisfy numerous usage contexts. Each one fits different needs, as we will explain later.

Generating a data based on a grammar requires two steps:

  1. generating values for all lexemes in order to get raw data,
  2. generating a path in the rules of the grammar.

A path is equivalent to a derivation; the vocabulary differs according to our goal: validation or generation.

Isotropic generation of lexemes

In order to generate values for lexemes, no matter what algorithm is used, it has to be fast. We are going to use a process called isotropic. We start with a rule and move forward only by using random and locally uniform (only for this choice) choices. For instance, if we have a disjunction between three sub-rules, we will randomly and uniformly choose between 1 and 3. If we have a concatenation, we will just concatenate each part. And finally, a repetition is nothing more than a disjunction of concatenations: indeed, e{1,3} is strictly equivalent to e | ee | eee. Let's see:

([ae]+|[x-z]!){1,3}              repeat [ae]+|[x-z]! 2 times
→ ([ae]+|[x-z]!)([ae]+|[x-z]!)  choose between [ae]+ and [x-z]!
→ ([ae]+)([ae]+|[x-z]!)         repeat [ae] 2 times
→ [ae][ae]([ae]+|[x-z]!)        choose between a and e
→ e[ae]([ae]+|[x-z]!)           again
→ ea([ae]+|[x-z]!)              choose between [ae]+ and [x-z]!
→ ea([x-z]!)                    choose between x, y and z
→ eay!

This generation is naive, but that is not important. What matters is the path generation in the rules, or put another way, the generation of sequences of lexemes.

Random uniform generation

The first algorithm is random and uniform generation. It is useful if we would like to generate sequences of lexemes of a given size n with uniformity, not local (like in the isotropic generation), but over the set of all possible sequences. Succinctly, the algorithm works in two steps: pre-computation (once per size) followed by generation. The pre-computation is an automatic step that computes the number of all possible sequences and sub-sequences of size n. This step helps to compute cumulative distribution functions in order to guide the final sequences of lexemes.

We will generate 10 random data of size 7, i.e. composed of 7 lexemes. For that, we will use the Hoa\Compiler\Llk\Sampler\Uniform class, which takes our grammar as its first argument, the lexeme value generator as its second argument (here Hoa\Regex\Visitor\Isotropic) and finally a size:

$sampler = new Hoa\Compiler\Llk\Sampler\Uniform(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

for ($i = 0; $i < 10; ++$i) {
    echo $i, ' => ', $sampler->uniform(), "\n";
}

/**
 * Will output:
 *     0 => [ false , null , null ]
 *     1 => [ " l " , null ]
 *     2 => [ [ true ] , true ]
 *     3 => [ [ [ 4435 ] ] ]
 *     4 => [ [ [ 9366 ] ] ]
 *     5 => [ true , false , null ]
 *     6 => { " |h<# " : false }
 *     7 => [ [ [ false ] ] ]
 *     8 => [ false , true , 7 ]
 *     9 => [ false , 5 , 79 ]
 */

We can redefine the size with the Hoa\Compiler\Llk\Sampler\Uniform::setLength method. Notice that some repetition operators have no upper bound, like + or *; in this case, we bound them to n.
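
For instance, reusing the sampler created above (a small sketch):

$sampler->setLength(12);
echo $sampler->uniform(), "\n"; // Data composed of 12 lexemes.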

Bounded exhaustive generation

The second algorithm is bounded exhaustive generation. This algorithm generates all sequences of lexemes (for the exhaustive side) from size 1 to n (for the bounded side). An interesting aspect of this algorithm is that the exhaustivity is a kind of proof! The algorithm behaves like an iterator and is very simple to use:

$sampler = new Hoa\Compiler\Llk\Sampler\BoundedExhaustive(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
 * Will output:
 *     0 => true
 *     1 => false
 *     2 => null
 *     3 => " 8u2 "
 *     4 => { " ,M@aj{ " : true }
 *     5 => { " x`|V " : false }
 *     6 => { " NttB " : null }
 *     7 => { " eJWwA " : 0 }
 *     8 => [ true ]
 *     9 => [ true , true ]
 *     10 => [ true , true , true ]
 *     11 => [ true , true , false ]
 *     12 => [ true , true , null ]
 *     13 => [ true , true , 32 ]
 *     14 => [ true , false ]
 *     15 => [ true , false , true ]
 *     16 => [ true , false , false ]
 *     17 => [ true , false , null ]
 *     18 => [ true , false , 729 ]
 *     19 => [ true , null ]
 *     20 => [ true , null , true ]
 *     21 => [ true , null , false ]
 *     …
 *     157 => [ 780 , 01559 , 32 ]
 *     158 => 344
 */

Like the previous algorithm, we can redefine the upper bound with the Hoa\Compiler\Llk\Sampler\BoundedExhaustive::setLength method. And the repetition operators with no upper bound are bounded to n.

Grammar coverage-based generation

The last algorithm is grammar coverage-based generation. This algorithm reduces the combinatorial explosion encountered with the previous algorithm, but its goal is different: it will generate sequences of lexemes that “activate” all the branches of the grammar rules. A rule is said to be covered if and only if all its sub-rules are covered, and a lexeme is said to be covered if it has been used in a sequence. To ensure diversity in the produced sequences, choice points between uncovered sub-rules are resolved randomly. The algorithm also behaves like an iterator:

$sampler = new Hoa\Compiler\Llk\Sampler\Coverage(
    // Grammar.
    Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp')),
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random())
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
 * Will output:
 *     0 => true
 *     1 => { " )o?bz " : null , " %3W) " : [ false , 130 , " 6 " ] }
 *     2 => [ { " ny  " : true } ]
 *     3 => { " Ne;[3 " : [ true , true ] , " th: " : true , " C[8} " : true }
 */

To avoid the combinatorial explosion and to ensure the termination of the algorithm, we use the following heuristic on the repetition operators: * repeats 0, 1 and 2 times, + repeats 1 and 2 times, and finally {x,y} repeats x, x + 1, y - 1 and y times. This heuristic finds its origin in boundary testing.

We notice in our example that 4 data are generated, and they are enough to cover our grammar entirely!

Comparison between algorithms

Here are some clues to know when to use one algorithm instead of another.

Random uniform generation:
  • fast for small data, high data diversity and fixed size,
  • the pre-computation step is exponential and therefore long, but the subsequent generation is very fast.
Bounded exhaustive generation:
  • fast for small data, and the exhaustivity is effective,
  • exponential number of data.
Grammar coverage-based generation:
  • fast for medium or big data, and good data diversity,
  • does not take size into account.

LL(1) compiler-compiler

The documentation of the LL(1) compiler, through the Hoa\Compiler\Ll1 class, is not written yet. Its goal is different from Hoa\Compiler\Llk: there is no grammar description language, automata are hard-coded with arrays, and this type of compiler is oriented toward high performance. That's why its behavior is monolithic.

Conclusion

Hoa\Compiler provides two compiler-compilers: Hoa\Compiler\Llk and Hoa\Compiler\Ll1, in order to validate data. The PP language allows writing algebraic grammars in a simple and natural way. The LL(k) compiler-compiler is split into distinct processes, which encourages hackability. Finally, several algorithms allow generating data based on a grammar, according to specific usages.