Skip to content

Syntactic Specification

The syntactic specification defines the structure of your language in a dialect of BNF. Each production rule maps a nonterminal to a sequence of symbols on its right-hand side.

Here is an example that will be referenced in the sections below.

# Standard production rules
<Monologue> ::= <Greeting> <Words>
<Greeting> ::= <Salutation> COMMA <NAME>

# Alternative productions for `Salutation`.
<Salutation:Hi> ::= HI
<Salutation:Hello> ::= HELLO

# Matche zero or more `WORD`s.
<Words> **= <WORD>

Names

Kind Format Examples
Nonterminal (class name) PascalCase, angle-bracketed <Expr>, <Program>
Terminal (token name) UPPER_CASE PLUS, NUM
Field name (captured symbol) camelCase expr, num, rest
Subclass name PascalCase AddExpr, NilRest

Literals such as "+" and "," are not supported. Instead, define the literal as a token in the lexical section and refer to it here.

Start Symbol

The first rule defines the start symbol.

Monologue is the start symbol in the Monologue example above.

Matching nothing

It's also possible for a production to produce nothing (A.K.A. epsilon):

<OptElse:HasElse> ::= ELSE <Stmt>
<OptElse:NoElse>  ::=

Generated class/object structure

Each production rule defines the structure for the nonterminal named on the left-hand side, and generates a class. Returning to our Monologue example.

Monologuegreeting: Greetingwords: WordsGreetingsalutation: Salutationname: TokenSalutationHiHelloWordswordList: List<Token>

An the right-hand side, anything written in angle brackets becomes a field. If you don't specify a field name, plcc-ng generates one.

Capturing terminals

Wrap a terminal in angle brackets to capture it. The name of a field is the lower-cased name of the terminal. You may provide a different field name using :fieldname.

<Term> ::= NUM         # matches NUM but does not capture it
<Term> ::= <NUM>       # captures NUM as field `num`
<Term> ::= <NUM:age>   # captures NUM as field `age`

Providing a different field name is especially important when a rule has two terms on the right-hand side that would generate two fields with the same name.

<Pair> ::= <NUM:x> <NUM:y>

If we did not provide the names x and y, plcc-ng would have generated two fields with the same name num, which would not compile and run.

The type of a capture token field is Token.

Capturing nonterminals

All nonterminals are captured. Their field names will be the nonterminal name lower-cased. You may provide a different field name using :fieldname.

<Program> ::= <Expr>              # captures Expr as field `expr`
<Program> ::= <Expr:expression>   # captures Expr as field `expression`

Providing a different field name is especially important when a rule has two terms on the right-hand side that would generate two fields with the same name.

<Pair> ::= <Expr:left> <Expr:right>

If we did not provide the names left and right, plcc-ng would have generated two fields with the same name expr, which would not compile and run.

The type of a capture nonterminal field is the class with the same name as the nonterminal.

Alternative rules and subclasses

When a nonterminal has multiple rules, each alternative must be given a name. This name will be the class generated for that rule and will be a subclass of the nonterminal class.

<Expr:LitExpr> ::= <NUM:num>
<Expr:AddExpr> ::= PLUS <Expr:left> <Expr:right>

This generates:

abstract class Expr { }
class LitExpr extends Expr  { Token num; }
class AddExpr extends Expr  { Expr left; Expr right; }
class Expr:
    pass

@dataclass
class LitExpr(Expr):
    num: Token

@dataclass
class AddExpr(Expr):
    left: Expr
    right: Expr

Subclass names are required only when a non-terminal has more than one production.

Repetition rules

The **= form matches zero or more occurrences of a pattern, with an optional separator:

<Args>  **= <Expr:expr>
<Pairs> **= <WHOLE:x> <WHOLE:y> +COMMA

Captured symbols become parallel lists:

class Args { List<Expr> exprList; }
class Pairs { List<Token> xList; List<Token> yList; }
@dataclass
class Args:
    exprList: List[Expr]

@dataclass
class Pairs:
    xList: List[Token]
    yList: List[Token]

For example, for Pairs, assuming WHOLE matches an integer token, given this input:

2 3, 5 6, 7 8

xList and yList would contains the following:

xList = [Token("2"), Token("5"), Token("7")]
yList = [Token("3"), Token("6"), Token("8")]

When given, the separator token must appear between each ocurrance of the right-hand side. Assuming WHOLE matches an integer token, valid inputs for Pairs include:

1 2
3 4, 5 6
7 8,9 10  , 11 12

The separator token is not captured in the parse tree.

Parse algorithm

plcc-ng generates a top-down LL(1) parser. Every parsing decision must be resolvable using a single lookahead token. If multiple alternatives can match the same lookahead token, plcc-ng reports an LL(1) conflict.