The Language of Languages: META II, EBNF and ABNF (Part 2)
Intro
This article continues my exploration into formal grammars. If you haven’t read the first part yet, I highly recommend you start there before diving into this one. Otherwise, some things might seem unclear or not fully explained. And if you spot any inaccuracies or feel I should add something interesting, feel free to let me know.
META II
In 1963-1964, just a few years after the creation of BNF, Dewey Val Schorre — who was at UCLA at the time — developed a programming language called META II. Fun fact: 1963 also saw the development and standardization of the ASCII table.
The main purpose of this language? Building compilers. According to its documentation, META II was described as a “syntax-oriented language for writing compilers” and a language similar to BNF. It introduced several concepts and structures that, in one form or another, are still around today in extensions like EBNF and ABNF, which we’ll get to later.
META II replaced BNF’s recursive definitions with loops. To achieve this, a mathematical grouping using the symbols $(
and )
was introduced. Essentially, the following expression:
$(foo)
means “zero or more repetitions of foo”. For example, in a BNF rule like:
<expr> := <term> | <expr> + <term> | <expr> - <term>
you’d write something like:
expr = term
$( '+' term .out('ADD')
/ '-' term .out('SUB'));
A few things to note here:
- Target language symbols (terminals) are now defined using single quotes (like
+
and-
in example above). - Gone are the angle brackets, which imposed some restrictions on the character set.
- The metalanguage connector
::=
was replaced with=
. - Instead of the vertical bar
|
, a forward slash/
was used for alternatives. - Rules now end with a semicolon
;
. - Users were now dealing with equations rather than metalanguage rules, as was the case with BNF.
Let’s walk through how the grammar works, step by step.
The expression is evaluated left to right. First up, we check term
. If matching term
fails, the entire expression is discarded, and the parser moves on. If successful, the parser enters $(...)
. It then tries to match +
. If that fails, it goes to the alternative /
and tries -
. If either match works, the parser then tries to match the term
to the right of the operator. If both alternatives fail, meaning the second term
couldn’t be matched, the result is still successful, but only for the first term
we matched right after the =
. Easy enough, right? Now, the fun part: since $(...)
means zero or more iterations, after successfully matching the second term
, the loop repeats until the match fails. And that’s how the problem of recursive definitions in BNF was solved.
As for those quirky little .out('ADD')
bits — they’re just service code that outputs so-called opcodes to the standard output. Here’s how it works: if a match is successful, say - term
, we print ADD
.
I found it amusing that while the documentation insists on using a semicolon at the end of each equation, it actually uses a period followed by a comma instead (“due to keypunch limitations”). It reminded me of the ALGOL 60 report, where such issues were handled manually.
If we wanted to make a syntax element optional, META II offered a keyword: empty
.
subsecondary = '*' primary / empty ;
secondary = primary subsecondary ;
The first line means: either the *
symbol followed by the non-terminal primary
, or nothing at all.
META II also introduced three data types: identifiers (denoted by id
), strings (denoted by string
), and numbers (denoted by number
). Now, let’s look at an example in a target language:
A
A + B
A + B * C
(A + B) * C
This is a made-up algebraic language with operator precedence and grouping. To cover this language in META II grammar, you’d need to write something like this:
expr3 = id / '(' expr1 ')' ;
expr2 = expr3 ('*' expr2 / empty) ;
expr1 = expr2 ('+' expr1 / empty) ;
This isn’t a complicated grammar. Upon closer inspection, it all becomes pretty obvious, except for one concept we haven’t encountered yet — grouping. Note that in the first line, the parentheses are enclosed in quotes, indicating that the parentheses belong to the target language. Grouping in the second and third lines means that matching the group is only considered successful if each alternative inside the group matches entirely.
BNF was a formal language, not tied to any specific implementations or computations. But in META II, we see how things began to change.
The rest of the META II syntax is no more complicated than what I’ve shown here, and the language itself offers plenty of features. But we won’t dwell on it for too long—I just wanted to highlight some of the traits that we’ll explore further.
Later, in the 1970s, a well-known tool named yacc came along, using BNF productions that closely resembled those from META II. yacc is commonly used as a generator for LALR parsers, and its roots clearly trace back to BNF. When you hear that yacc uses BNF, keep in mind — it’s only partly true, as the original BNF was quite different.
EBNF Notation
EBNF (Extended Backus-Naur Form) is essentially an extended version of BNF. It’s a collection of improvements on the classic BNF format, designed to simplify and condense descriptions while maintaining the same expressive power.
The official year of EBNF’s adoption is considered to be 1996, with ISO being the standardizing organization. The journey of EBNF’s development wasn’t entirely smooth. There were objections at the time, with people pointing out that by 1996, everyone and their dog had already created their own version of notation, and that introducing yet another one called EBNF wasn’t going to solve much. They even claimed that ISO itself didn’t always use EBNF in its own documents. But let’s not get too caught up in those debates. Instead, let’s focus on what sets EBNF apart from BNF.
From a purely stylistic point of view, here are a few changes that became popular over 30 years of modifications and borrowing from the original BNF notation:
- Target language symbols are now defined using quotes (this was also done in META II).
- Angle brackets around non-terminals have been removed (again, much like META II).
- The metalanguage connector
::=
has been replaced with=
(META II strikes again!). - In EBNF, anything between
(*
and*)
is considered a comment. - A semicolon (
;
) or period (.
) now marks the end of a rule (thanks, META II).
But in my opinion, the most important changes were made to the very structure of grammatical definitions. Let’s go over some of them:
Optional Elements
We saw how this was done in META II, so EBNF wasn’t breaking new ground here. To indicate optional parts in a grammar, EBNF uses square brackets [ ]
. The rule below shows that the minus sign is optional:
term = [ "-" ] factor ;
For instance, if the values of factor
are decimal numbers, then some valid values for term
could be:
-1
0
123123
-987987
Repetitions
The idea of recursive definitions in EBNF is quite similar to META II’s approach. To indicate that an object (or group of objects) can repeat zero or more times, EBNF uses curly braces { }
. By “object,” we mean a terminal (a symbol from the target language) or a non-terminal. Let’s see how this works:
args = arg { "," arg } ;
This rule defines a list of arguments separated by commas. For example, if arg
is a sequence of ASCII characters in the range 65-90, here are some valid values for args
:
HOST, PORT, DATABASE
NAME
A, B, C, D
Let’s look at another example. Suppose we have a language with the following function declaration syntax:
function connect (string host, port, string password)
Here, we have a fictional language where the argument type can be omitted (in this case — port
). To define the grammar for the function arguments, we could write the following rules:
args = maybe typed { "," maybe typed } ;
maybe typed = [ type ] arg ;
As mentioned earlier, angle brackets in EBNF have been removed. Here’s where it gets tricky for the uninitiated reader — are maybe typed
two non-terminals or one?
Grouping
I’ve already shown you how grouping was introduced in META II. Now let’s see how it’s presented here. To describe the grammar of an arithmetic operation in BNF, you would write something like this:
<expr> ::= <term> <op> <expr>
<op> ::= - | +
In EBNF, such a format is allowed, but if we put stylistic differences aside, it’s much more concise to write it like this:
expr = term ( "-" | "+" ) expr ;
Concatenation (Sequences)
Concatenation refers to an ordered list (sequence) of syntactic terms separated by commas. The rule below describes an assignment followed by a semicolon and then a whitespace character, in that exact sequence:
end assignment = { assignment, ";", white space } ;
With that, I’ll wrap up this brief overview of the key differences. There’s plenty of documentation on EBNF, unlike BNF or META II. For further reading, I suggest starting with the following links:
ABNF Notation
By 1977, ARPANET had been using a few informal standards for text messages sent between its host computers. It became clear that these methods needed to be formalized, ensuring they supported what was becoming inevitable. The result of this effort was the “Standard for the Format of ARPA Network Text Messages” (RFC733), followed by an updated and expanded version in 1982 (RFC822). These documents describe a modified version of BNF, which was well-suited for ARPANET’s needs at the time. This modified version was referred to as the “Augmented Backus-Naur Form.”
It’s tricky to pinpoint the exact origins of ABNF from just the RFCs, as it’s unclear where else and when this notation was used. By that time, several modifications to BNF already existed. Major projects, whether in programming language development, message protocols, or data formats, often created their own customized versions of BNF. Eventually, ARPANET undertook the task of standardizing and organizing their changes to the original BNF. The result of this work was the “Augmented BNF for Syntax Specifications: ABNF” (RFC5234) in 2008.
However, it’s worth noting that by 2008, ABNF had already been in use within ARPANET for decades. And remember, EBNF didn’t come around until 1996, which means ABNF was already in the game about 20 years earlier. Many RFCs included a copy of the ABNF definition. It usually looked something like this: “Here’s a document where we describe some concepts, and at the beginning, we define the language we’ll use for that description.” It took many years, but eventually, an official document came out outlining the standard.
ABNF is very similar to EBNF, with a few key differences in how choices, optional elements, and repetitions are represented. Let’s go over the main differences.
Non-terminal names are defined using the regular expression \b[A-Za-z][-A-Za-z0-9]+\b
. The use of whitespace characters, which was allowed in the original BNF, is not permitted. In EBNF, remember, spaces could be used in defining syntactic terms.
Enclosing non-terminals in angle brackets is optional but can be used whenever it makes recognizing non-terminal names easier. To indicate choices, ABNF uses a forward slash /
instead of the vertical bar |
. Optional elements are enclosed in square brackets [ ]
, and language symbols are still represented using strings in quotes.
Repetitions are indicated using the asterisk *
. To specify that an element or group of elements should repeat n
times, the syntax n*
is used. To specify a range of repetitions from n to m
, the syntax n*m
is used. For example, the following definition in EBNF:
{ expansion }
would be written like this in ABNF:
*( expansion )
Also, if say we have a rule defining a number:
DIGIT = %x30-39
to specify a two-digit number, now we could use:
2DIGIT
To help illustrate these differences, here’s an example of date and time defined using ABNF, taken from RFC5322:
date-time = [ day-of-week "," ] date time [CFWS]
day-of-week = ([FWS] day-name) / obs-day-of-week
day-name = "Mon" / "Tue" / "Wed" / "Thu" /
"Fri" / "Sat" / "Sun"
date = day month year
day = ([FWS] 1*2DIGIT FWS) / obs-day
month = "Jan" / "Feb" / "Mar" / "Apr" /
"May" / "Jun" / "Jul" / "Aug" /
"Sep" / "Oct" / "Nov" / "Dec"
year = (FWS 4*DIGIT FWS) / obs-year
time = time-of-day zone
time-of-day = hour ":" minute [ ":" second ]
hour = 2DIGIT / obs-hour
minute = 2DIGIT / obs-minute
second = 2DIGIT / obs-second
zone = (FWS ( "+" / "-" ) 4DIGIT) / obs-zone
Conclusion
As we’ve journeyed through the evolution of formal grammar notations, from BNF’s foundational principles to the more expressive META II and further extensions like EBNF and ABNF, it becomes clear how integral these systems are to the development of compilers, parsers, and overall programming language design. Each notation brought innovations, adapting to the needs of their time while introducing concepts that continue to shape modern syntax.
What started as simple efforts to define language structures evolved into the powerful, flexible grammar notations that developers rely on today. META II’s loop-based handling of recursive definitions, the enhancements of EBNF, and the practical applications of ABNF show that, while the core ideas remain timeless, the implementation and usability of formal grammars have grown significantly.
Whether you’re working on building a compiler, writing a parser, or even exploring syntax definitions, understanding these notations gives you a strong foundation to tackle complex language constructs efficiently.