The Language of Languages: Exploring BNF (Part 1)

18 minute read

Intro

In this article, I’ll take you on a journey through grammars and the overarching systems used to describe them, like BNF, EBNF, and ABNF. Don’t expect pinpoint academic precision here — this is more of an introductory read, designed to get you acquainted with the essentials.

I wasn’t a firsthand witness to the events I’ll discuss, nor do I claim to provide flawless, definitive information. Everything you read is my take, and it may change over time. Think of it as a personal learning note that you might find useful.

The article is divided into several parts. This is the first, and you can find the second part here. By the time you finish reading, you should be able to identify and interpret the most commonly used notation for grammars. Along the way, I’ll point you to extra resources if you want to dig deeper into any specific topic.

Motivation

You’re probably wondering why you should care about this topic or how it could be useful to you. Honestly, I don’t know your reasons. I wrote it because I work closely with several programming languages these days. Curiosity drove me to learn more about the tools I use every day.

What initially sparked this project was the flood of conflicting information I kept encountering, whether from colleagues, books, or online resources. The contradictions were everywhere, and they left me confused enough to start organizing my own understanding. I also wanted to fill in the gaps in my knowledge.

I’m not going to list every conceivable field where this subject might prove useful. Still, here are a few areas where you might find it relevant:

  • You’re developing a static code analyzer.
  • You’re building a programming, markup, or descriptive language.
  • You want to add new features to an existing language.
  • You’re analyzing some textual data, like HTTP headers.
  • You’re working on language support for Emacs, IntelliJ, Atom, etc.
  • You’re creating a new template engine for your web framework.
  • You’re building a client that communicates with a daemon and need to parse what you read from its socket.
  • You want to “translate” code from one version of a language into another or even into a different target language.
  • You’d like to deepen your understanding of a language you’re already using.

Basically, if you ever use the word “parse” in your day-to-day work, this article is for you.

Defining Language

In computer science, languages define computational domains. For example, SQL operates within the domain of databases, while HTML serves as the markup language for web pages.

Programming languages, protocol specifications, query languages, file formats, template systems, formal languages, configuration files, markup languages, and meta-languages—they all dictate how we compute things. Each one places some constraints on how computations are carried out and, in essence, shapes the process itself.

But what shapes the languages? What defines their boundaries and sets the rules? Two things: syntax and semantics. Syntax is about what can be expressed (written or said) using the language, while semantics defines how that expression should be interpreted.

In an ideal world, both syntax and semantics would be formally defined. When “formal” is used in the context of languages, it refers to an abstract description not tied to any specific implementation. And while describing semantics formally is often quite challenging in practice (it’s frequently defined implicitly or scattered across the entire language specification), the syntax is usually presented in a more or less autonomous and explicit form in every language, even if not always formally.

Grammars: The Languages of Languages

The description of syntax is the job of grammars. Grammars are the “languages of languages.” Behind every language lies a grammar that defines its structure and sets the allowable lexemes. A language cannot exceed the boundaries of its grammar — it is entirely shaped by it. While multiple grammars can generate the same language, each grammar defines its language uniquely.

Panini’s Linguistics

The idea of formally describing and defining languages goes back at least to Panini (Pāṇini), who lived around the 5th century BCE and is considered a precursor to modern structural linguistics, generative grammars, semiotics, and logic. Panini’s contributions are foundational to traditional and contemporary systems for analyzing Sanskrit and remain influential from both historical and theoretical perspectives.

Panini’s grammar sought to develop a complete and highly concise system of grammatical analysis for Sanskrit. In simpler terms, it aimed to formalize a systematic and consistent approach to understanding Sanskrit grammar.

The influence of Panini’s grammar on Western linguistics has been significant. Over the past two centuries, it has shaped the development of Western grammatical theory at various stages. In the early 19th century, proponents of the comparative approach recognized Panini’s principles of morphological analysis. In the 20th century, the eminent linguist Leonard Bloomfield, known for his work on Indo-European languages, Tagalog, and Algonquian languages, drew on Panini’s system in his seminal work on Algonquian grammar.

Panini’s contribution to contemporary linguistics is so vast that an entire field — Paninian studies — has emerged. His compilation of Sanskrit verb roots continues to be reprinted and studied by students across India to this day. Modern linguistics acknowledges Panini’s grammar as the most complete generative grammar ever written, and many of the technical elements he introduced are still in use today.

For those looking to explore Panini’s linguistic system in greater depth, I recommend Paul Kiparsky’s article on Paninian grammar in The Encyclopedia of Language and Linguistics.

Our Modern Era

Panini’s notation holds power comparable to a modern system devised by John Backus, with both sharing many key characteristics. According to Noam Chomsky’s famous classification, there are four types of grammars:

  1. Unrestricted grammars (Type 0)
  2. Context-sensitive grammars (Type 1)
  3. Context-free grammars (Type 2)
  4. Regular grammars (Type 3)

In the world of computer science, context-free grammars reign supreme. The term “context-free” means that a rule applies in the same way regardless of the context surrounding a symbol. Take C, for example. The expression a * b can be interpreted in different ways depending on the context: either as a pointer declaration or as a multiplication operation. Context-free grammars sidestep this kind of ambiguity by describing recursive syntactic structures in a way that doesn’t rely on such contextual cues.

This makes context-free grammars powerful enough to describe the recursive structure of many (though certainly not all) languages.

Components of a Context-Free Grammar

At the heart of every grammar lies a set of rules. These rules are the blueprint that defines the structure of the language. Each rule consists of two main parts: (1) a name, and (2) what that name can be expanded into — essentially, a breakdown of how that component is structured.

Let’s break it down with an example. Suppose we’re creating a grammar to parse English sentences. We could define a rule that says:

noun phrase (1) can be expanded into article noun (2)

This tells us that a noun phrase “can consist of” an article (like “the”) followed by a noun (like “dog”). Simple enough, right? So, according to this rule, the phrase “the dog” qualifies as a noun phrase. But what if we want to describe something a little more complicated, like an algebraic expression? We might define a rule that looks something like this:

expression (1) can be expanded into digit + digit (2)

Here we’re saying that an expression can consist of two digits with a plus sign in between, like 2 + 3. This is a simplified example, of course. In reality, the rule for an expression might look like this:

expression (1) can be expanded into expression + expression (2)

Now we’ve introduced recursion: an expression can consist of two smaller expressions, with a + sign joining them. It’s this kind of recursive structure that allows context-free grammars to describe complex, nested syntactical elements, such as mathematical formulas or entire sentences.

In formal grammar notation, instead of saying “can be expressed as” or “can be expanded into”, we simply use an arrow (→), like so:

noun-phrase → article noun
expression → expression + expression

This notation allows for clear and concise rules. And, of course, the right-hand side of the arrow can get as intricate as needed. For example, let’s consider the formula for the area of an isosceles triangle:

S = 1/2 * a^2 * sin(α)

From a grammatical standpoint, this can be broken down into:

S → expression
expression → expression * expression

Each part of the formula is recursively defined in terms of simpler expressions.

This kind of breakdown, where one element is defined in terms of others, is called a production. A well-constructed grammar ensures completeness, meaning it covers every construct the language allows. To give you a more concrete example, here’s a classic context-free grammar for arithmetic expressions:

expr → term + expr
expr → term

term → term * factor
term → factor

factor → ( expr )
factor → const

const → integer

The first rule tells us that an expr (expression) can either be a term followed by a plus sign and another expr, or just a term. Similarly, a term can either be a term multiplied by a factor or simply a factor. And so on, until we get to the const, which is just an integer.

The beauty of this is that each element in the grammar builds upon the simpler ones. Step by step, we reduce the complex into the simple, and finally, into indivisible units known as terminals. Terminals are the basic symbols of the language — things that can’t be broken down any further. Everything else is called a non-terminal.

Let’s make things clearer with some definitions:

  • Terminal: A terminal is an element that exists as-is in the language—something with a concrete, fixed value. In our example, numbers like 3 and 7 are terminals.
  • Non-terminal: A non-terminal represents some abstract construct, like an expression or a formula. It’s a placeholder for something more complex, which can eventually be broken down into terminals.

So how do we know that the expression 3 * 7 is valid under this grammar? Simple. We apply the rules:

  1. An expr can be a term
  2. A term can be a term * factor
  3. A term can also be just a factor
  4. So term * factor can become factor * factor
  5. A factor can be a const
  6. So factor * factor can become const * const
  7. A const can be an integer, giving us integer * integer, i.e., 3 * 7
By following these steps, we “produce” the expression 3 * 7 in seven steps, proving that it’s a valid expression. Of course, we could also reverse the process, starting from the expression and breaking it down into its components.
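
To make the derivation above tangible, here’s a minimal recursive-descent recognizer in Python for this grammar (a sketch of my own, not anything from the original reports). One liberty is taken: the left-recursive rule term → term * factor is flipped to term → factor * term so that naive recursive descent terminates; this changes grouping but not which strings are accepted.

import re

# Sketch: recognize strings of the arithmetic grammar
#   expr   -> term '+' expr | term
#   term   -> factor '*' term | factor   (flipped from term '*' factor)
#   factor -> '(' expr ')' | const
#   const  -> integer
class Recognizer:
    def __init__(self, text):
        # Tokenizer shortcut: keep integers and operators, drop everything else.
        self.tokens = re.findall(r"\d+|[+*()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok):
        if self.peek() == tok:
            self.pos += 1
            return True
        return False

    def expr(self):
        if not self.term():
            return False
        if self.eat("+"):
            return self.expr()
        return True

    def term(self):
        if not self.factor():
            return False
        if self.eat("*"):
            return self.term()
        return True

    def factor(self):
        if self.eat("("):
            return self.expr() and self.eat(")")
        return self.const()

    def const(self):
        tok = self.peek()
        if tok is not None and tok.isdigit():
            self.pos += 1
            return True
        return False

def is_valid(text):
    r = Recognizer(text)
    return r.expr() and r.pos == len(r.tokens)

print(is_valid("3 * 7"))        # True
print(is_valid("(1 + 2) * 3"))  # True
print(is_valid("3 * + 7"))      # False

Running it confirms that 3 * 7 derives from expr, just as in our manual seven-step derivation.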

It’s also important to note that every grammar needs a starting non-terminal, which represents the highest level of abstraction — the starting point from which everything else is derived.

BNF Notation

A Brief History

BNF (Backus-Naur Form) was developed as a system for describing grammars, specifically designed to be readable by humans.

According to Wikipedia, the formal abstract systems for string transformation rules were introduced and studied by mathematicians like Axel Thue (1914), Emil Post (1920s–40s), and Alan Turing (1936). Noam Chomsky, who was teaching linguistics and information theory at MIT in 1956, combined linguistics and mathematics by adopting what was essentially Thue’s formalism to describe the syntax of natural languages. He also made a clear distinction between generative rules and transformational rules of text.

BNF first came into use as a meta-language for describing the ALGOL programming language and was later presented in the ALGOL 60 Report. John Backus, a developer of ALGOL and FORTRAN at IBM, proposed a “metalanguage” for describing the syntax of programming languages, first applied to IAL, now known as ALGOL 58 (1959). Backus’s metalanguage used Chomsky’s context-free grammar notation.

Later, in 1963, Peter Naur referred to this notation as “Backus Normal Form” in the ALGOL Committee’s report, but Donald Knuth suggested the name Backus-Naur Form (BNF) instead, since the notation isn’t truly a “normal form” in the traditional sense, and the name also credits the modifications Naur had introduced.

Thus, the notation for Chomsky’s context-free grammars became known as BNF (Backus-Naur Form). Many programming languages, data formats, and protocols have BNF descriptions in their specifications. Its formal nature means that it can be applied to practically any system. To write a BNF description, all you need is a text editor (or a sheet of paper) — it’s not tied to any specific programming language, though there are variations and extensions that cater to different needs.

In this article, we’ll stick to the original version of BNF, first used for ALGOL 58 and later revised in the ALGOL 60 Report.

Defining BNF

Each rule in Backus-Naur Form has the following structure:

<name> ::= <expansion>

BNF rules are production rules, which describe how components of a language are formed to create valid structures. A parser, in turn, reverses this process, breaking down language constructs.

Every non-terminal (represented as <name> in the example above) in BNF is enclosed in angle brackets < >, whether it appears on the left or right side of the rule. A non-terminal is a metalinguistic variable, meaning its value is a sequence of symbols.

The symbol ::= means “is defined as”, “can be replaced by”, or “expands into”. In earlier versions of BNF, the symbol :≡ was used instead.

A terminal symbol refers to a literal, such as + or begin, or a literal class, such as integer. These are concrete values that can be expressed unambiguously in the language.

The expansion on the right-hand side of the rule can contain both terminals and non-terminals, linked either in sequence or by choice. Placing non-terminals and/or terminals side by side implies they must appear in the specified order. A vertical bar | represents a choice (or alternative).

The symbols ::= and | are called metalinguistic connectors. The entire rule, from start to finish, is referred to as a metalinguistic formula.

In BNF, it’s permissible to use spaces and other whitespace characters, such as in the following example:

<relational operator>

Or:

<any sequence of symbols not containing ` or ' >

Technically, the original rules for naming non-terminals allowed any printable character (the POSIX class [[:print:]]) except newlines, which allows for entire “poems” to be written within angle brackets. Later versions simplified this rule to the regular expression \b[A-Za-z][-A-Za-z0-9]+\b. One thing remains unchanged: non-terminal names are case-insensitive, meaning <rulename>, <Rulename>, and <RULENAME> are treated as the same entity.
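
Just to illustrate the simplified convention (this snippet is my own; the pattern comes from the rule above):

import re

# Later-style rule names: a letter, then letters, digits, or hyphens.
RULE_NAME = re.compile(r"\b[A-Za-z][-A-Za-z0-9]+\b")

names = ["rulename", "Rulename", "RULENAME"]
print(all(RULE_NAME.fullmatch(n) for n in names))  # True

# Case-insensitive: all three spellings name the same non-terminal.
print(len({n.casefold() for n in names}))          # 1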

It’s worth noting that in the original ALGOL 60 Report, the word or was overlined instead of using the vertical bar |. I’ve even seen early versions of the report where the symbols, like :≡, were handwritten, while the rest of the text was typewritten. In my opinion, the report itself is interesting to examine from three perspectives:

  1. The target language — the language BNF was describing, in this case, ALGOL.
  2. The publication language — the set of symbols used to present the documentation.
  3. The hardware (or software) of that time, which could differ in the selection of characters and impose certain variations in publication.

The report itself clarified that the symbols used in it were chosen for human readability, not due to any limitations in encoding, computer hardware, or mathematical constraints.

Examples

To give a concrete example, let’s take a look at the format of Canadian postal codes. Here are three valid codes:

K1N 6N5
M5W 2E4
X0A 1A1

To describe this format using BNF, we might write:

<postalcode> ::= <letter> <number> <letter> <number> <letter> <number>

Here, we see how sequences are expressed in grammars — one object follows another in the order required by the grammar. Back then, terms like “concatenation” were often used in place of “sequence.”

As you can see, we could also describe Canadian postal codes like this:

<postalcode> ::= <alnum> <alnum> <alnum>

     <alnum> ::= <letter> <number>
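
If you ever needed to check codes against this format programmatically, the grammar maps directly onto a regular expression. Here’s a quick Python sketch of mine, assuming <letter> means an uppercase Latin letter and <number> a single digit, with the middle space (left implicit in the BNF) made explicit:

import re

# <letter><number> three times, with the conventional space in the middle.
POSTAL = re.compile(r"[A-Z]\d[A-Z] \d[A-Z]\d")

for code in ["K1N 6N5", "M5W 2E4", "X0A 1A1", "12A 3B4"]:
    print(code, bool(POSTAL.fullmatch(code)))  # the first three are valid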

Let’s now explore a slightly more complex but classic grammar for algebraic expressions:

; expr can be expanded into term, or expr + term

  <expr> ::= <term> | <expr> + <term>
  <term> ::= <factor> | <term> * <factor>
<factor> ::= ( <expr> ) | <const>

In this example, we’ve used both sequence and choice elements (with the | symbol). The first line can be read as “expr can be either a term or an expr followed by a + and another term.” For clarity, the rules are aligned on the left, which is acceptable. Notice that BNF allows comments; they start with a semicolon (;) and run until the end of the line.

In BNF grammars, terminals are lexemes that represent themselves (such as +, switch, or <<=) or the name of a literal class (like integer). Literal classes are usually defined using other means, such as regular expressions.

For a deeper dive into recursive definitions, let’s look at another example from the ALGOL 60 Report:

<ab> ::= ( | [ | <ab> ( | <ab> <d>

This formula defines a recursive rule for forming the value of <ab>. It indicates that <ab> can be ( or [, or, given some valid <ab>, that value followed by either another ( or some value of the non-terminal <d>. For example, if <d> represents decimal numbers, the following would be valid values for <ab>:

[(((1(37(
(12345(
(((
[86
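
Incidentally, this particular recursive rule describes a regular language, so it can be collapsed into a single regular expression. A quick Python sketch of mine, taking <d> to be decimal numbers:

import re

# <ab> starts with '(' or '[' and continues with any mix of '(' and digits.
AB = re.compile(r"[([][(0-9]*")

for s in ["[(((1(37(", "(12345(", "(((", "[86", "86["]:
    print(s, bool(AB.fullmatch(s)))  # all but the last are valid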

That covers the core machinery of BNF: a tool for describing formal grammars that has been foundational in computer science since its inception.

Continuing with the topic of choice and recursive definitions, let’s derive a grammar rule for JavaScript variable names:

<identifier> ::= <prefix> | <letter> | <identifier> <digit> | <identifier> <letter> | <identifier> <prefix>
<prefix> ::= _ | $
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<letter> ::= ...

For brevity, I’ve omitted the full definition of <letter>. So, what does this rule tell us? Essentially, an identifier, which we can use for naming variables, must start with an underscore (_), a dollar sign ($), or a letter (<letter>). The subsequent characters can be any of those same symbols or digits (0-9).
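
As a sanity check, here’s that rule translated into a Python regular expression (my own sketch, restricted to ASCII; real JavaScript identifiers also allow Unicode letters):

import re

# First character: letter, '_' or '$'; the rest may also include digits.
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

for name in ["_private", "$el", "camelCase", "x2", "2x"]:
    print(name, bool(IDENTIFIER.fullmatch(name)))  # all but "2x" are valid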

In formal grammars, it’s common to introduce a set of reserved and predefined lexemes. Let’s take a look at an example of how they might be written. Though their lexicographic representation is self-explanatory, I’ve added comments to the right to give you a feel for how such blocks were typically formatted back in the day:

        <empty> ::= ; an empty string
          <eol> ::= ; represents the system’s end-of-line specifier
        <space> ::= ; a space character
      <comment> ::= ; comment symbol ;
<connective-eq> ::= ; the metalinguistic connective ::=
<connective-or> ::= ; the metalinguistic connective |

Usually, there would be another non-terminal inserted between ::= and ;, something like <any sequence not containing ;>. Since this is not actual code but rather a specification, it’s perfectly acceptable:

<any sequence not containing ;>

Of course, if we wanted to define the grammar for BNF rules themselves, it would look something like this:

<syntax>         ::= <rule> | <rule> <syntax>
<rule>           ::= <opt-whitespace> < <rule-name> > <opt-whitespace>
                     <connective-eq> <opt-whitespace> <expression> <line-end>
<opt-whitespace> ::= <space> <opt-whitespace> | <empty>
<expression>     ::= <list> | <list> <opt-whitespace> <connective-or> <opt-whitespace>
                     <expression>
<line-end>       ::= <opt-whitespace> <eol> | <line-end> <line-end>
<list>           ::= <term> | <term> <opt-whitespace> <list>
<term>           ::= <literal> | < <rule-name> >
<literal>        ::= <text>
<text>           ::= <empty> | <character> <text>
<character>      ::= <letter> | <digit> | <symbol>
<rule-name>      ::= <letter> | <rule-name> <rule-char>
<rule-char>      ::= <letter> | <digit> | -

<digit>          ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

<symbol>         ::= <connective-or> | <space> | <comment> | # | $
                   | + | , | - | . | / | : | ! | > | = | <
                   | ? | @ | [ | \ | ] | ^ | _ | ` | ' | " | { | }
                   | ~ | % | & | ( | ) | *

<letter>         ::= A | B | C | D | E | F | G | H | I | J
                   | K | L | M | N | O | P | Q | R | S | T
                   | U | V | W | X | Y | Z | a | b | c | d
                   | e | f | g | h | i | j | k | l | m | n
                   | o | p | q | r | s | t | u | v | w | x
                   | y | z

In this example, <rule-name> and <text> need to be replaced with the declared rule names or literal text, respectively.
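
With this grammar in hand, a crude parser for individual rule lines is easy to sketch in Python. This one is my own shortcut-laden take; in particular, splitting on | would misread a rule that uses | as a terminal:

import re

# Match '<rule-name> ::= expansion' and capture the two sides.
RULE = re.compile(r"\s*<(?P<name>[A-Za-z][-A-Za-z0-9]*)>\s*::=\s*(?P<body>.+)")

def parse_rule(line):
    m = RULE.fullmatch(line)
    if m is None:
        raise ValueError("not a BNF rule: " + line)
    # Alternatives are separated by '|'; each is a sequence of terms.
    alternatives = [alt.split() for alt in m.group("body").split("|")]
    return m.group("name"), alternatives

print(parse_rule("<term> ::= <factor> | <term> * <factor>"))
# ('term', [['<factor>'], ['<term>', '*', '<factor>']])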

In the original ALGOL 60 Report, terminal symbols such as if, then, and other keywords were simply underlined to indicate their terminal status. This imposed some limitations on the lexemes that could be used. For instance, the report recommended avoiding reserved lexemes commonly used in mathematical operations, like abs, sign, and cos. However, this practice didn’t persist in later BNF modifications. It’s worth noting that formal grammars in the theory of formal languages still follow a similar convention:

DIGIT → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Such notation relies on whitespace characters to separate symbols for proper rule interpretation. There are numerous online examples claiming to be BNF that use quotes:

<rule> ::= "A" | "B" | "C"

But in the original BNF, there were no string literals or data types in this sense. So, those examples you may come across don’t represent true BNF.

To describe all possible terminal symbols, BNF uses explicit enumeration, which can be cumbersome and makes the notation verbose. It’s common practice to introduce a symbol class representing a range of characters.

Line breaks are treated as the end of a rule. If a rule needs to be broken across multiple lines, the continuation is indented relative to the line that introduces the non-terminal. Zero indentation for continuation lines is not allowed in BNF:

<proper string> ::=
    <any sequence of symbols not containing ` or ' >
    | <empty>

Interestingly, I haven’t come across any strict guidelines on the number of whitespace characters required for indentation, nor have I found clarity on whether subsequent lines must maintain the same level of indentation. Could we write something like this?

<foo> ::= <bar>
    | <baz>
        | <buz>

More or less concrete specifications for indentation can be found in RFC 5234, but that document describes ABNF, which we’ll dive into in the second part of this series.

And with that, we conclude the first part of our journey into the world of BNF grammars.

Conclusion

Everything we’ve explored in this article covers the essentials of BNF notation. However, it’s important to recognize that BNF alone isn’t sufficient to describe a full-featured programming language. BNF is a formal grammar description system: a tool for creating specifications, not implementations. Think of it as a guide that a developer can follow, but not something that directly handles the dirty work.

In any real-world implementation, you’ll eventually need to deal with things like handling non-printable characters, defining end-of-line markers, or managing empty strings and NULL. You probably won’t write out every single letter in <letter>, opting instead for regular expressions or literal classes to simplify things. In practice, grammars are often supplemented by additional programmatic solutions.

There are numerous dialects and extensions of BNF, some of which we’ll explore in future discussions. But for now, remember that the original BNF, introduced in the ALGOL 60 Report, is considered the classical version in computer science. It remains one of the oldest computing notations still in use today.

Stay tuned for Part 2, where we’ll dive deeper into META II, EBNF, and ABNF and their history.