November 20th, 2014

Why is it so hard to evolve a programming language?

Parsers.

We use weak parsing algorithms: often hand-written recursive descent parsers, sometimes PEGs. Usually there is a lexing layer that treats keywords specially, assigning each one a part of speech based on the word itself rather than on where the grammar finds it.

This makes writing a parser easy, particularly for those hand-written parsers. But reserved keywords are also a major reason we can’t evolve languages: adding a new keyword breaks old programs that were already using that word as a name.

The alternative is to push identification of keywords out of the lexer and into the grammar. The part of speech for a word is then determined by where it’s used. This permits some odd-looking programs, but it keeps existing ones working. Imagine JavaScript letting you write var var =. It’s not ambiguous, since a keyword can’t appear in the position where a variable name is expected. Whether the first var is a keyword or a variable name can’t be known without some lookahead, though: in var = it is a variable name, and in var foo it is a keyword.
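As a rough illustration of deciding part of speech by position, here is a minimal Python sketch. The function names (tokenize, classify) and the toy rule that a declaration is var followed by a name at the start of a statement are assumptions made for this example, not how any real JavaScript engine works.

```python
import re

NAME_RE = re.compile(r"[A-Za-z_]\w*")

def tokenize(src):
    # The lexer only splits text; it does not reserve any word.
    return re.findall(r"[A-Za-z_]\w*|\d+|[=;]", src)

def is_name(tok):
    return NAME_RE.fullmatch(tok) is not None

def classify(tokens):
    """Label each token, deciding keyword vs. name by position.

    Toy rule (an assumption of this sketch): `var` is a keyword only at
    the start of a statement and only when another name follows it.
    """
    labelled = []
    at_statement_start = True
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "var" and at_statement_start and nxt is not None and is_name(nxt):
            labelled.append(("keyword", tok))  # one token of lookahead decided it
        elif is_name(tok):
            labelled.append(("name", tok))
        elif tok.isdigit():
            labelled.append(("number", tok))
        else:
            labelled.append(("punct", tok))
        at_statement_start = tok == ";"
    return labelled

if __name__ == "__main__":
    for src in ("var var = 1;", "var = 2;", "var foo = 3;"):
        print(src, classify(tokenize(src)))
```

The single token of lookahead (nxt) is what separates the two readings of the first var.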

This usually means using better parsers. Hand-written parsers can keep a couple of tokens of buffered state, allowing an unshift or two to put tokens back when a phrase doesn’t match; generated parsers can do better and use GLR; and a fully dynamic parser working off the grammar as a data structure can use Earley’s algorithm.
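The buffered-token idea for a hand-written parser might look something like the following Python sketch. TokenStream, expect, and parse_statement are hypothetical names, and the two-form statement grammar (a declaration or a plain assignment) is an assumption chosen to keep the example small.

```python
class TokenStream:
    """Token source with a small push-back buffer."""

    def __init__(self, tokens):
        self._source = iter(tokens)
        self._buffer = []  # tokens that were put back, most recent last

    def next(self):
        if self._buffer:
            return self._buffer.pop()
        return next(self._source, None)  # None signals end of input

    def unshift(self, tok):
        # Put a token back so the next call to next() returns it again.
        self._buffer.append(tok)


def expect(ts, want):
    got = ts.next()
    if got != want:
        raise SyntaxError(f"expected {want!r}, got {got!r}")


def parse_statement(ts):
    """Parse `var NAME = VALUE` or `NAME = VALUE` (toy grammar)."""
    first = ts.next()
    if first == "var":
        second = ts.next()
        if second is not None and second.isidentifier():
            # `var` was followed by a name, so it was the keyword.
            expect(ts, "=")
            return ("declare", second, ts.next())
        # The declaration phrase did not match: put the token back
        # and fall through to treat `var` as an ordinary name.
        if second is not None:
            ts.unshift(second)
    expect(ts, "=")
    return ("assign", first, ts.next())


if __name__ == "__main__":
    print(parse_statement(TokenStream(["var", "var", "=", "1"])))
    print(parse_statement(TokenStream(["var", "=", "2"])))
```

One unshift is enough here because the two phrases diverge after a single token; a grammar with longer shared prefixes would need a deeper buffer or a stronger algorithm.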

Position-dependent keywords are problematic for PEGs, though. Once an alternative has matched, a PEG won’t go back and try another interpretation. Once it has chosen a part of speech for a word, it sticks. That’s the rationale behind its ordered choice operator: alternatives must have a clear precedence. It is, in essence, an implicit way to mark which part of speech something is in a grammar.
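To make the commitment concrete, here is a small Python sketch of PEG-style combinators over a token list. The helpers (tok, seq, choice) and both toy grammars are assumptions written for this illustration, not any particular PEG library. In the first grammar the keyword-or-name decision is made on the word alone and never revisited; in the second, the choice spans the whole statement, so ordered choice can fall through to the name reading.

```python
def tok(pred, tag):
    """Match one token satisfying pred and tag it with a part of speech."""
    def parse(toks, i):
        if i < len(toks) and pred(toks[i]):
            return i + 1, (tag, toks[i])
        return None
    return parse

def seq(*parsers):
    """Match each parser in turn; fail if any one fails."""
    def parse(toks, i):
        values = []
        for p in parsers:
            result = p(toks, i)
            if result is None:
                return None
            i, value = result
            values.append(value)
        return i, values
    return parse

def choice(*parsers):
    """PEG ordered choice: the first alternative that succeeds wins for good."""
    def parse(toks, i):
        for p in parsers:
            result = p(toks, i)
            if result is not None:
                return result
        return None
    return parse

KEYWORD = tok(lambda t: t == "var", "keyword")
NAME = tok(str.isidentifier, "name")
EQ = tok(lambda t: t == "=", "punct")
NUM = tok(str.isdigit, "number")

# Grammar 1: the part of speech is chosen per word, before context is seen.
word = choice(KEYWORD, NAME)
committed = seq(word, choice(seq(NAME, EQ, NUM), seq(EQ, NUM)))

# Grammar 2: the choice covers the whole phrase, so a failed declaration
# lets the second alternative re-read `var` as a name.
stmt = choice(seq(KEYWORD, NAME, EQ, NUM), seq(NAME, EQ, NUM))

if __name__ == "__main__":
    tokens = ["var", "=", "1"]
    print(committed(tokens, 0))  # first token stays tagged as a keyword
    print(stmt(tokens, 0))       # first token is tagged as a name
```

The second grammar works only because its author placed the choice around enough context and ordered the alternatives by hand, which is the implicit part-of-speech marking described above.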

Backward-incompatible changes

It’s always tempting to make a ‘clean break’ in a language; misfeatures build up as we evolve it. But this is the biggest disservice we can do to our users: a clean break breaks every program they have ever written. The result is a new language, and everyone is starting fresh.

Ways forward

Pragmas. "use strict" is the one JavaScript has. They’re ugly and they don’t scale well, so they have to be kept to a minimum. Version selectors are a form of mutually exclusive pragma. This is what Netscape and Mozilla did to opt in to new features: <script language='javascript1.8'>. The downside here is that versioning is coarse, and doesn’t let you mix and match features. Scoping "use strict" to the function in ES5 was smart, in that it lets us use the lexical scope as a place where the language changes too.

The complexity with "use strict" is that its effects are not purely lexical: functions declared in strict mode behave differently, and if you’re clever, you can observe this from the outside, as a caller. That’s a problem for backward compatibility.

Support multiple sub-languages. With a parser that can combine grammars (Earley’s algorithm and combinator parsers for pure LL languages in particular are good at this, though PEGs are not), someone can elect a different language within a region of the program. Language features can be left as orthogonal layers. How one would express that intent is unexplored, though. Too few people use the tools that would allow this.

Versions may really be the best path forward. Modular software can be composed out of multiple files, so each file can select its own version. With JavaScript in the browser in particular, we’ll have to devise other methods; transport of unparsed script is already complex.

We should separate the parser from the semantics of a language: let there be one, two, even ten versions of the syntax available, and have them all push down to a semantic layer that is more easily versioned (or not versioned at all). This is where Python fell down without needing to. The old cruft could have been maintained and reformulated in terms of the new concepts from Python 3.
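A toy sketch of that separation, in Python: two versions of a surface syntax feed one shared semantic layer. Everything here is hypothetical (parse_v1, parse_v2, the PARSERS table, the tuple-shaped tree); it loosely mirrors the change of print from a statement to a function, and is not how CPython is structured.

```python
import re

def _split_args(text):
    return [arg.strip() for arg in text.split(",") if arg.strip()]

def parse_v1(line):
    """Older surface syntax (toy): `print x, y` as a statement."""
    m = re.fullmatch(r"print\s+(.*)", line.strip())
    if m is None:
        raise SyntaxError(f"not a v1 print statement: {line!r}")
    return ("call", "print", _split_args(m.group(1)))

def parse_v2(line):
    """Newer surface syntax (toy): `print(x, y)` as a function call."""
    m = re.fullmatch(r"print\((.*)\)", line.strip())
    if m is None:
        raise SyntaxError(f"not a v2 print call: {line!r}")
    return ("call", "print", _split_args(m.group(1)))

# Each file or region would select a syntax version; the table is an
# assumption of this sketch.
PARSERS = {1: parse_v1, 2: parse_v2}

def evaluate(node, env, out):
    """The single semantic layer: it never sees which syntax was used."""
    kind, name, args = node
    if kind == "call" and name == "print":
        out.append(" ".join(str(env[a]) for a in args))
    else:
        raise ValueError(f"unknown node: {node!r}")

if __name__ == "__main__":
    env, out = {"x": 1, "y": 2}, []
    evaluate(PARSERS[1]("print x, y"), env, out)
    evaluate(PARSERS[2]("print(x, y)"), env, out)
    print(out)
```

Both front ends produce the same tree, so old programs keep running against the same semantics as new ones.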