Thanks to Longview Labs for sponsoring this article. Check out their work at arweavehub.com/weekly.
Disclaimer: I also collaborate with Longview Labs as a part-time contractor.
Programming languages are complex beasts, yet today they’re probably easier to use than they’ve ever been. There are dynamic scripting languages, garbage collection, and plenty of quality-of-life features which make building software a lot easier than it once was.
The art of building programming languages themselves is something I’ve been particularly interested in recently, and so I’ve been going down the rabbit hole and starting to build my own.
It’s shattered a bunch of misconceptions about what I thought I knew about programming languages, and has made me appreciate the abstractions offered to us much more.
One of the biggest things I’ve come across is that the idea of “compiled vs interpreted” languages is a slightly misleading one. I want to explore that a little more, give a high-level overview of compilers and interpreters, and show why these two terms make for a somewhat strange comparison.
Compilers
At university or online you may have been taught that programming languages fall into one of two types: compiled or interpreted.
In reality, it’s not always clear cut. A lot of programming languages mix the two rather than being firmly placed in one camp or the other.
To give a definition close to the widely regarded Dragon Book, a compiler takes some source code (i.e. your program), and converts it to a target language. Usually, what people mean when they refer to a “compiled language” is one where the target language is machine code. Your OS then loads this executable, and voila, you have a running program.
This doesn’t necessarily have to be the case, as “target language” does not specify machine code. But for the moment, let’s take it at face value. Most people mean languages like C or C++, and not languages like Python.
Interpreters
Interpreted languages similarly read in some source code and produce a target language, but the difference is in execution. The target language typically isn’t machine code, but rather some kind of intermediate form referred to as bytecode.
Interpreted languages have (surprise surprise) an interpreter, which is just a program which can read in this bytecode and execute it. This is the difference between the “traditional” definition of a compiled language and an interpreted language.
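To make that concrete, here’s a minimal sketch of such an interpreter: a stack-based virtual machine in Python. The instruction set (PUSH, ADD, MUL, PRINT) is entirely made up for illustration; real VMs like CPython’s have far more opcodes and machinery.

```python
# A minimal sketch of a stack-based bytecode interpreter. The
# instructions here are hypothetical, invented for illustration.

def run(bytecode):
    stack = []
    for op, *args in bytecode:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            print(stack.pop())

# 1 + 2 * 3, already "compiled" to our toy bytecode:
run([("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("MUL",), ("ADD",), ("PRINT",)])  # 7
```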
The truth of the matter is that an interpreter, really, is some kind of compiler by the given definition: it takes in some source code, and outputs a target language (bytecode). It just isn’t the machine code executable that people expect.
Let’s take a look at building programming languages, and how this leads to the exact implementation strategy of interpreted/compiled.
A whistle-stop tour of source code transformation
No matter how the language is eventually implemented, every language designer today starts the same way.
For a human to write your language’s source code, you need to do a couple of things.
Firstly, you have to decide on some syntax and features: integers, operators, and keywords for example. Secondly, you need to work out what the rules of your language are, and can therefore decide what is and isn’t a valid program.
So, how do you go from your arbitrary language rules to a feature-complete language?
Source code needs to undergo at least some transformation to be understood by an interpreter or a compiler.
The first step is building a tokenizer, which takes in some source code of your language as a string, pulls out all of the relevant tokens (i.e. integers, operators, keywords), and ignores the irrelevant parts, such as comments.
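To make this less abstract, here’s a rough sketch of a tokenizer in Python, assuming a toy language consisting of nothing but integers, basic arithmetic operators, and parentheses:

```python
# A minimal tokenizer sketch for a toy arithmetic language: integers,
# + - * / operators, and parentheses. Whitespace is skipped, just as a
# real tokenizer would skip comments.

def tokenize(source):
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("INT", int(source[start:i])))
        elif ch in "+-*/()":
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character: {ch!r}")
    return tokens

print(tokenize("1 + 2 * (4 - 2)"))
# [('INT', 1), ('+', '+'), ('INT', 2), ('*', '*'), ('(', '('), ...]
```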
The next part is to turn these tokens into something which actually has some kind of hierarchical structure, so you can work out how 1 + 2 * (4 - 2) should be evaluated, for example. This process is called parsing, and usually represents the code as a kind of tree called an Abstract Syntax Tree (AST).
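Sticking with the toy language, here’s a sketch of one common parsing strategy, recursive descent, which builds that tree while handling operator precedence (it reuses the tokenize sketch from above):

```python
# A recursive descent parser sketch for the same toy language. It turns
# the flat token list into a nested tuple AST, giving * and / higher
# precedence than + and - so that 1 + 2 * (4 - 2) groups correctly.

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                        # expr := term (("+" | "-") term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()[0]
            node = (op, node, term())
        return node

    def term():                        # term := factor (("*" | "/") factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()[0]
            node = (op, node, factor())
        return node

    def factor():                      # factor := INT | "(" expr ")"
        kind, value = advance()
        if kind == "INT":
            return value
        if kind == "(":
            node = expr()
            advance()                  # consume ")"
            return node
        raise SyntaxError(f"unexpected token: {kind}")

    return expr()

print(parse(tokenize("1 + 2 * (4 - 2)")))
# ('+', 1, ('*', 2, ('-', 4, 2)))
```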
This AST is eventually evaluated to see if a program is valid, and whether or not it can be executed. This is actually what some of the earliest interpreters did: they were essentially programs which read in source code, generated an AST, and decided what the output of a program was.
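Evaluating the toy AST is then just a small recursive walk over the tree, which is essentially all those early interpreters were (continuing the sketches above):

```python
# A tree-walking evaluator sketch: recurse over the AST and compute
# the result directly, with no bytecode step in between.

def evaluate(node):
    if isinstance(node, int):          # leaf: a literal integer
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == "+": return a + b
    if op == "-": return a - b
    if op == "*": return a * b
    if op == "/": return a / b

print(evaluate(parse(tokenize("1 + 2 * (4 - 2)"))))  # 5
```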
This AST is essentially a very simple form of intermediate representation (IR). It’s somewhere between source code and machine code. Most modern interpreters take a slightly different approach, but the pipeline starts the same way.
Instead of doing source code → tokens → AST, they do source code → tokens → AST → bytecode. Then the bytecode is interpreted, as you’d expect.
So there’s a little bit of a separation here: when a programming language like Python runs your program, it first converts it (dare I say, compiles it) to bytecode, which is read and executed by the Python virtual machine (which interprets it).
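You don’t have to take my word for it: CPython ships a dis module in the standard library which shows you the bytecode its compiler produces (the exact opcodes vary between Python versions):

```python
import dis

def add(a, b):
    return a + b

# Disassemble the bytecode CPython compiled this function to.
dis.dis(add)
#   LOAD_FAST     a
#   LOAD_FAST     b
#   BINARY_OP     + (BINARY_ADD on older versions)
#   RETURN_VALUE
```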
There are reasons to prefer a VM and IR over directly evaluating an AST, but I won’t cover them here.
Just-In-Time (JIT) compilation
Some programming languages blur the line between “compiled” and “interpreted” even further with a feature called Just-In-Time (JIT) compilation.
JIT compilation essentially detects when pieces of bytecode are being run frequently, and compiles them to machine code for faster execution. A lot of languages implement this mixed approach, such as Java and Lua.
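As a very rough sketch of the idea (the names here are hypothetical, and a real JIT emits actual machine code rather than swapping one Python callable for another):

```python
# A toy sketch of the tiering idea behind a JIT: count how often some
# code runs, and past a threshold, swap the slow interpreted path for
# a faster compiled one. All names here are made up for illustration.

HOT_THRESHOLD = 1000

class TieredFunction:
    def __init__(self, interpret, compile_to_native):
        self.interpret = interpret                  # slow path: walk the bytecode
        self.compile_to_native = compile_to_native  # produces a fast version
        self.native = None
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        if self.native is None and self.calls >= HOT_THRESHOLD:
            self.native = self.compile_to_native()  # "JIT" the hot code
        return (self.native or self.interpret)(*args)
```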
The long and short of what I’m trying to say is that most interpreted languages typically use some kind of compiling step, although this step converts source code to bytecode or IR, not machine code. JIT compilation creates legitimate machine code, and some interpreted languages make use of JIT, too.
If I haven’t managed to lose you yet then congratulations, and also sorry in advance because things are probably going to get more confusing from here.
Rolling your own compiler, and LLVM
If you want to build a language which isn’t interpreted, then you need a way for it to (eventually) compile down to machine code.
The long-standing heavyweight champion in this arena is the C programming language. You can write a program in C, compile it with gcc, and it will build the relevant assembly code for your device (and convert this to machine code).
The exact assembly it produces (and therefore the final machine code) is dependent on which computer architecture you’re using. This is one reason that rolling your own compiler is especially tricky. If you want to build a compiler and have your language work on x86 (like most computers), you need to write the code to convert your source code into x86 assembly.
If someone’s using an ARM CPU, such as on the new Apple M-series chips, or mobile devices, you’ll need to write another compiler for that. Or what if you want to support 32-bit computers? Or embedded devices? You can see how this can become a monumental task very quickly, which is one of the many reasons why C is so ubiquitous.
The alternative we’re fortunate to have in this day and age is LLVM, which provides its own intermediate representation (LLVM IR). Instead of having to write a compiler to convert your source code for each system, you write a single compiler which produces LLVM IR instead of raw assembly. LLVM then handles the conversion of its IR into all of the relevant machine architectures and targets.
This follows the same initial process as before: source code to AST, and then AST to something else. In this case, it’s AST to LLVM IR.
This lets you avoid the painstakingly time-consuming process of doing all of that yourself, and instead you can do the painstakingly time-consuming task of actually designing and building your programming language.
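If you’re curious what that AST → LLVM IR step can look like, here’s a sketch using llvmlite, one of the Python bindings for LLVM, to emit IR for a function that adds two 32-bit integers:

```python
# Emitting LLVM IR from Python with llvmlite (pip install llvmlite).
# This builds IR for a function equivalent to: int add(int a, int b).
from llvmlite import ir

module = ir.Module(name="demo")
i32 = ir.IntType(32)
fnty = ir.FunctionType(i32, [i32, i32])
func = ir.Function(module, fnty, name="add")

block = func.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
a, b = func.args
builder.ret(builder.add(a, b, name="sum"))

print(module)  # textual LLVM IR, which LLVM can lower to any supported target
```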
That being said, some languages do try to move away from LLVM once they have the manpower or facilities to do so (e.g. Zig), for various reasons such as faster compile times.
The beauty of abstractions
I hope when you read this, you come away with the same feeling that I have: we’re truly standing on the shoulders of giants, and it’s incredible what’s been made possible with computers.
There are rabbit holes I haven’t even gone into in this post, for example the process of bootstrapping (using an earlier version of a programming language to write the next version), but hopefully this tale of compilers and interpreters has been enough of a journey for now.
Thanks for reading.
If you want to read more posts on functional programming, PL development, and self-learning, feel free to subscribe.