You don't need a CS degree to enjoy this stuff or learn more! I do not have a CS degree, or any degree at all. It took me a few years before I built up the courage to write an interpreter or compiler but I was compelled by peers who had built their own (and didn't attend college either).
There are numerous bloggers who dig deeper into the implementation of things and explain them. You just have to start asking "how do I build X from scratch".
This is very much the case! I started tinkering with compilers around 4 years ago, when I was 16. If it was possible for me to do it then, I'm sure it's possible for anyone else to learn too.
When I write articles like these, I think of what I would have liked to have read when I was first getting into the topic at hand.
Are there resources to learn compilers in general? It’s hard to find resources that are
* In text
* Free and accessible on the web
* Not lengthy (not a 1000 page book)
* Pointing to resources to read afterwards.
Any suggestions?
I have read the unfinished http://www.craftinginterpreters.com/, but I would like to know more about compilers (ones that emit machine code).
For the majority of use cases a simple rule suffices: Write a parser which generates LLVM IR and feed the IR into LLVM. You only need to research two topics: parser generators and LLVM IR, and both are pretty simple skills to pick up.
There are a lot of cases where that doesn't work (e.g. writing a functional language, or a JIT, or doing programming language research), but those are edge cases. In those cases you probably want to read one of Appel's books ("Modern Compiler Implementation in ML", for example) or the Dragon Book.
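For a concrete sense of what "generate LLVM IR" looks like in practice, here is a minimal, hedged sketch using LLVM's C++ IRBuilder API to build and print the IR for a two-argument add function. It assumes a reasonably recent LLVM (Function::getArg appeared around LLVM 10); the function name and printing to stdout are just for illustration.

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module mod("demo", ctx);
        llvm::IRBuilder<> builder(ctx);

        // Build the IR for: i32 add(i32 a, i32 b) { return a + b; }
        llvm::Type *i32 = builder.getInt32Ty();
        auto *fnType = llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
        auto *fn = llvm::Function::Create(fnType, llvm::Function::ExternalLinkage, "add", mod);
        auto *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
        builder.SetInsertPoint(entry);
        auto *sum = builder.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
        builder.CreateRet(sum);

        // Dump the textual IR; a real frontend would hand the module to LLVM's
        // optimization and codegen pipelines instead of printing it.
        mod.print(llvm::outs(), nullptr);
        return 0;
    }

A real frontend does the same thing driven by its parser: walk your AST, emit the corresponding IRBuilder calls, and let LLVM's passes and backends handle optimization and machine code.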
These books (Appel's and the Dragon Book) are kind of old. Are they still state of the art, or is there something newer?
I had a compiler class at university, but that was a while ago, so I would like to update my knowledge, because there seems to be a lot of exciting stuff going on with compilers and transpilers.
The basics don't change. There was a rather fun talk on HN two weeks ago https://news.ycombinator.com/item?id=19657983 in which the speaker asserts that only 8 optimizations are needed to get about 80% of best-case performance, and all of those were catalogued in 1971.
While some projects within the compiler toolchain reinvent the wheel to varying degrees for better performance or functionality (e.g. MIR, probably some parsing tools, the gold linker, lldb, etc.), the majority of projects these days take advantage of existing tooling, combining frontends and backends in different ways. Want a JavaScript-to-LLVM compiler? Use an existing JavaScript parser to get an AST, map it to LLVM IR, and have LLVM generate the code. The combinations are infinite and high quality. Most people (myself included) only experiment at the highest level.
All this is to say that the barrier to entry in PL/compiler/interpreter implementation is only as high as you want it to be. You can do the whole thing yourself as a learning experience, or you can use existing tools to finish a fairly full-featured project in a few hours spread over a few weeks.
As others have pointed out, a lot of compilers have basically stayed the same in the past 25 years. What's changed has been some new (generally special-purpose) minor optimizations, and the growth of compilers to consider less traditional domains.
The newer things:
* SLP vectorization. Instead of classic loop vectorization (where you take a loop and convert it to vector code), SLP vectorization tries to form vectors from the straight-line code of a single basic block, and it can work quite well for small SIMD units such as SSE (see the short sketch after this list).
* Polyhedral loop transformation. This is sort of the equivalent of the lattice-based dataflow analysis methodology, which is to say it's a very powerful, and slow, general-purpose technique which actually isn't used all that much in production compilers.
* Decompilation, disassembly, and other forms of static binary analysis have progressed a fair amount in the past decade or so.
* JITs have evolved a lot. In terms of what's missing from the classic compiler books, this is the big missing area. Things such as tracing, recompilation techniques, garbage collection, guard techniques, etc. have come a long way, and I'm not off-hand aware of any actual good book or paper here that covers most of the topic.
* Superoptimization is a technique that's still mostly in the academic phase, but you are seeing peephole optimizations and other information populated by ahead-of-time superoptimization (e.g., Regehr's Souper work).
* Symbolic and concolic execution are also in the transition phase between academic and industry work. The most advanced use is in fuzzing, where concolic execution guides the fuzzer to generate specific inputs so as to test as many paths as possible.
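To make the SLP bullet above concrete, here's a small hedged sketch of the kind of code it targets: four independent scalar operations in one basic block, with no loop for a classic vectorizer to chew on. A compiler with SLP vectorization (Clang/LLVM enables it at its higher optimization levels) may pack these into a single 4-wide SIMD add such as an SSE addps; whether it actually does depends on the target and its cost model.

    // Straight-line code, one basic block, no loop.
    // An SLP vectorizer can notice that these four adds are independent and
    // isomorphic, and pack them into one vector add instead of four scalar ones.
    void add4(float *out, const float *a, const float *b) {
        out[0] = a[0] + b[0];
        out[1] = a[1] + b[1];
        out[2] = a[2] + b[2];
        out[3] = a[3] + b[3];
    }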
Most of these topics wouldn't be covered in something as introductory as the Dragon Book (which itself has poor coverage of the major loop transformations), and generally only come into play in more specific scenarios.
Here, I would disagree with you in part. There are more than two topics that are needed.
First. Design your language, if you are not going to use something that already exists. So you need to understand grammars and the various pitfalls. There are quite a number of tutorials on this subject freely available around the web. (A tiny concrete example is sketched after these points.)
Second. Understand what you want your language to actually mean, this is the subject of semantics. So if you are starting out, make a simple grammar and simple associated semantics. Again there is various material around the traps that you can avail yourself of.
Third. Have a look at the various parser generators out there and what they expect in terms of your language. There are many and each has its pros and cons. Again, there are various tutorials about the subject.
Fourth. If your language is relatively simple and has relatively simple semantics, see what is required for the generation of LLVM IR. If there is any complexity in the language and its semantics, LLVM IR may not necessarily be the way to go. So, back to the first and second points, keep your investigatory language simple in both its syntax (grammar) and its semantics.
Fifth. Have fun. The entire subject can be a delight to learn, but some of the material out there is deadly boring and will quickly turn most people off the subject. It really is fun, so enjoy the time you spend learning about it.
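To put the "keep it simple" advice from the first two points into something concrete, below is a hedged, purely illustrative sketch of about the smallest grammar worth bothering with, parsed by a tiny hand-written recursive-descent parser whose "semantics" are simply to evaluate as it parses. Nothing here comes from a real tool; it's just to show the shape of the problem before you reach for a parser generator or LLVM.

    #include <cctype>
    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Grammar (deliberately tiny, no precedence, strictly left to right):
    //   expr   := number (('+' | '*') number)*
    //   number := digit+
    struct Parser {
        std::string src;
        std::size_t pos = 0;

        void skipSpace() {
            while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
        }

        long number() {
            skipSpace();
            std::size_t start = pos;
            while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]))) ++pos;
            return std::strtol(src.substr(start, pos - start).c_str(), nullptr, 10);
        }

        // The "semantics": evaluate while parsing, instead of building an AST.
        long expr() {
            long value = number();
            for (skipSpace(); pos < src.size(); skipSpace()) {
                char op = src[pos++];   // assumed to be '+' or '*'; no error handling
                long rhs = number();
                value = (op == '+') ? value + rhs : value * rhs;
            }
            return value;
        }
    };

    int main() {
        Parser p{"1 + 2 * 3"};
        std::cout << p.expr() << "\n";  // prints 9: left to right, no precedence
    }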
100% agree. If starting now, I think you should consider targeting MLIR [1]. It fixes many problems you will have with LLVM, and it lets you create language-specific dialects and transformations. Impressive new work by a top-notch team led by LLVM creator Chris Lattner. It has LLVM as an optional target, so you get that for free.
For a parser, you should consider using tree-sitter[2]. Tree-sitter gives you live editor support for free. Impressive work by Max Brunsfeld.
Only about 562 lines for the linker, 645 for the macro assembler, and 204 lines if you want DWARF stubs in your binaries. Oh, and it is cross-platform too.
I've got a work-in-progress series [0] (first three posts are up) on writing a compiler from scratch in JavaScript (no third-party libraries). The first two posts target amd64 assembly, the third post ports to LLVM IR. You may not learn much about the best way to build compilers [i.e. 1) using a parser generator and 2) not emitting text directly], but you'll start with minimal code/functionality and build up from there -- which I always find super useful in learning new things.
There are many series like mine throughout the internet that are generally useful aids when combined with the typical dense, classical compiler textbook. The single exception to "dense textbooks" on compilers is Niklaus Wirth's Compiler Construction [1], where he takes you through writing a compiler from scratch in ~100 pages. This is one of my favorite compiler (and programming?) books, by one of the coolest programming language researchers. But even it may not be super accessible as a first book on compilers.
Compiler architecture is also a good metaphor for understanding what Common Lisp is all about.
Common Lisp = the programmable compiler & runtime all in one. Most CL implementations compile before calling eval. The read–eval–print loop is actually a parse-compile-execute-print loop.
Front end = the Read function does the lexical analysis. CL syntax is almost one-to-one with the AST, so the lexer/parser part is simple. You can reprogram the lexer with reader macros if you want. This is the last place where CL code is text. After the reader is done with it, the code is represented with CL data structures.
Middle end = CL has multiple passes over the AST. First normal macros, then compiler macros operate here. They deal with CL data: lists, symbols, structs, objects, arrays, numbers, etc. CL packages are symbol tables. AST and IR are all CL data structures.
Back end = finally, the 'compile' function compiles the end result into native code. Saving the image and exiting the runtime is like compiling in normal languages.
Compiling will not change READ into PARSE. READ is still there.
> Most CL implementations compile before calling eval
Many, or even a lot, but I'm not sure it's actually most.
There are quite a lot that use an interpreter in the REPL but let you compile on demand. SBCL compiles by default; LispWorks, for example, does not.
> After the reader is done with it, the code is represented with CL data structures.
Right, but the data structures carry no syntactic information. If you write (1 2 +) the reader accepts it and does not know that this is invalid syntax.
> AST and IR are all CL data structures.
True, but not necessarily like s-expressions...
See, for example, the data structures the SBCL compiler uses:
The funny thing about compilers is that one architecture has been established as the dominant architecture (the one described here) for probably the past 50 years. Almost without exception, every compiler is structured just about exactly the same. If you're interested in an alternative architecture, it's worth reading up about Nanopass compilers. While largely the same basic ideas, Nanopass tries to distill each discrete action into its own pass -- think of it like composing a bunch of programs using shell pipes. So rather than just Frontend=>Middle End=>Backend you end up with something like Lex=>Parse=>Remove Lambdas=>Remove Loops=>Remove Conditionals=>SSA=>Allocate Registers=>Eliminate Dead Code=>.....
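For flavor, here's a hedged sketch of what that pipe-like structure can look like. The pass names and the single IR type are purely illustrative; the real Nanopass framework (in Scheme/Racket) actually defines a slightly different intermediate language for each pass, which is part of its point.

    #include <functional>
    #include <vector>

    // Stand-in for the compiler's intermediate representation.
    // In a genuine nanopass compiler each pass maps between *different* small
    // IR dialects; one type is used here only to keep the sketch short.
    struct IR { /* ... */ };

    using Pass = std::function<IR(IR)>;

    // The whole compiler becomes a left-to-right composition of tiny passes,
    // each of which does exactly one thing.
    IR runPipeline(IR program, const std::vector<Pass> &passes) {
        for (const auto &pass : passes)
            program = pass(program);
        return program;
    }

    // Usage sketch (all names hypothetical):
    //   IR out = runPipeline(parse(source),
    //                        {removeLambdas, removeLoops, toSSA,
    //                         allocateRegisters, eliminateDeadCode});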
When I worked on GPU compiler backends the middle and backend passes basically were a collection of more basic passes that did one thing, in sequence. I wouldn't call this an alternative architecture. I think it's just an implementation detail of the dominant architecture.
There is either nothing to learn here, or a spectacular insight into compilation, depending on where you are in your career.
Compilers are fun. It’s an invaluable experience having written one. There is no magic left in a computer once you’ve written a compiler.
If you have written a compiler, this article is amusing. You get to see a programmer's eyes open.
If you have not written a compiler, you get some insight about how they work- but perhaps you get more insight about how you work.
Also, the author should constant fold his arithmetic example. It costs a little compile time, but makes the generated program faster.
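For anyone who hasn't seen it: constant folding just means evaluating, at compile time, the parts of the expression tree whose operands are already known. A minimal hedged sketch over a made-up two-node AST (literal and addition) might look like this:

    #include <memory>

    // Tiny made-up expression AST: an integer literal or a binary '+' node.
    struct Expr {
        enum class Kind { Lit, Add } kind = Kind::Lit;
        long value = 0;                  // used when kind == Lit
        std::unique_ptr<Expr> lhs, rhs;  // used when kind == Add
    };

    // Replace `lit + lit` subtrees with a single literal; leave everything else alone.
    std::unique_ptr<Expr> fold(std::unique_ptr<Expr> e) {
        if (e->kind == Expr::Kind::Add) {
            e->lhs = fold(std::move(e->lhs));
            e->rhs = fold(std::move(e->rhs));
            if (e->lhs->kind == Expr::Kind::Lit && e->rhs->kind == Expr::Kind::Lit) {
                auto lit = std::make_unique<Expr>();
                lit->value = e->lhs->value + e->rhs->value;  // e.g. 2 + 3 becomes the literal 5
                return lit;
            }
        }
        return e;
    }

The compiler pays for the fold once; every run of the generated program then skips that arithmetic.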