I’m interested in:
Data structures for representing programs
Algorithms for translating between those representations
Analysis & manipulations of each representation; questions which are natural to ask/answer in each
Most books I can find on compilers/PLs spend the bulk of their time on the text representation (and algorithms for translating programs out of text, i.e. parsing) and the machine-code representation (and algorithms for translating programs into machine code). For the purposes of this question, I’m not particularly interested in either of these representations: they’re not very natural data structures for representing programs, and we mostly use them because we have to.
I’m interested mainly in high-level “intermediate” representations. In terms of applications, I’m more interested in reasoning about the code (e.g. control-flow analysis, type checking) and manipulating the code (e.g. automated parallelization, high-level machine-independent optimization) than in translating between any particular source/target formats.
Questions:
Does anybody know of a good book (or other source) that focuses on high-level “intermediate” representations of programs?
Is there some other question I should be asking, e.g. a different term to search for?
On the meta-level, where else should I look/ask this question?
A good selection of topics on static analysis is in
F Nielson, HR Nielson, C Hankin. Principles of Program Analysis
Some prerequisites for it and other relevant things can be picked up from
K Cooper, L Torczon. Engineering a Compiler
HR Nielson, F Nielson. Semantics with Applications: An Appetizer
The Nanopass Framework is built for that:
“The nanopass framework provides a tool for writing compilers composed of several simple passes that operate over well-defined intermediate languages. The goal of this organization is both to simplify the understanding of each pass, because it is responsible for a single task, and to simplify the addition of new passes anywhere in the compiler.”
https://nanopass.org/
https://docs.racket-lang.org/nanopass/index.html
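The framework itself lives in Scheme/Racket, but the core idea carries over to any language with decent support for tree types. Here is a rough sketch of the flavor in OCaml; this is my own toy example with invented mini-languages, not nanopass’s actual API:

```ocaml
(* Sketch of the nanopass idea: each pass maps one small, well-defined
   intermediate language to the next.  All names here are invented. *)

(* L0: source-level expressions with named `let` bindings *)
type l0 =
  | Int of int
  | Var of string
  | Let of string * l0 * l0
  | Add of l0 * l0

(* L1: the same language, but names resolved to de Bruijn indices *)
type l1 =
  | Int1 of int
  | Slot of int              (* index of the enclosing binder *)
  | Let1 of l1 * l1          (* the binder is now anonymous *)
  | Add1 of l1 * l1

(* Pass 1: name resolution, L0 -> L1.  One pass, one job. *)
let rec resolve (env : string list) : l0 -> l1 = function
  | Int n -> Int1 n
  | Var x ->
      let rec index i = function
        | [] -> failwith ("unbound variable: " ^ x)
        | y :: _ when y = x -> i
        | _ :: rest -> index (i + 1) rest
      in
      Slot (index 0 env)
  | Let (x, bound, body) -> Let1 (resolve env bound, resolve (x :: env) body)
  | Add (a, b) -> Add1 (resolve env a, resolve env b)

(* Pass 2: constant folding, L1 -> L1.  Again a single, tiny concern. *)
let rec fold : l1 -> l1 = function
  | Add1 (a, b) ->
      (match fold a, fold b with
       | Int1 x, Int1 y -> Int1 (x + y)
       | a', b' -> Add1 (a', b'))
  | Let1 (bound, body) -> Let1 (fold bound, fold body)
  | leaf -> leaf
```

The point of the style is that each intermediate language is a real type, so the compiler (here, OCaml’s exhaustiveness checker) tells you when a pass forgets a case.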
Not a very pointed answer, but a collection of leads:
There are good reasons for the time spent on parsing and machine-code generation: they are more difficult than the parts that go in the middle, which are “merely” software engineering, although of an unusual kind.
There is also a dearth of resources on that middle part, and partly because of that, it is actually fairly hard to get into.
One reason for the dearth is that the basics are quite simple: generate a tree as the output of parsing, then transform that tree. Generate derivative trees and graphs from those trees to perform particular analyses.
Phrased like that, it seems that knowledge of how to work with trees and graphs is going to serve you well, and that is indeed correct.
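To make that concrete, here is a minimal sketch in OCaml: a toy statement AST, plus one such “derivative graph”, a control-flow graph computed from the tree. The statement language and the labelling scheme are invented for illustration:

```ocaml
(* A tiny statement AST and a control-flow graph derived from it. *)

type stmt =
  | Assign of string * string        (* x := e, expression kept abstract *)
  | Seq of stmt * stmt
  | If of string * stmt * stmt       (* condition kept abstract too *)
  | While of string * stmt

(* Fresh integer labels for CFG nodes. *)
let fresh =
  let n = ref 0 in
  fun () -> incr n; !n

(* build s = (entry label, exit labels, edges) of the CFG for s. *)
let rec build : stmt -> int * int list * (int * int) list = function
  | Assign _ ->
      let l = fresh () in
      (l, [l], [])
  | Seq (s1, s2) ->
      let e1, x1, g1 = build s1 in
      let e2, x2, g2 = build s2 in
      (e1, x2, g1 @ g2 @ List.map (fun x -> (x, e2)) x1)
  | If (_, s1, s2) ->
      let l = fresh () in
      let e1, x1, g1 = build s1 in
      let e2, x2, g2 = build s2 in
      (l, x1 @ x2, ((l, e1) :: (l, e2) :: g1) @ g2)
  | While (_, body) ->
      let l = fresh () in
      let e, xs, g = build body in
      (l, [l], ((l, e) :: List.map (fun x -> (x, l)) xs) @ g)
```

Once you have the edge list, classic graph questions (reachability, dominators, loop detection) become exactly the kinds of analyses the question asks about.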
A good read (though with a very narrow focus) is the discussion of the syntax tree architecture in Roslyn. The Roslyn whitepaper is also quite interesting, though more oriented towards exposing compiler features to users.
Personally, I did some research on trying to implement name resolution (relating an identifier use to its declaration site) and typing as a reactive framework: you would define the typing rules for your language by writing inference rules, e.g. once you know the type of node A and the type of node B, you can derive the type of node C. The reactive part was then simply to find the applicable inference rules and run them.
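To give a flavor of what I mean, here is a crude sketch in OCaml; the toy type system and all the names are invented, and this shows the idea rather than what the actual prototype looked like:

```ocaml
(* "Typing as a reactive framework": inference rules fire whenever the
   types of their premises become known; we just run them to a fixpoint. *)

type ty = TInt | TBool

(* Each AST node carries an integer id so derived facts can refer to it. *)
type expr =
  | Lit of int * int               (* id, literal value *)
  | Add of int * expr * expr       (* id, operands *)
  | IsZero of int * expr           (* id, argument *)

let id = function Lit (i, _) | Add (i, _, _) | IsZero (i, _) -> i

let rec nodes e =
  e :: (match e with
        | Lit _ -> []
        | Add (_, a, b) -> nodes a @ nodes b
        | IsZero (_, a) -> nodes a)

(* One reactive step: find any untyped node whose rule can now fire. *)
let step known e =
  let typ n = List.assoc_opt (id n) known in
  List.find_map
    (fun n ->
      if typ n <> None then None                     (* already derived *)
      else
        match n with
        | Lit (i, _) -> Some (i, TInt)               (* axiom *)
        | Add (i, a, b) ->                           (* int + int : int *)
            (match typ a, typ b with
             | Some TInt, Some TInt -> Some (i, TInt)
             | _ -> None)
        | IsZero (i, a) ->                           (* iszero int : bool *)
            (match typ a with
             | Some TInt -> Some (i, TBool)
             | _ -> None))
    (nodes e)

(* Fire rules until nothing new can be derived. *)
let rec infer known e =
  match step known e with
  | Some fact -> infer (fact :: known) e
  | None -> known

(* e.g. infer [] (IsZero (2, Add (1, Lit (0, 1), Lit (3, 4))))
   derives a type for all four nodes. *)
```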
The project didn’t really pan out. In practice, the logic ends up looking quite obfuscated, and it’s just easier to write boring old non-modular code where the logic is readily apparent.
(Incidentally, fighting against this “it’s easier to just code it manually” effect — but in parsing — is what my PhD thesis is about.)
I might advise you to look at research done on the Spoofax language workbench. Spoofax includes declarative languages to specify name binding, typing, semantics and more. These languages do not offer enormous flexibility, but they cover the most common language idioms. Since those idioms were codified in a formal system (the declarative languages), it might tell you something about the structure of the underlying problem (… which is not really about something quite as simple as data structure selection, but there you have it).
I’d also like to point out that I have seen very convincing arguments against dismissing the text representation. One argument in particular was that while the data structures used to represent programs tend to change (for engineering reasons, to support new features, …), the text representation stays constant. That argument was made in the context of a macro system, I believe (defending the use of quasiquotes).
Regarding machine code, it would be immensely useful even if we didn’t need to run code on CPUs. Look at virtual machines: they work with bytecode. A list of sequential instructions is just the logical extreme of the idea of translating high-level constructs into a smaller set of lower-level primitives that are easier to deal with.
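For instance, flattening an expression tree into sequential instructions for a stack machine takes only a post-order walk. A toy sketch in OCaml (the instruction set is invented, but the shape is the same in spirit as JVM or CPython bytecode):

```ocaml
(* Flattening an expression tree into sequential stack-machine code. *)

type expr = Const of int | Add of expr * expr | Mul of expr * expr

type instr = Push of int | IAdd | IMul

(* Post-order walk: operands first, then the operator. *)
let rec compile = function
  | Const n -> [Push n]
  | Add (a, b) -> compile a @ compile b @ [IAdd]
  | Mul (a, b) -> compile a @ compile b @ [IMul]

(* A stack machine executes the flat instruction list directly. *)
let run prog =
  let step stack = function
    | Push n -> n :: stack
    | IAdd -> (match stack with b :: a :: r -> (a + b) :: r
                              | _ -> failwith "stack underflow")
    | IMul -> (match stack with b :: a :: r -> (a * b) :: r
                              | _ -> failwith "stack underflow")
  in
  match List.fold_left step [] prog with
  | [result] -> result
  | _ -> failwith "malformed program"

(* (1 + 2) * 4 compiles to [Push 1; Push 2; IAdd; Push 4; IMul] *)
let _ = run (compile (Mul (Add (Const 1, Const 2), Const 4)))  (* = 12 *)
```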
For academic literature on the topic, I would look at the proceedings of the GPCE (Generative Programming: Concepts & Experiences) and SLE (Software Language Engineering) conferences.
I think there exist some program transformation frameworks out there, and you might also learn something from them, though in my experience they’re quite byzantine. One such framework is Rascal MPL (meta-programming language). Another is Stratego (part of Spoofax); I read some papers on it a while ago that were quite palatable.
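The core abstraction in systems like Stratego is a rewrite rule plus a “strategy” that decides where and in what order to apply rules. Very roughly, the flavor looks like this; an OCaml sketch with invented rules, not Stratego’s actual syntax:

```ocaml
(* Rewrite-rule flavor: a rule is a partial function on terms, and a
   strategy applies rules throughout a term. *)

type term = Num of int | Var of string
          | Add of term * term | Mul of term * term

type rule = term -> term option

(* Example rules:  x + 0 -> x   and   x * 1 -> x *)
let add_zero : rule = function Add (t, Num 0) -> Some t | _ -> None
let mul_one  : rule = function Mul (t, Num 1) -> Some t | _ -> None

(* Strategy: apply a rule bottom-up throughout the term. *)
let rec bottomup (r : rule) (t : term) : term =
  let t' = match t with
    | Add (a, b) -> Add (bottomup r a, bottomup r b)
    | Mul (a, b) -> Mul (bottomup r a, bottomup r b)
    | leaf -> leaf
  in
  match r t' with Some u -> u | None -> t'

(* Combine rules by left-biased choice (Stratego spells this <+). *)
let ( <+ ) (r1 : rule) (r2 : rule) : rule =
  fun t -> match r1 t with Some u -> Some u | None -> r2 t

let simplify = bottomup (add_zero <+ mul_one)
(* simplify (Mul (Add (Var "x", Num 0), Num 1)) = Var "x" *)
```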
So anyway, here goes. Hope it helps. You can contact me if you need more info!