And as imagination bodies forth The forms of things unknown, the poet’s pen Turns them to shapes and gives to airy nothing A local habitation and a name.
—— William Shakespeare, A Midsummer Night’s Dream
随着想象力的不断涌现
未知事物的形式,诗人的笔
把它们变成形状,变成虚无
当地的居住地和名字。
(威廉·莎士比亚《仲夏夜之梦》)
The last chapter introduced variables to clox, but only of the global variety. In this chapter, we’ll extend that to support blocks, block scope, and local variables. In jlox, we managed to pack all of that and globals into one chapter. For clox, that’s two chapters worth of work partially because, frankly, everything takes more effort in C.
上一章介绍了clox中的变量,但是只介绍了全局变量。在本章中,我们将进一步支持块、块作用域和局部变量。在jlox中,我们设法将所有这些内容和全局变量打包成一章。对于clox来说,这需要两章的工作量,坦率的说,部分原因是在C语言中一切都要花费更多的精力。
But an even more important reason is that our approach to local variables will be quite different from how we implemented globals. Global variables are late bound in Lox. “Late” in this context means “resolved after compile time”. That’s good for keeping the compiler simple, but not great for performance. Local variables are one of the most-used parts of a language. If locals are slow, everything is slow. So we want a strategy for local variables that’s as efficient as possible.
但更重要的原因是,我们处理局部变量的方法与我们实现全局变量的方法截然不同。全局变量在Lox中是后期绑定的。这里的“后期”是指“在编译后分析”。这有利于保持编译器的简单性,但不利于性能。局部变量是语言中最常用的部分之一。如果局部变量很慢,那么一切都是缓慢的。因此,对于局部变量,我们希望采取尽可能高效的策略1。
Fortunately, lexical scoping is here to help us. As the name implies, lexical scope means we can resolve a local variable just by looking at the text of the program—locals are not late bound. Any processing work we do in the compiler is work we don’t have to do at runtime, so our implementation of local variables will lean heavily on the compiler.
幸运的是,词法作用域可以帮助我们。顾名思义,词法作用域意味着我们可以通过查看程序文本来解析局部变量——局部变量不是后期绑定的。我们在编译器中所做的任何处理工作都不必在运行时完成,因此局部变量的实现将在很大程度上依赖于编译器。
The nice thing about hacking on a programming language in modern times is there’s a long lineage of other languages to learn from. So how do C and Java manage their local variables? Why, on the stack, of course! They typically use the native stack mechanisms supported by the chip and OS. That’s a little too low level for us, but inside the virtual world of clox, we have our own stack we can use.
在现代,实现一门编程语言的好处是,可以参考已经发展了很长时间的其它语言。那么,C和Java是如何管理它们的局部变量的呢?当然是在堆栈上!它们通常使用芯片和操作系统支持的本地堆栈机制。这对我们来说有点太底层了,但是在clox的虚拟世界中,我们有自己的堆栈可以使用。
Right now, we only use it for holding on to temporaries—short-lived blobs of data that we need to remember while computing an expression. As long as we don’t get in the way of those, we can stuff our local variables onto the stack too. This is great for performance. Allocating space for a new local requires only incrementing the
stackTop
pointer, and freeing is likewise a decrement. Accessing a variable from a known stack slot is an indexed array lookup.
现在,我们只使用它来保存临时变量——我们在计算表达式时需要记住的短期数据块。只要我们不妨碍这些数据,我们也可以把局部变量塞到栈中。这对性能很有帮助。为一个新的局部变量分配空间只需要递增stackTop
指针,而释放也同样是递减的过程。从已知的栈槽访问变量是一种索引数组的查询。
We do need to be careful, though. The VM expects the stack to behave like, well, a stack. We have to be OK with allocating new locals only on the top of the stack, and we have to accept that we can discard a local only when nothing is above it on the stack. Also, we need to make sure temporaries don’t interfere.
不过,我们确实需要小心。虚拟机希望栈的行为就像,嗯,一个栈。我们必须接受只能在栈顶分配新的局部变量,而且我们必须接受只有局部变量上方的栈槽没有数据时,才能丢弃该变量。此外,我们还需要保证临时变量不受干扰。
Conveniently, the design of Lox is in harmony with these constraints. New locals are always created by declaration statements. Statements don’t nest inside expressions, so there are never any temporaries on the stack when a statement begins executing. Blocks are strictly nested. When a block ends, it always takes the innermost, most recently declared locals with it. Since those are also the locals that came into scope last, they should be on top of the stack where we need them.
方便的是,Lox的设计与这些约束条件是一致的2。新的局部变量总是通过声明语句创建的。语句不会嵌套在表达式内,所以当一个语句开始执行时,栈中没有任何临时变量。代码块是严格嵌套的。当一个块结束时,它总会带走最内部、最近声明的局部变量。因为这些也是最后进入作用域的局部变量,所以它们应该位于栈顶(我们期望它所在的位置)。
Step through this example program and watch how the local variables come in and go out of scope:
逐步执行这段示例代码,查看局部变量是如何进入和离开作用域的:
See how they fit a stack perfectly? It seems that the stack will work for storing locals at runtime. But we can go further than that. Not only do we know that they will be on the stack, but we can even pin down precisely where they will be on the stack. Since the compiler knows exactly which local variables are in scope at any point in time, it can effectively simulate the stack during compilation and note where in the stack each variable lives.
看到它们如何完美地适应堆栈了吗?看来,栈可以在运行时存储局部变量。但是我们可以更进一步。我们不仅知道它们会在栈上,而且我们甚至可以确定它们在栈上的精确位置。因为编译器确切地知道任何时间点上有哪些局部变量在作用域中,因此它可以在编译过程中有效地模拟堆栈,并注意每个变量在栈中的位置。
We’ll take advantage of this by using these stack offsets as operands for the bytecode instructions that read and store local variables. This makes working with locals deliciously fast—as simple as indexing into an array.
我们将利用这一点,对于读取和存储局部变量的字节码指令,把这些栈偏移量作为其操作数。这使得局部变量非常快——就像索引数组一样简单3。
There’s a lot of state we need to track in the compiler to make this whole thing go, so let’s get started there. In jlox, we used a linked chain of “environment” HashMaps to track which local variables were currently in scope. That’s sort of the classic, schoolbook way of representing lexical scope. For clox, as usual, we’re going a little closer to the metal. All of the state lives in a new struct.
我们需要在编译器中跟踪大量状态,以使整个程序运行起来,让我们就从那里开始。在jlox中,我们使用“环境”HashMap链来跟踪当前在作用域中的局部变量。这是一种经典的、教科书式的词法作用域表示方式。对于clox,像往常一样,我们更接近于硬件。所有的状态都保存了一个新的结构体中。
compiler.c,在结构体ParseRule后添加代码:
} ParseRule;
// 新增部分开始
typedef struct {
Local locals[UINT8_COUNT];
int localCount;
int scopeDepth;
} Compiler;
// 新增部分结束
Parser parser;
We have a simple, flat array of all locals that are in scope during each point in the compilation process. They are ordered in the array in the order that their declarations appear in the code. Since the instruction operand we’ll use to encode a local is a single byte, our VM has a hard limit on the number of locals that can be in scope at once. That means we can also give the locals array a fixed size.
我们有一个简单、扁平的数组,其中包含了编译过程中每个时间点上处于作用域内的所有局部变量4。它们在数组中的顺序与它们的声明在代码中出现的顺序相同。由于我们用来编码局部变量的指令操作数是一个字节,所以我们的虚拟机对同时处于作用域内的局部变量的数量有一个硬性限制。这意味着我们也可以给局部变量数组一个固定的大小。
common.h,添加代码:
#define DEBUG_TRACE_EXECUTION
// 新增部分开始
#define UINT8_COUNT (UINT8_MAX + 1)
// 新增部分结束
#endif
Back in the Compiler struct, the
localCount
field tracks how many locals are in scope—how many of those array slots are in use. We also track the “scope depth”. This is the number of blocks surrounding the current bit of code we’re compiling.
回到Compiler结构体中,localCount
字段记录了作用域中有多少局部变量——有多少个数组槽在使用。我们还会跟踪“作用域深度”。这指的是我们正在编译的当前代码外围的代码块数量。
Our Java interpreter used a chain of maps to keep each block’s variables separate from other blocks’. This time, we’ll simply number variables with the level of nesting where they appear. Zero is the global scope, one is the first top-level block, two is inside that, you get the idea. We use this to track which block each local belongs to so that we know which locals to discard when a block ends.
我们的Java解释器使用了一个map链将每个块的变量与其它块分开。这一次,我们根据变量出现的嵌套级别对其进行编号。0是全局作用域,1是第一个顶层块,2是它内部的块,你懂的。我们用它来跟踪每个局部变量属于哪个块,这样当一个块结束时,我们就知道该删除哪些局部变量。
Each local in the array is one of these:
数组中的每个局部变量都是这样的:
compiler.c,在结构体ParseRule后添加代码:
} ParseRule;
// 新增部分开始
typedef struct {
Token name;
int depth;
} Local;
// 新增部分结束
typedef struct {
We store the name of the variable. When we’re resolving an identifier, we compare the identifier’s lexeme with each local’s name to find a match. It’s pretty hard to resolve a variable if you don’t know its name. The
depth
field records the scope depth of the block where the local variable was declared. That’s all the state we need for now.
我们存储变量的名称。当我们解析一个标识符时,会将标识符的词素与每个局部变量名称进行比较,以找到一个匹配项。如果你不知道变量的名称,就很难解析它。depth
字段记录了声明局部变量的代码块的作用域深度。这就是我们现在需要的所有状态。
This is a very different representation from what we had in jlox, but it still lets us answer all of the same questions our compiler needs to ask of the lexical environment. The next step is figuring out how the compiler gets at this state. If we were principled engineers, we’d give each function in the front end a parameter that accepts a pointer to a Compiler. We’d create a Compiler at the beginning and carefully thread it through each function call . . . but that would mean a lot of boring changes to the code we already wrote, so here’s a global variable instead:
这与我们在jlox中使用的表示方式非常不同,但我们用它仍然可以回答编译器需要向词法环境提出的所有相同的问题。下一步是弄清楚编译器如何获取这个状态。如果我们是有原则的工程师,我们应该给前端的每个函数添加一个参数,接受一个指向Compiler的指针。我们在一开始就创建一个Compiler,并小心地在将它贯穿于每个函数的调用中……但这意味着要对我们已经写好的代码进行大量无聊的修改,所以这里用一个全局变量代替5:
compiler.c,在变量parser后添加代码:
Parser parser;
// 新增部分开始
Compiler* current = NULL;
// 新增部分结束
Chunk* compilingChunk;
Here’s a little function to initialize the compiler:
下面是一个用于初始化编译器的小函数:
compiler.c,在emitConstant()方法后添加代码:
static void initCompiler(Compiler* compiler) {
compiler->localCount = 0;
compiler->scopeDepth = 0;
current = compiler;
}
When we first start up the VM, we call it to get everything into a clean state.
当我们第一次启动虚拟机时,我们会调用它使所有东西进入一个干净的状态。
compiler.c,在compile()方法中添加代码:
initScanner(source);
// 新增部分开始
Compiler compiler;
initCompiler(&compiler);
// 新增部分结束
compilingChunk = chunk;
Our compiler has the data it needs, but not the operations on that data. There’s no way to create and destroy scopes, or add and resolve variables. We’ll add those as we need them. First, let’s start building some language features.
我们的编译器有了它需要的数据,但还没有对这些数据的操作。没有办法创建或销毁作用域,添加和解析变量。我们会在需要的时候添加这些功能。首先,让我们开始构建一些语言特性。
Before we can have any local variables, we need some local scopes. These come from two things: function bodies and blocks. Functions are a big chunk of work that we’ll tackle in a later chapter, so for now we’re only going to do blocks. As usual, we start with the syntax. The new grammar we’ll introduce is:
在能够使用局部变量之前,我们需要一些局部作用域。它们来自于两方面:函数体和代码块。函数是一大块工作,我们在后面的章节中处理,因此现在我们只做块6。和往常一样,我们从语法开始。我们要介绍的新语法是:
statement → exprStmt
| printStmt
| block ;
block → "{" declaration* "}" ;
Blocks are a kind of statement, so the rule for them goes in the
statement
production. The corresponding code to compile one looks like this:
块是一种语句,所以它的规则是在statement
生成式中。对应的编译代码如下:
compiler.c,在statement()方法中添加代码:
if (match(TOKEN_PRINT)) {
printStatement();
// 新增部分开始
} else if (match(TOKEN_LEFT_BRACE)) {
beginScope();
block();
endScope();
// 新增部分结束
} else {
After parsing the initial curly brace, we use this helper function to compile the rest of the block:
解析完开头的花括号之后,我们使用这个辅助函数7来编译块的其余部分:
compiler.c,在expression()方法后添加代码:
static void block() {
while (!check(TOKEN_RIGHT_BRACE) && !check(TOKEN_EOF)) {
declaration();
}
consume(TOKEN_RIGHT_BRACE, "Expect '}' after block.");
}
It keeps parsing declarations and statements until it hits the closing brace. As we do with any loop in the parser, we also check for the end of the token stream. This way, if there’s a malformed program with a missing closing curly, the compiler doesn’t get stuck in a loop.
它会一直解析声明和语句,直到遇见右括号。就像我们在解析器中的所有循环一样,我们也要检查标识流是否结束。这样一来,如果有一个格式不正确的程序缺少右括号,编译器也不会卡在循环里。
Executing a block simply means executing the statements it contains, one after the other, so there isn’t much to compiling them. The semantically interesting thing blocks do is create scopes. Before we compile the body of a block, we call this function to enter a new local scope:
执行代码块只是意味着一个接一个地执行其中包含的语句,所以不需要编译它们。从语义上讲,块所做的事就是创建作用域。在我们编译块的主体之前,我们会调用这个函数进入一个新的局部作用域:
compiler.c,在endCompiler()方法后添加代码:
static void beginScope() {
current->scopeDepth++;
}
In order to “create” a scope, all we do is increment the current depth. This is certainly much faster than jlox, which allocated an entire new HashMap for each one. Given
beginScope()
, you can probably guess whatendScope()
does.
为了“创建”一个作用域,我们所做的就是增加当前的深度。这当然比jlox快得多,因为jlox为每个作用域分配了全新的HashMap。有了beginScope()
,你大概能猜到endScope()
会做什么。
compiler.c,在beginScope()方法后添加代码:
static void endScope() {
current->scopeDepth--;
}
That’s it for blocks and scopes—more or less—so we’re ready to stuff some variables into them.
这就是块和作用域的全部内容——或多或少吧——现在我们准备在其中添加一些变量。
Usually we start with parsing here, but our compiler already supports parsing and compiling variable declarations. We’ve got
var
statements, identifier expressions and assignment in there now. It’s just that the compiler assumes all variables are global. So we don’t need any new parsing support, we just need to hook up the new scoping semantics to the existing code.
通常我们会从解析开始,但是我们的编译器已经支持了解析和编译变量声明。我们现在已经有了var
语句、标识符表达式和赋值语句。只是编译器假设所有的变量都是全局变量。所以,我们不需要任何新的解析支持,我们只需要将新的作用域语义与已有的代码连接起来。
Variable declaration parsing begins in
varDeclaration()
and relies on a couple of other functions. First,parseVariable()
consumes the identifier token for the variable name, adds its lexeme to the chunk’s constant table as a string, and then returns the constant table index where it was added. Then, aftervarDeclaration()
compiles the initializer, it callsdefineVariable()
to emit the bytecode for storing the variable’s value in the global variable hash table.
变量声明的解析从varDeclaration()
开始,并依赖于其它几个函数。首先,parseVariable()
会使用标识符标识作为变量名称,将其词素作为字符串添加到字节码块的常量表中,然后返回它的常量表索引。接着,在varDeclaration()
编译完初始化表达式后,会调用defineVariable()
生成字节码,将变量的值存储到全局变量哈希表中。
Both of those helpers need a few changes to support local variables. In
parseVariable()
, we add:
这两个辅助函数都需要一些调整以支持局部变量。在parseVariable()
中,我们添加:
compiler.c,在parseVariable()方法中添加代码:
consume(TOKEN_IDENTIFIER, errorMessage);
// 新增部分开始
declareVariable();
if (current->scopeDepth > 0) return 0;
// 新增部分结束
return identifierConstant(&parser.previous);
First, we “declare” the variable. I’ll get to what that means in a second. After that, we exit the function if we’re in a local scope. At runtime, locals aren’t looked up by name. There’s no need to stuff the variable’s name into the constant table, so if the declaration is inside a local scope, we return a dummy table index instead.
首先,我们“声明”这个变量。我一会儿会说到这是什么意思。之后,如果我们在局部作用域中,则退出函数。在运行时,不会通过名称查询局部变量。不需要将变量的名称放入常量表中,所以如果声明在局部作用域内,则返回一个假的表索引。
Over in
defineVariable()
, we need to emit the code to store a local variable if we’re in a local scope. It looks like this:
在defineVariable()
中,如果处于局部作用域内,就需要生成一个字节码来存储局部变量。它看起来是这样的:
compiler.c,在defineVariable()方法中添加代码:
static void defineVariable(uint8_t global) {
// 新增部分开始
if (current->scopeDepth > 0) {
return;
}
// 新增部分结束
emitBytes(OP_DEFINE_GLOBAL, global);
Wait, what? Yup. That’s it. There is no code to create a local variable at runtime. Think about what state the VM is in. It has already executed the code for the variable’s initializer (or the implicit
nil
if the user omitted an initializer), and that value is sitting right on top of the stack as the only remaining temporary. We also know that new locals are allocated at the top of the stack . . . right where that value already is. Thus, there’s nothing to do. The temporary simply becomes the local variable. It doesn’t get much more efficient than that.
等等,什么?是的,就是这样。没有代码会在运行时创建局部变量。想想虚拟机现在处于什么状态。它已经执行了变量初始化表达式的代码(如果用户省略了初始化,则是隐式的nil
),并且该值作为唯一保留的临时变量位于栈顶。我们还知道,新的局部变量会被分配到栈顶……这个值已经在那里了。因此,没有什么可做的。临时变量直接成为局部变量。没有比这更有效的方法了。
OK, so what’s “declaring” about? Here’s what that does:
好的,那“声明”是怎么回事呢?它的作用如下:
compiler.c,在identifierConstant()方法后添加代码:
static void declareVariable() {
if (current->scopeDepth == 0) return;
Token* name = &parser.previous;
addLocal(*name);
}
This is the point where the compiler records the existence of the variable. We only do this for locals, so if we’re in the top-level global scope, we just bail out. Because global variables are late bound, the compiler doesn’t keep track of which declarations for them it has seen.
在这里,编译器记录变量的存在。我们只对局部变量这样做,所以如果在顶层全局作用域中,就直接退出。因为全局变量是后期绑定的,所以编译器不会跟踪它所看到的关于全局变量的声明。
But for local variables, the compiler does need to remember that the variable exists. That’s what declaring it does—it adds it to the compiler’s list of variables in the current scope. We implement that using another new function.
但是对于局部变量,编译器确实需要记住变量的存在。这就是声明的作用——将变量添加到编译器在当前作用域内的变量列表中。我们使用另一个新函数来实现这一点。
compiler.c,在identifierConstant()方法后添加代码:
static void addLocal(Token name) {
Local* local = ¤t->locals[current->localCount++];
local->name = name;
local->depth = current->scopeDepth;
}
This initializes the next available Local in the compiler’s array of variables. It stores the variable’s name and the depth of the scope that owns the variable.
这会初始化编译器变量数组中下一个可用的Local。它存储了变量的名称和持有变量的作用域的深度8。
Our implementation is fine for a correct Lox program, but what about invalid code? Let’s aim to be robust. The first error to handle is not really the user’s fault, but more a limitation of the VM. The instructions to work with local variables refer to them by slot index. That index is stored in a single-byte operand, which means the VM only supports up to 256 local variables in scope at one time.
我们的实现对于一个正确的Lox程序来说是没有问题的,但是对于无效的代码呢?我们还是以稳健为目标。第一个要处理的错误其实不是用户的错,而是虚拟机的限制。使用局部变量的指令通过槽的索引来引用变量。该索引存储在一个单字节操作数中,这意味着虚拟机一次最多只能支持256个局部变量。
If we try to go over that, not only could we not refer to them at runtime, but the compiler would overwrite its own locals array, too. Let’s prevent that.
如果我们试图超过这个范围,不仅不能在运行时引用变量,而且编译器也会覆盖自己的局部变量数组。我们要防止这种情况。
compiler.c,在addLocal()方法中添加代码:
static void addLocal(Token name) {
// 新增部分开始
if (current->localCount == UINT8_COUNT) {
error("Too many local variables in function.");
return;
}
// 新增部分结束
Local* local = ¤t->locals[current->localCount++];
The next case is trickier. Consider:
接下来的情况就有点棘手了。考虑一下:
{
var a = "first";
var a = "second";
}
At the top level, Lox allows redeclaring a variable with the same name as a previous declaration because that’s useful for the REPL. But inside a local scope, that’s a pretty weird thing to do. It’s likely to be a mistake, and many languages, including our own Lox, enshrine that assumption by making this an error.
在顶层,Lox允许使用与之前声明的变量相同的名称重新声明一个变量,因为这在REPL中很有用。但在局部作用域中,这就有些奇怪了。这很可能是一个误用,许多语言(包括我们的Lox)都把它作为一个错误9。
Note that the above program is different from this one:
请注意,上面的代码跟这个是不同的:
{
var a = "outer";
{
var a = "inner";
}
}
It’s OK to have two variables with the same name in different scopes, even when the scopes overlap such that both are visible at the same time. That’s shadowing, and Lox does allow that. It’s only an error to have two variables with the same name in the same local scope.
在不同的作用域中有两个同名变量是可以的,即使作用域重叠,以至于两个变量是同时可见的。这就是遮蔽,而Lox确实允许这样做。只有在同一个局部作用域中有两个同名的变量才是错误的。
We detect that error like so:
我们这样检测这个错误10:
compiler.c,在declareVariable()方法中添加代码:
Token* name = &parser.previous;
// 新增部分开始
for (int i = current->localCount - 1; i >= 0; i--) {
Local* local = ¤t->locals[i];
if (local->depth != -1 && local->depth < current->scopeDepth) {
break;
}
if (identifiersEqual(name, &local->name)) {
error("Already a variable with this name in this scope.");
}
}
// 新增部分结束
addLocal(*name);
}
Local variables are appended to the array when they’re declared, which means the current scope is always at the end of the array. When we declare a new variable, we start at the end and work backward, looking for an existing variable with the same name. If we find one in the current scope, we report the error. Otherwise, if we reach the beginning of the array or a variable owned by another scope, then we know we’ve checked all of the existing variables in the scope.
局部变量在声明时被追加到数组中,这意味着当前作用域始终位于数组的末端。当我们声明一个新的变量时,我们从末尾开始,反向查找具有相同名称的已有变量。如果是当前作用域中找到,我们就报告错误。此外,如果我们已经到达了数组开头或另一个作用域中的变量,我们就知道已经检查了当前作用域中的所有现有变量。
To see if two identifiers are the same, we use this:
为了查看两个标识符是否相同,我们使用这个方法:
compiler.c,在identifierConstant()方法后添加代码:
static bool identifiersEqual(Token* a, Token* b) {
if (a->length != b->length) return false;
return memcmp(a->start, b->start, a->length) == 0;
}
Since we know the lengths of both lexemes, we check that first. That will fail quickly for many non-equal strings. If the lengths are the same, we check the characters using
memcmp()
. To get tomemcmp()
, we need an include.
既然我们知道两个词素的长度,那我们首先检查它11。对于很多不相等的字符串,在这一步就很快失败了。如果长度相同,我们就使用memcmp()
检查字符。为了使用memcmp()
,我们需要引入一下。
compiler.c,添加代码:
#include <stdlib.h>
// 新增部分开始
#include <string.h>
// 新增部分结束
#include "common.h"
有了这个,我们就能创造出变量。但是,它们会停留在声明它们的作用域之外,像幽灵一样。当一个代码块结束时,我们需要让其中的变量安息。
compiler.c,在endScope()方法中添加代码:
current->scopeDepth--;
// 新增部分开始
while (current->localCount > 0 &&
current->locals[current->localCount - 1].depth >
current->scopeDepth) {
emitByte(OP_POP);
current->localCount--;
}
// 新增部分结束
}
When we pop a scope, we walk backward through the local array looking for any variables declared at the scope depth we just left. We discard them by simply decrementing the length of the array.
当我们弹出一个作用域时,后向遍历局部变量数组,查找在刚刚离开的作用域深度上声明的所有变量。我们通过简单地递减数组长度来丢弃它们。
There is a runtime component to this too. Local variables occupy slots on the stack. When a local variable goes out of scope, that slot is no longer needed and should be freed. So, for each variable that we discard, we also emit an
OP_POP
instruction to pop it from the stack.
这里也有一个运行时的因素。局部变量占用了堆栈中的槽位。当局部变量退出作用域时,这个槽就不再需要了,应该被释放。因此,对于我们丢弃的每一个变量,我们也要生成一条OP_POP
指令,将其从栈中弹出12。
We can now compile and execute local variable declarations. At runtime, their values are sitting where they should be on the stack. Let’s start using them. We’ll do both variable access and assignment at the same time since they touch the same functions in the compiler.
我们现在可以编译和执行局部变量的声明了。在运行时,它们的值就在栈中应在的位置上。让我们开始使用它们吧。我们会同时完成变量访问和赋值,因为它们在编译器中涉及相同的函数。
We already have code for getting and setting global variables, and—like good little software engineers—we want to reuse as much of that existing code as we can. Something like this:
我们已经有了获取和设置全局变量的代码,而且像优秀的小软件工程师一样,我们希望尽可能多地重用现有的代码。就像这样:
compiler.c,在namedVariable()方法中替换1行:
static void namedVariable(Token name, bool canAssign) {
// 替换部分开始
uint8_t getOp, setOp;
int arg = resolveLocal(current, &name);
if (arg != -1) {
getOp = OP_GET_LOCAL;
setOp = OP_SET_LOCAL;
} else {
arg = identifierConstant(&name);
getOp = OP_GET_GLOBAL;
setOp = OP_SET_GLOBAL;
}
// 替换部分结束
if (canAssign && match(TOKEN_EQUAL)) {
Instead of hardcoding the bytecode instructions emitted for variable access and assignment, we use a couple of C variables. First, we try to find a local variable with the given name. If we find one, we use the instructions for working with locals. Otherwise, we assume it’s a global variable and use the existing bytecode instructions for globals.
我们不对变量访问和赋值对应的字节码指令进行硬编码,而是使用了一些C变量。首先,我们尝试查找具有给定名称的局部变量,如果我们找到了,就使用处理局部变量的指令。否则,我们就假定它是一个全局变量,并使用现有的处理全局变量的字节码。
A little further down, we use those variables to emit the right instructions. For assignment:
再往下一点,我们使用这些变量来生成正确的指令。对于赋值:
compiler.c,在namedVariable()方法中替换1行:
if (canAssign && match(TOKEN_EQUAL)) {
expression();
// 替换部分开始
emitBytes(setOp, (uint8_t)arg);
// 替换部分结束
} else {
And for access:
对于访问:
compiler.c,在namedVariable()方法中替换1行:
emitBytes(setOp, (uint8_t)arg);
} else {
// 替换部分开始
emitBytes(getOp, (uint8_t)arg);
// 替换部分结束
}
The real heart of this chapter, the part where we resolve a local variable, is here:
本章的核心,也就是解析局部变量的部分,在这里:
compiler.c,在identifiersEqual()方法后添加代码:
static int resolveLocal(Compiler* compiler, Token* name) {
for (int i = compiler->localCount - 1; i >= 0; i--) {
Local* local = &compiler->locals[i];
if (identifiersEqual(name, &local->name)) {
return i;
}
}
return -1;
}
For all that, it’s straightforward. We walk the list of locals that are currently in scope. If one has the same name as the identifier token, the identifier must refer to that variable. We’ve found it! We walk the array backward so that we find the last declared variable with the identifier. That ensures that inner local variables correctly shadow locals with the same name in surrounding scopes.
尽管如此,它还是很直截了当的。我们会遍历当前在作用域内的局部变量列表。如果有一个名称与标识符相同,则标识符一定指向该变量。我们已经找到了它!我们后向遍历数组,这样就能找到最后一个带有该标识符的已声明变量。这可以确保内部的局部变量能正确地遮蔽外围作用域中的同名变量。
At runtime, we load and store locals using the stack slot index, so that’s what the compiler needs to calculate after it resolves the variable. Whenever a variable is declared, we append it to the locals array in Compiler. That means the first local variable is at index zero, the next one is at index one, and so on. In other words, the locals array in the compiler has the exact same layout as the VM’s stack will have at runtime. The variable’s index in the locals array is the same as its stack slot. How convenient!
在运行时,我们使用栈中槽索引来加载和存储局部变量,因此编译器在解析变量之后需要计算索引。每当一个变量被声明,我们就将它追加到编译器的局部变量数组中。这意味着第一个局部变量在索引0的位置,下一个在索引1的位置,以此类推。换句话说,编译器中的局部变量数组的布局与虚拟机堆栈在运行时的布局完全相同。变量在局部变量数组中的索引与其在栈中的槽位相同。多么方便啊!
If we make it through the whole array without finding a variable with the given name, it must not be a local. In that case, we return
-1
to signal that it wasn’t found and should be assumed to be a global variable instead.
如果我们在整个数组中都没有找到具有指定名称的变量,那它肯定不是局部变量。在这种情况下,我们返回-1
,表示没有找到,应该假定它是一个全局变量。
Our compiler is emitting two new instructions, so let’s get them working. First is loading a local variable:
我们的编译器发出了两条新指令,我们来让它们发挥作用。首先是加载一个局部变量:
chunk.h,在枚举OpCode中添加代码:
OP_POP,
// 新增部分开始
OP_GET_LOCAL,
// 新增部分结束
OP_GET_GLOBAL,
And its implementation:
还有其实现13:
vm.c,在run()方法中添加代码:
case OP_POP: pop(); break;
// 新增部分开始
case OP_GET_LOCAL: {
uint8_t slot = READ_BYTE();
push(vm.stack[slot]);
break;
}
// 新增部分结束
case OP_GET_GLOBAL: {
It takes a single-byte operand for the stack slot where the local lives. It loads the value from that index and then pushes it on top of the stack where later instructions can find it.
它接受一个单字节操作数,用作局部变量所在的栈槽。它从索引处加载值,然后将其压入栈顶,在后面的指令可以找到它。
Next is assignment:
接下来是赋值:
chunk.h,在枚举OpCode中添加代码:
OP_GET_LOCAL,
// 新增部分开始
OP_SET_LOCAL,
// 新增部分结束
OP_GET_GLOBAL,
You can probably predict the implementation.
你大概能预测到它的实现。
vm.c,在run()方法中添加代码:
}
// 新增部分开始
case OP_SET_LOCAL: {
uint8_t slot = READ_BYTE();
vm.stack[slot] = peek(0);
break;
}
// 新增部分结束
case OP_GET_GLOBAL: {
It takes the assigned value from the top of the stack and stores it in the stack slot corresponding to the local variable. Note that it doesn’t pop the value from the stack. Remember, assignment is an expression, and every expression produces a value. The value of an assignment expression is the assigned value itself, so the VM just leaves the value on the stack.
它从栈顶获取所赋的值,然后存储到与局部变量对应的栈槽中。注意,它不会从栈中弹出值。请记住,赋值是一个表达式,而每个表达式都会产生一个值。赋值表达式的值就是所赋的值本身,所以虚拟机要把值留在栈上。
Our disassembler is incomplete without support for these two new instructions.
如果不支持这两条新指令,我们的反汇编程序就不完整了。
debug.c,在disassembleInstruction()方法中添加代码:
return simpleInstruction("OP_POP", offset);
// 新增部分开始
case OP_GET_LOCAL:
return byteInstruction("OP_GET_LOCAL", chunk, offset);
case OP_SET_LOCAL:
return byteInstruction("OP_SET_LOCAL", chunk, offset);
// 新增部分结束
case OP_GET_GLOBAL:
The compiler compiles local variables to direct slot access. The local variable’s name never leaves the compiler to make it into the chunk at all. That’s great for performance, but not so great for introspection. When we disassemble these instructions, we can’t show the variable’s name like we could with globals. Instead, we just show the slot number.
编译器将局部变量编译为直接的槽访问。局部变量的名称永远不会离开编译器,根本不可能进入字节码块。这对性能很好,但对内省(自我观察)来说就不那么好了。当我们反汇编这些指令时,我们不能像全局变量那样使用变量名称。相反,我们只显示槽号14。
debug.c,在simpleInstruction()方法后添加代码:
static int byteInstruction(const char* name, Chunk* chunk,
int offset) {
uint8_t slot = chunk->code[offset + 1];
printf("%-16s %4d\n", name, slot);
return offset + 2;
}
We already sunk some time into handling a couple of weird edge cases around scopes. We made sure shadowing works correctly. We report an error if two variables in the same local scope have the same name. For reasons that aren’t entirely clear to me, variable scoping seems to have a lot of these wrinkles. I’ve never seen a language where it feels completely elegant.
我们已经花了一些时间来处理部分关于作用域的奇怪的边界情况。我们确保变量遮蔽能正确工作。如果同一个局部作用域中的两个变量具有相同的名称,我们会报告错误。由于我并不完全清楚的原因,变量作用域似乎有很多这样的问题。我从来没有见过一种语言让人感觉绝对优雅15。
We’ve got one more edge case to deal with before we end this chapter. Recall this strange beastie we first met in jlox’s implementation of variable resolution:
在本章结束之前,我们还有一个边界情况需要处理。回顾一下我们第一次在jlox中实现变量解析时,遇到的这个奇怪的东西:
{
var a = "outer";
{
var a = a;
}
}
We slayed it then by splitting a variable’s declaration into two phases, and we’ll do that again here:
我们当时通过将一个变量的声明拆分为两个阶段来解决这个问题,在这里我们也要这样做:
As soon as the variable declaration begins—in other words, before its initializer—the name is declared in the current scope. The variable exists, but in a special “uninitialized” state. Then we compile the initializer. If at any point in that expression we resolve an identifier that points back to this variable, we’ll see that it is not initialized yet and report an error. After we finish compiling the initializer, we mark the variable as initialized and ready for use.
一旦变量声明开始——换句话说,在它的初始化式之前——名称就会在当前作用域中声明。变量存在,但处于特殊的“未初始化”状态。然后我们编译初始化式。如果在表达式中的任何一个时间点,我们解析了一个指向该变量的标识符,我们会发现它还没有初始化,并报告错误。在我们完成初始化表达式的编译之后,把变量标记为已初始化并可供使用。
To implement this, when we declare a local, we need to indicate the “uninitialized” state somehow. We could add a new field to Local, but let’s be a little more parsimonious with memory. Instead, we’ll set the variable’s scope depth to a special sentinel value,
-1
.
为了实现这一点,当声明一个局部变量时,我们需要以某种方式表明“未初始化”状态。我们可以在Local中添加一个新字段,但我们还是在内存方面更节省一些。相对地,我们将变量的作用域深度设置为一个特殊的哨兵值-1
。
compiler.c,在addLocal()方法中替换1行:
local->name = name;
// 替换部分开始
local->depth = -1;
// 替换部分结束
}
Later, once the variable’s initializer has been compiled, we mark it initialized.
稍后,一旦变量的初始化式编译完成,我们将其标记为已初始化。
compiler.c,在defineVariable()方法中添加代码:
if (current->scopeDepth > 0) {
// 新增部分开始
markInitialized();
// 新增部分结束
return;
}
That is implemented like so:
实现如下:
compiler.c,在parseVariable()方法后添加代码:
static void markInitialized() {
current->locals[current->localCount - 1].depth =
current->scopeDepth;
}
So this is really what “declaring” and “defining” a variable means in the compiler. “Declaring” is when the variable is added to the scope, and “defining” is when it becomes available for use.
所这就是编译器中“声明”和“定义”变量的真正含义。“声明”是指变量被添加到作用域中,而“定义”是变量可以被使用的时候。
When we resolve a reference to a local variable, we check the scope depth to see if it’s fully defined.
当解析指向局部变量的引用时,我们会检查作用域深度,看它是否被完全定义。
compiler.c,在resolveLocal()方法中添加代码:
if (identifiersEqual(name, &local->name)) {
// 新增部分开始
if (local->depth == -1) {
error("Can't read local variable in its own initializer.");
}
// 新增部分结束
return i;
If the variable has the sentinel depth, it must be a reference to a variable in its own initializer, and we report that as an error.
如果变量的深度是哨兵值,那这一定是在变量自身的初始化式中对该变量的引用,我们会将其报告为一个错误。
That’s it for this chapter! We added blocks, local variables, and real, honest-to-God lexical scoping. Given that we introduced an entirely different runtime representation for variables, we didn’t have to write a lot of code. The implementation ended up being pretty clean and efficient.
这一章就讲到这里!我们添加了块、局部变量和真正的词法作用域。鉴于我们为变量引入了完全不同的运行时表示形式,我们不必编写很多代码。这个实现最终是相当干净和高效的。
You’ll notice that almost all of the code we wrote is in the compiler. Over in the runtime, it’s just two little instructions. You’ll see this as a continuing trend in clox compared to jlox. One of the biggest hammers in the optimizer’s toolbox is pulling work forward into the compiler so that you don’t have to do it at runtime. In this chapter, that meant resolving exactly which stack slot every local variable occupies. That way, at runtime, no lookup or resolution needs to happen.
你会注意到,我们写的几乎所有的代码都在编译器中。在运行时,只有两个小指令。你会看到,相比于jlox,这是clox中的一个持续的趋势16。优化器工具箱中最大的锤子就是把工作提前到编译器中,这样你就不必在运行时做这些工作了。在本章中,这意味着要准确地解析每个局部变量占用的栈槽。这样,在运行时就不需要进行查找或解析。
-
Our simple local array makes it easy to calculate the stack slot of each local variable. But it means that when the compiler resolves a reference to a variable, we have to do a linear scan through the array.
Come up with something more efficient. Do you think the additional complexity is worth it?
我们这个简单的局部变量数组使得计算每个局部变量的栈槽很容易。但这意味着,当编译器解析一个变量的引用时,我们必须对数组进行线性扫描。
想出一些更有效的方法。你认为这种额外的复杂性是否值得?
-
How do other languages handle code like this:
其它语言中如何处理这样的代码:
var a = a;
What would you do if it was your language? Why?
如果这是你的语言,你会怎么做?为什么?
-
Many languages make a distinction between variables that can be reassigned and those that can’t. In Java, the
final
modifier prevents you from assigning to a variable. In JavaScript, a variable declared withlet
can be assigned, but one declared usingconst
can’t. Swift treatslet
as single-assignment and usesvar
for assignable variables. Scala and Kotlin useval
andvar
.Pick a keyword for a single-assignment variable form to add to Lox. Justify your choice, then implement it. An attempt to assign to a variable declared using your new keyword should cause a compile error.
许多语言中,对可以重新赋值的变量与不能重新赋值的变量进行了区分。在Java中,
final
修饰符可以阻止你对变量进行赋值。在JavaScript中,用let
声明的变量可以被赋值,但用const
声明的变量不能被赋值。Swift
将let
视为单次赋值,并对可赋值变量使用var
。Scala
和Kotlin
则使用val
和var
。选一个关键字作为单次赋值变量的形式添加到Lox中。解释一下你的选择,然后实现它。试图赋值给一个用新关键字声明的变量应该会引起编译错误。
-
Extend clox to allow more than 256 local variables to be in scope at a time.
扩展Lox,允许作用域中同时有超过256个局部变量。
Footnotes
-
函数参数也被大量使用。它们也像局部变量一样工作,因此我们将会对它们使用同样的实现技术。 ↩
-
这种排列方式显然不是巧合。我将Lox设计成可以单遍编译为基于堆栈的字节码。但我没必要为了适应这些限制对语言进行过多的调整。它的大部分设计应该感觉很自然。
这在很大程度上是因为语言的历史与单次编译紧密联系在一起,其次是基于堆栈的架构。Lox的块作用域遵循的传统可以追溯到BCPL。作为程序员,我们对一门语言中什么是“正常”的直觉,即使在今天也会受到过去的硬件限制的影响。 ↩ -
在本章中,局部变量从虚拟机堆栈数组的底部开始,并在那里建立索引。当我们添加函数时,这个方案就变得有点复杂了。每个函数都需要自己的堆栈区域来存放参数和局部变量。但是,正如我们将看到的,这并没有如你所想那样增加太多的复杂性。 ↩
-
我们正在编写一个单遍编译器,所以对于如何在数组中对变量进行排序,我们并没有太多的选择。 ↩
-
特别说明,如果我们想在多线程应用程序中使用编译器(可能有多个编译器并行运行),那么使用全局变量是一个坏主意。 ↩
-
仔细想想,“块”是个奇怪的名字。作为比喻来说,“块”通常意味着一个不可分割的小单元,但出于某种原因,Algol 60委员会决定用它来指代一个复合结构——一系列语句。我想,还有更糟的情况,Algol 58将
begin
和end
称为“语句括号”。 ↩ -
在后面编译函数体时,这个方法会派上用场。 ↩
-
担心作为变量名称的字符串的生命周期吗?Local直接存储了标识符对应Token结构体的副本。Token存储了一个指向其词素中第一个字符的指针,以及词素的长度。该指针指向正在编译的脚本或REPL输入语句的源字符串。
只要这个字符串在整个编译过程中存在——你知道,它一定存在,我们正在编译它——那么所有指向它的标识都是正常的。 ↩ -
有趣的是,Rust语言确实允许这样做,而且惯用代码也依赖于此。 ↩
-
暂时先不用关心那个奇怪的
depth != -1
部分。我们稍后会讲到。 ↩ -
如果我们能检查它们的哈希值,将是一个不错的小优化,但标识不是完整的LoxString,所以我们还没有计算出它们的哈希值。 ↩
-
当多个局部变量同时退出作用域时,你会得到一系列的
OP_POP
指令,这些指令会被逐个解释。你可以在你的Lox实现中添加一个简单的优化,那就是专门的OP_POPN
指令,该指令接受一个操作数,作为弹出的槽位的数量,并一次性弹出所有槽位。 ↩ -
把局部变量的值压到栈中似乎是多余的,因为它已经在栈中较低的某个位置了。问题是,其它字节码指令只能查找栈顶的数据。这也是我们的字节码指令集基于堆栈的主要表现。基于寄存器的字节码指令集避免了这种堆栈技巧,其代价是有着更多操作数的大型指令。 ↩
-
如果我们想为虚拟机实现一个调试器,在编译器中擦除局部变量名称是一个真正的问题。当用户逐步执行代码时,他们希望看到局部变量的值按名称排列。为了支持这一点,我们需要输出一些额外的信息,以跟踪每个栈槽中的局部变量的名称。 ↩
-
没有,即便Scheme也不是。 ↩
-
你可以把静态类型看作是这种趋势的一个极端例子。静态类型语言将所有的类型分析和类型错误处理都在编译过程中进行了整理。这样,运行时就不必浪费时间来检查值是否具有适合其操作的类型。事实上,在一些静态类型语言(如C)中,你甚至不知道运行时的类型。编译器完全擦除值类型的任何表示,只留下空白的比特位。 ↩