There are a few obvious answers. One is simply to disallow multiple statements on the same visual line (even if they are closely related and idiomatic). Another is to define the semicolon (or equivalent) as a separator, with the side effect that you can no longer have a single statement split across multiple visual lines. Another is to add, along with the ‘separator’ solution, an additional symbol for splitting long statements across multiple visual lines, as in earlier Visual Basic. And yet another option is to have a separator and “guess” whether the programmer meant a line break to end a statement or not, as in JavaScript and modern Visual Basic.
You can also mix approaches: optional semicolons, but use indentation to guess if it’s the same instruction or not. That way:
// 3 instructions
blah; blah
blah
// 2 instructions
blah blah
blah
// 1 instruction (indentation is significant!)
blah blah
    blah
// This one is tricky. I'd say syntax error, or a warning
blah; blah
  blah
// This looks better, more obvious to me: 2 instructions
blah; blah
      blah
// Alternatively, this one may also count for 2 instructions
blah; blah
        blah
// begin of a new block
appropriate_keyword blah blah
    blah
// end of a block (one instruction in the inner block,
// one instruction in the outer block).
    blah
blah
// 2 instructions (but frankly, I'd provide a warning for the superfluous ";")
blah;
blah;
This should be flexible enough and unambiguous enough.
In Python, you are supposed to write a colon before you start a block, right?
So the rules can be rather simple:
colon, with indentation = start of a new block
colon, no indentation = an empty block (or a syntax error)
no colon, with indentation = continuation of the previous line
no colon, no indentation = next statement
semicolon = statement boundary
A block ends where the indentation returns to the level of the line that opened the block. A continued line ends when the indentation returns to the level of the starting line. (Where “to the level” = to the level, or below the level.)
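As a rough illustration of the rules above, here is a minimal Python sketch that labels each line of a snippet. The function name `classify_lines` and the label strings are made up for this example; it assumes indentation uses spaces only, and it leaves out the semicolon rule for brevity.

```python
def classify_lines(lines):
    """Label each non-blank line using the colon/indentation rules (hypothetical helper)."""
    labels = []
    stmt_indent = None   # indentation of the line that started the current statement
    prev_colon = False   # did the previous non-blank line end with a colon?
    for line in lines:
        text = line.rstrip()
        if not text.strip():
            continue     # blank lines carry no information in this sketch
        indent = len(text) - len(text.lstrip(" "))
        if stmt_indent is None:
            label = "statement"
            stmt_indent = indent
        elif prev_colon and indent > stmt_indent:
            label = "block start"                  # colon, with indentation
            stmt_indent = indent
        elif prev_colon:
            label = "statement after empty block"  # colon, no indentation
            stmt_indent = indent
        elif indent > stmt_indent:
            label = "continuation"                 # no colon, with indentation
        else:
            label = "statement"                    # no colon, same level or below
            stmt_indent = indent
        labels.append(label)
        prev_colon = text.endswith(":")
    return labels


example = [
    "if ready:",
    "    total = price",
    "        + tax",
    "    ship(total)",
    "done()",
]
print(classify_lines(example))
# ['statement', 'block start', 'continuation', 'statement', 'statement']
```

Note that a dedent needs no special case here: returning to the level of the opening line (or below it) simply starts the next statement, which is what closes both continued lines and blocks.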
I spent a lot of time thinking about this, and now it seems to me that this is the wrong question. The right question is: “how do we make the most legible language?” Maybe it will require some changes to the concept of “statement”.
Why does one statement plus one statement make two statements, but one expression plus one expression make one expression? Why is “x=1; y=1;” two units, but “(x == 1) && (y == 1)” one unit? What happens if a statement is a part of an expression, in an inline anonymous function? Where should we place semicolons or line breaks then?
Sorry, I don’t have a good answer. As a half-good answer, I would go with the early VB syntax: the rule is unambiguous (unlike some JavaScript rules), and it requires a special symbol in a special situation (as opposed to using a special symbol in a non-special situation).
Another half-good answer: use four-space tabs for “this is the next statement” and a half-tab (two spaces) for “here continues the previous line”. (If the statement has more than two lines, all the lines except the first one are aligned the same; the half-tabs don’t accumulate.)
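The half-tab convention can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name `classify_half_tab` is hypothetical, indentation is assumed to be spaces only, and any indentation that is neither a multiple of four nor a multiple of four plus two is treated as an error.

```python
def classify_half_tab(line):
    """Return (kind, depth) for one line under the 4-space / 2-space convention."""
    indent = len(line) - len(line.lstrip(" "))
    depth, remainder = divmod(indent, 4)
    if remainder == 0:
        return ("statement", depth)      # full tabs only: a new statement
    if remainder == 2:
        return ("continuation", depth)   # full tabs plus one half-tab
    raise ValueError(f"indentation of {indent} spaces is neither a tab nor a half-tab")


print(classify_half_tab("    total = price"))  # ('statement', 1)
print(classify_half_tab("      + tax"))        # ('continuation', 1)
print(classify_half_tab("      + shipping"))   # ('continuation', 1)
```

Because half-tabs do not accumulate, the second and third continuation lines get the same indentation and the same classification.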
Why does one statement plus one statement make two statements, but one expression plus one expression make one expression? Why is “x=1; y=1;” two units, but “(x == 1) && (y == 1)” one unit?
Because a statement is the fundamental unit of an imperative language. If “x=1; y=1;” were one unit, it would be one statement. Technically, on another level, multiple statements enclosed in braces are a single statement. Your objection does suggest another solution I forgot to put in: ban arbitrarily complex expressions. Then statements are of bounded length and have no need to span multiple lines. The obvious example of a language that makes this choice is assembly.
What happens if a statement is a part of an expression, in an inline anonymous function? Where should we place semicolons or line breaks then?
You could ban inline anonymous functions, or require them to be a single expression. You could implement half of Lisp as named functions that are building blocks for your “single expression” anonymous functions, so this doesn’t necessarily lose expressive power.
As a half-good answer, I would go with the early VB syntax
That Microsoft changed it is weak evidence against it: it suggests that people really don’t like having to add that extra symbol. There is that ambiguity problem, though. (JavaScript’s rule* technically requires an arbitrarily large amount of lookahead. I think the modern VB rule is more sane from a compiler perspective, but it can still have annoying consequences.)
Your “other half-good answer” isn’t really very distinct from the first: the half-tab takes the role of the special symbol; putting it at the beginning of the line just changes how you specify the grammar. (Vim scripting is an example of an existing language that uses a symbol at the beginning of a line for continuations.) It also creates an extra burden (even compared to current whitespace-sensitive languages like Python) to maintain the indentation correctly. In particular, it forbids you from adding lots of extra indentation to, for example, line up the second part of a statement with a similar element on the first line. Think of making a C-style function call and then indenting subsequent lines to the point where the opening bracket of the argument list was, or, more generally, indenting to the opening bracket of the innermost still-open group.
*Technical note: JavaScript’s rule is “put in a semicolon if leaving it out leads to a syntax error”. VB’s rule is, more or less, “continue the statement if ending it at the linebreak leads to a syntax error”. In general, this will lead to JavaScript continuing statements in unexpected places, and to VB terminating statements in unexpected places.
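To make the difference between the two rules concrete, here is a toy Python sketch. It is not how either language is actually implemented: it uses Python's own grammar (via the standard `ast.parse`) as a stand-in for "is this a syntax error?", it works on whole lines rather than tokens, and it looks only one line ahead. The names `parses`, `split_vb_style` and `split_js_style` are invented for the example.

```python
import ast


def parses(text):
    """True if `text` is a complete statement in the stand-in grammar (Python's)."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


def split_vb_style(lines):
    """End the statement at each line break unless ending it there is a syntax error."""
    statements, current = [], ""
    for line in lines:
        if not line.strip():
            continue
        current = f"{current} {line.strip()}".strip()
        if parses(current):
            statements.append(current)
            current = ""
    if current:
        statements.append(current)  # incomplete trailing fragment, kept as-is
    return statements


def split_js_style(lines):
    """Keep joining lines; break only where joining would be a syntax error."""
    statements, current = [], ""
    for line in lines:
        if not line.strip():
            continue
        joined = f"{current} {line.strip()}".strip()
        if current and parses(current) and not parses(joined):
            statements.append(current)
            current = line.strip()
        else:
            current = joined
    if current:
        statements.append(current)
    return statements


lines = ["a = b", "(c + d)"]
print(split_vb_style(lines))  # ['a = b', '(c + d)']
print(split_js_style(lines))  # ['a = b (c + d)']
```

The same two lines come out as two statements under the VB-like rule and as a single function call under the JavaScript-like rule, which is the kind of surprise the note describes.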
Because a statement is the fundamental unit of an imperative language.
I don’t believe this is true, at least not for the usual sense of “statement”, which is “code with side effects which, unlike an expression, has no type (not even unit/void) and does not evaluate to a value”.
You can easily make a language with no statements, just expressions. As an example, start with C. Remove the semicolon and replace all uses of it with the comma operator. You may need to adjust the semantics very slightly to compensate (I can’t say where offhand).
Presto, you have a statement-less language that looks quite functional: everything (other than definitions) is an expression (i.e. has a type and yields a value), and every program corresponds to the evaluation of a nested tree of expressions (rather than the execution of a sequence of statements).
Yet, the expressions have side effects upon evaluation, there is global shared mutable state, there are variables, there is a strict and well-defined eager order of evaluation—all the semantics of C are intact. Calling this a non-imperative language would be a matter of definition, I guess, but there’s no substantial difference between real C and this subset of it.
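The same idea can be sketched in Python, as an analogy rather than a translation of the C proposal. The helper `seq` is hypothetical and plays the role of C's comma operator; it relies on Python evaluating call arguments left to right, and the assignment expressions (`:=`) need Python 3.8 or later.

```python
def seq(*values):
    """Stand-in for C's comma operator: every argument is evaluated, the last is returned."""
    return values[-1]


log = []

# Statement style: three statements executed in order.
x = 1
y = x + 1
log.append(x + y)

# Expression style: one expression whose sub-expressions have the same side effects.
result = seq((a := 1), (b := a + 1), log.append(a + b), a + b)

print(result)  # 3
print(log)     # [3, 3]
```

The second version still mutates shared state in a fixed order, so it is just as imperative as the first, even though it contains no sequence of statements.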
Because a statement is the fundamental unit of an imperative language.
So the question “what kind of language are we trying to make?” must be answered before “what syntax would make it most legible?”.
Assuming an imperative language, the simplest solution would be one command per line, no exceptions. There is a scrollbar at the bottom; or you can split a long line into more lines by using temporary variables.
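A small Python sketch of the temporary-variable approach, with function names invented for the example: both functions compute the same value, but the second keeps every line short without needing any continuation syntax.

```python
import math


def distance_one_line(x1, y1, x2, y2):
    # Everything in one statement: one long line.
    return math.sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1))


def distance_with_temporaries(x1, y1, x2, y2):
    # One short command per line, using temporary variables.
    dx = x2 - x1
    dy = y2 - y1
    squared = dx * dx + dy * dy
    return math.sqrt(squared)


print(distance_one_line(0, 0, 3, 4))          # 5.0
print(distance_with_temporaries(0, 0, 3, 4))  # 5.0
```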
No syntax can make all programs legible. A good syntax is without exceptions and without unnecessary clutter. But if the user decides to write programs horribly, nothing can stop them.
An important choice is whether you make formatting significant (Python-style) or not. Making formatting significant has the advantage that you would probably format your code anyway, so the formatting can carry some information that then does not have to be written explicitly, e.g. by curly brackets. But people will complain that in some situations the freedom to use their own formatting would be better. You probably can’t make everyone happy.
What would you replace the semicolon with?