An
Automatic Compiler Syntax Error Checker
using
Binary/Tertiary Relations
ABSTRACT
We
present a simple yet powerful syntax error checker that can be automatically
constructed
from an input Context-Free grammar (CFG).
This
checker tests every three possibly consecutive tokens of a source computer
program
for their syntactic legitimacy.
Such
a checker can be incorporated into the scanner of a compiler
before
the parser can perform a complete syntax checking..
We
construct the checker directly from production rules of an input CFG
as
a 3-dimensional boolean array by compositions of some binary and
tertiary
relations that specifies whether or not three given ordered tokens
of
a source language can ever legitimately appear in that order
in
any program of the source language.
Although
clearly such an array only represents a limited syntax requirement
that
any syntactically correct program must satisfy, our checker will
provide
some benefits including an earlier syntax error detection
and
the reduction of a burden imposed on the parser.
Before
presenting the paper body, some preliminary comments are in
oder:
Key Ideas
1. Automatic Checker Generation from CFG (Context-Free Grammar)
Production rules.
2 Checking Syntax Correctness of every two consecutive Tokens.
{ int x , y ;
cin > x ; }
3. Checking Syntax Correctness of every three consecutive Tokens.
Clearly more powerful than #2 above.
Example:
Valid pairs:
int var
var <
Invalid triple:
int var <
***Note that in the above example of invalid triple it is the case that
(1) while two consecutive tokens taken as a pair look syntactically correct
(2) all three tokens taken together as a triple do not look correct and in fact
they represent a syntax error in most programming languages.
4. Taking advantage of some characteristics of CFG
a. Independent derivation capability of nonterminals.
b. Invariant derivation sequences.
c. Identification of some useful relations.
d. Practicability: Useful in specifying some important language elements
such as specifying
Lists of any number of identical or similar items
Operator Precedence
Operator Associativity
Order of Language Elements
I.
INTRODUCTION
The
error checking capability of a typical compiler scanner is limited to finding
illegal
tokens which are character strings appearing in the source program that cannot
possibly
be recognized as tokens.
Hence,
given a sequence of legal tokens each of which is legitimate by itself,
the
question of whether or not the whole sequence represents a syntactically
correct
program is left as a task of a parser.
We
showed (4) an automatic construction of a simple error checker that tests
every
two possibly consecutive tokens for their syntactic legitimacy
to
be used by the scanner while recognizing individual tokens at the same time.
This
paper is to extend above construction by considering three consecutive
tokens
instead. While this extension appears to be rather straightforward,
the
work involved turns out to be a bit substantial as we now have to
additionally
consider some tertiary relations.
II.
ASSUMTIONS
AND DEFINITIONS
Our
basic approach is to automatically construct a 3-dimensional boolean
array
of size t-by-t-by-t where t is the number of terminal symbols
of
the input grammar defining the source language.
Each
array element, TRIPLE[i, j, k], is true if and only if the i-th terminal,
the
j-th terminal and k-th terminal of an input grammar can possibly
appear
consecutively in that order in at least one syntactically correct
program.
Before
presenting the automatic and direct construction of this checker,
we
need to define some simplifying assumptions and
useful
relations on grammar symbols.
A.
Definitions of Some Special Nonterminals
1. A Single-terminal-deriving
nonterminal
A nonterminal is single-terminal-deriving if it can derive a string
consisting
only of a single terminal and no nonterminal at all, possibly among
other
strings.
2. A purely-single
nonterminal
A single-terminal-deriving nonterminal is purely-single if it does
not
derive any string consisting of two or more
terminals.
That is, those nonterminals that can derive only one terminal and nothing
more.
B.
Simplifying Assumptions
1. The input CFG has no
empty-production-rule.
This
assumption does not lose any generality for all practical purposes
as
it is shown that (1,2,3) any CFG with empty-production-rule(s)
can
be converted to an equivalent CFG with no such production rule
unless
the grammar derives an empty string.
2. Each nonterminal derives at least two
terminals.
This
assumption will very easily be satisfied if we introduce an additional
production
rule to the grammar which we obtain by replacing
a
single-terminal-deriving nonterminal appearing on the righthand
side
(RHS) of a production rule with the terminal this nonterminal
derives.
Furthermore,
if this single-terminal-deriving nonterminal is purely-single,
the
production rule can be deleted altogether from the grammar essentially
making
the newly introduced production rule merely replace it.
For
example, suppose that a nonterminal N is single-terminal-deriving
and
it derives a string consisting only of a terminal t and consisting
of
nothing else.
And
suppose that the grammar has the following production rule
that
has this nonterminal N appear on its RHS:
M -> p N
q
So,
in this case a newly introduced and added rule is:
M -> p t q
And,
if this nonterminal N is purely-single, then above existing rule
of
the grammar will have to be deleted from the grammar.
Of
course, this introduction of a new rule must be done for every such
one-terminal
string that this nonterminal N does derive,
not
just the one mentioned above.
C. Definitions of Binary
Relations
1.
Leftmost-deriving
We
say that the lefthand side nonterminal of a production rule leftmost-derives
the
first symbol on the righthand side of the rule.
This
relation certainly is transitive, as if a nonterminal N leftmost-derives
a
nonterminal B which, in turn, leftmost-derives C, then the nonterminal N
leftmost-derives
C (of course, taking one more step than the nonterminal B does).
2
Rightmost-deriving
Similarly,
we define the relation of "Rightmost-deriving".
And,
it is transitive and hence the number of necessary steps is immaterial
just
like the above relation "Leftmost-deriving."
3 Adjacency
We
say that two grammar symbols are adjacent if they appear
consecutively
somewhere on the righthand side of at least
one
production rule.
And,
this relation is not transitive, unlike the other two.
D. Definitions of Tertiary
Relations
1. Leftmost2-deriving
This
tertiary relation is a natural extension of the Leftmost-deriving relation
of
C(1) above as we say that the nonterminal on the lefthand side (LHS)
of
a production rule leftmost2-derives the first two symbols
(and
hence the two leftmost symbols) of the RHS string of the
rule.
2.
Rightmost2-deriving
We
similarly define this tertiary relation as the natural extension of
Rightmost-deriving
relation of C(2) above
.
3. Adjacency3
We
say that three grammar symbols are adjacent3 if they appear
consecutively
in that order at least once somewhere on the RHS
of
any production rule of the grammar.
And,
this tertiary relation is a natural extension of the Adjacency
relation
of C(3) above.
III. SOME NOTATIONS FOR RELATIONS
In
efficiently representing various relations, we use following boolean matrices
and
arrays with two suitable subscripts.
These
subscripts are:
t: representing the number
of terminal symbols (and equivalently their set).
n: representing the number of nonterminal symbols (and equivalently their
set).
v: representing the number of all grammar symbols, v=t+n,
(and equivalently their
set).
A. Notations for Binary
Relations.
For
the aforementioned three binary relations, we use the following
notations.
Lnv: This matrix represents whether a certain
symbol (any symbol)
can
be leftmost-derived by a certain nonterminal in exactly one
step.
We further decompose this matrix into two
submatrices:
Lnn, for that portion for nonterminals
deriving nonterminals
and
Lnt , for that portion for nonterminals
deriving terminals.
Rnv: This matrix represents whether a certain
symbol (any symbol)
can
be rightmost-derived by a certain nonterminal in exactly one
step.
We
also decompose this matrix just like the one above into:
Rnn and Rnt, similarly.
Avv: This matrix represents whether any given
ordered pair of two symbols are
adjacent in that order.
This matrix is decomposed into four submatrices:
Ann, Ant, Atn, and
Att
B. Notations for Tertiary
Relations
Similarly, we use the following notations
for the three tertiary relations:
L2nvv for the one-step leftmost2-deriving
relation that specifies whether a given
nonterminal leftmost2 derives two grammar
symbols in question or not,
R2nvv for the one-step rightmost2-deriving
relation that specifies whether a given
nonterminal rightmost2 derives two
grammar symbols in question or not, and
A3vvv for the Adjacency3
relation.
The
computations of each of the first two are given in what follows
and
the third does not need any computation as such as it can be
identified
by visual inspection of the right-hand side strings
of
all production rules of an input CFG in question.
1. L2+ntt
This array specifies whether or not a given nonterminal derives two given
terminal symbols as the first two symbols in one or more steps.
There are two cases to consider:
a. the two leftmost derived terminals are derived at the same time (as they are
the first two symbols appearing on the RHS of a production rule)
b. the two leftmost derived terminals are not derived at the same time. Instead,
a terminal and a nonterminal are the two leftmost derived symbols in that
order. And the second leftmost derived nonterminal derives a terminal as the
leftmost symbol in one or more steps.
The first case can be computed by: L*nn * L2ntt
where L*nn is the reflexive transitive closure of Lnn and the operator * is a
matrix multiplication except that 1+1 equals 1 as matrices are boolean.
In this case, however, we have a matrix (2-dimensional) and a
3-dimensional array to multiply, and we note that the intended multiplication
needs to be performed with the last index of L2ntt fixed.
That is, the multiplication will be done on two 2-dimensional arrays
(matrices) with the last index of L2ntt fixed.
On the other hand, the second case can be computed by: L2ntn * L+nt
where the multiplication needs to be done with the first index of L2ntn fixed.
Combining and adding these two cases, we have:
L2+ntt = (L*nn * L2ntt ) + (L2ntn * L+nt )
2. R2+ ntt
This array specifies whether or not a given nonterminal derives two given
terminal symbols as the last two symbols in one or more steps.
Just as in the case of L2+ ntt there are two cases to consider:
a. the two rightmost derived terminals are derived at the same time (as they are the last
two symbols appearing on the RHS of a production rule) which can be computed by:
R* nn * R2+ ntt
b. the two rightmost derived terminals are not derived at the same time.
Instead, a nonterminal and a terminal are the two rightmost derived symbols in that order.
And the first rightmost derived nonterminal of these two derives a terminal as the
rightmost symbol in one or more steps which can be computed by:
transpose(R+nt) * R2nnt
where the multiplication needs to be done with the first index of R2nnt fixed.
Combining and adding these two cases, we have:
R2+ ntt = (R* nn * R2+ ntt ) + transpose(R+nt) * R2nnt
IV. COMPUTATION OF A
TRIPLE CHECKER
We
need to consider the following seven distinct contributing cases
which
will be added (as Boolean disjunction) to get the final array TRIPLE
which
is a 3-dimensional boolean array.
Case-1: A3ttt
Clearly,
all instances of this array are instances of TRIPLEttt
Because
of our simplifying assumptions, we note that we do not have to
consider
all subarrays of the 3-dimensional arrayA3.
Case-2: A3ttn
In this case, A3ttn * L+nt contains instances of TRIPLE array where '*' denotes
the Boolean matrix multiplication and the multiplication needs to be done with the
first index of A3ttn fixed.
Case-3: A3ntt
In this case, transpose(R+nt) * A3ntt contains instances of TRIPLE array where
the multiplication needs to be made with the last index of A3ntt fixed.
Case-4: A3ntn
In this case, transpose(R+nt) * A3ntn * L+nt contains instances of TRIPLE array.
Case-5:
Atn
In
this case, Atn * L2+ntt contains instances of TRIPLE array
where
'*'
denotes the boolean matrix multiplication where 1+1equals 1
and
the multiplication needs to be done with the last index of L2+ntt fixed.
Case-6:
Ant
In
this case, contributing instances
are to be obtained from
Transpose(R2+ntt)
* Ant where Transpose(R2+ntt) will
be obtained
by
making the currently first index the last one and the boolean multiplication
will
be done with the first index of the resulting transpose
fixed.
Case-7:
Ann
In
this case, we have two subcases to consider:
Transpose(R+nt) *
L2+ntt
and
Transpose(R2+ntt) *
L+nt
In
the first subcase, after the transpose is done the multiplication will
be
done with the last index of
L2+ntt
fixed.
In
the second subcase, the transpose will be obtained by making the first index
the
last one and the multiplication is done with the first index of
the
resulting transpose array fixed.
These
seven contributing instances will be all added (boolean disjunction)
to
obtain our final array TRIPLEttt
It
is clear to see that the computation is of complexity O(n*v*v) or
O(n*n*t*t), or at most O(n*n*t*t).
Clearly,
this time complexity is higher than that of a 2-token checker
that
checks just two consecutive tokens two at a time.
However,
it is also clear that checking three consecutive tokens will
detect
some syntax errors when a two-token checker fails.
The
appendix shows some partial results obtained from an input CFG
that
resembles a miniature PASCAL grammar.
For
this input grammar, the following are a few examples of pairs of
two
tokens that can appear legitimately consecutively in that order:
id
,
id
+
id
=
var id
*
id
However,
by conbining two of some of these, we get a triple of
three
illegal consecutive tokens such as:
var id
+
var id =
* id ,
* id =
And,
the partial result of TRIPLE array of Appendix A.6 shows
that
these are, in fact, illegal triples.
We
believe that while the time complexity of computing the checker is
relatively
high, its application is very efficient as it is linear in the size
of
input source program.
V.
CONCLUSION
We
presented an automatic construction of a source program syntax error
checker
based upon compositions of some useful binary and tertiary
relations
we define that can be directly computed from the production rules
of
an input Context-Free grammar
of
the source language.
It
is known that although not every feature of modern
programming
languages can be specified by a Context-Free grammar
most
features including the complete specification of tokens can
be.
We
believe that the benefits of such an error checker include
(1)
detecting
relatively simple syntax errors at an earlier stage
of
a compiling and consequently informing the programmers
of
the presence of errors equally earlier and
(2) reducing the
burden of excessive error checking for the parser.
Our
approach assumes that an input source grammar does not have an
Empty-Production-Rule
merely for the sake of simplicity.
On
the other hand, most practical programming languages do not
permit
an empty program and hence even when an
input
grammar includes an Empty-Rule we will be able to convert such grammars
into
equivalent grammars without an Empty-Rule.
And,
therefore, our simplification assumption does not lose any
generality.
REFERENCES
(1)
Aho,
Alfred V., Sethi, Ravi, and Ullman, Jeffrey D.,
Compilers:
Principles, Techniques, and Tools,
Addison Wesley, 1988.
(2)
Fischer,
Charles N. and LeBlanc, Richard J.,
Crafting
a Compiler with C,
Benjamin/Cummings Publishing, 1991.
(3)
Hopcroft,
John E. and Ullman, Jeffrey D.,
Introduction
to Automata Theory, Languages, and Computation,
2nd
edition, Addison Wesley, 2001.
(4)
Koh,
Hikyoo, A Simple Syntax Error Checker by Binary
Relations,
Proceedings
of Korea-US Science and Technology Symposium,
Chicago,
Illinois, 1999.
APPENDIX: SOME COMPUTATION
RESULTS.
A.1. INPUT CONTEXT-FREE GRAMMAR
RULES
PASCAL := Program DECL ; BLOCK .
DECL := Var IDLIST : TYPE
DECL := Var IDLIST : TYPE ; DECL
IDLIST := Id
IDLIST := Id , IDLIST
TYPE := Integer
TYPE := Real
BLOCK :=
Begin BODY End
BODY := BODY ; S
BODY := S
S := id = E
S := BLOCK
E := E + T
E := T
T := T * F
T := F
F := ( E )
F := id
A.2. GRAMMAR SYMBOLS (as numbered by
our program)
1::: PASCAL
2::: DECL
3::: IDLIST
4::: TYPE
5::: BLOCK
6::: BODY
7::: S
8::: E
9::: T
10::: F
11::: Program
12::: ;
13::: .
14::: var
15::: :
16::: id
17::: ,
18::: Integer
19::: Real
20::: Begin
21::: End
22::: =
23::: +
24::: *
25::: (
26::: )
A.3. LEFTMOST-DERIVATION MATRIX
(L+nt)
TERMS: 11
13 15 17
19 21 23
25
12 14 16
18 20 22
24 26
--------------------------------------------------
PASCAL 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
DECL 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0
IDLIST 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
TYPE
0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
BLOCK 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0
BODY 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0 0
S
0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
E
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
T
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
F
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
A.4. RIGHTMOST-DERIVATION MATRIX (
R+nt)
TERMS: 11
13 15 17
19 21 23
25
12 14 16
18 20 22
24 26
--------------------------------------------------
PASCAL 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0
DECL 0 0
0 0 0 0 0 1 1 0 0 0 0 0 0 0
IDLIST 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
TYPE
0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
BLOCK 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0
BODY 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
S
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
E
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
T
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
F
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
A.5. ADJACENCY MATRIX
(Avv matrix)
SYMBOLS: 1 3 5 7 9 11
13 15 17
19 21 23
25
2 4 6 8 10
12 14 16
18 20 22
24 26
-------------------------------------------------------------------------
PASCAL 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
DECL
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
IDLIST
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
TYPE
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BLOCK
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
BODY
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
S
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
T
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
F
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Program 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
;
0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
var
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
:
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
id
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0
,
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Integer
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Real
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Begin
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
End
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
=
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
*
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A.6. Part of TRIPLE ARRAY (with the first token of
triples being fixed at ‘*’)
TERMS: 11
13 15 17
19 21 23
25
12 14 16
18 20 22
24 26
-------------------------------------------------
Program 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
;
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Var
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
id
0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1
,
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Integer 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Real
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Begin
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
End
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
=
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
*
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0