An Automatic Compiler Syntax Error Checker

using Binary/Tertiary Relations

 

ABSTRACT

We present a simple yet powerful syntax error checker that can be automatically

constructed from an input Context-Free grammar (CFG).

This checker tests every three possibly consecutive tokens of a source computer

program for their syntactic legitimacy.

Such a checker can be incorporated into the scanner of a compiler

before the parser can perform a complete syntax checking..

We construct the checker directly from production rules of an input CFG

as a 3-dimensional boolean array by compositions of some binary and

tertiary relations that specifies whether or not three given ordered tokens

of a source language can ever legitimately appear in that order

in any program of the source language.

Although clearly such an array only represents a limited syntax requirement

that any syntactically correct program must satisfy, our checker will

provide some benefits including an earlier syntax error detection

and the reduction of a burden imposed on the parser.

 

Before presenting the paper body, some preliminary comments are in oder:

 
          Key Ideas
 
1.    Automatic Checker Generation from CFG (Context-Free Grammar) 
       Production rules.
2     Checking Syntax Correctness of every two consecutive Tokens.
               {  int   x   ,   y   ;
                  cin  >   x  ;   }
3.    Checking Syntax Correctness of every three consecutive Tokens.          
                Clearly more powerful than #2 above.
 
               Example:
                              Valid pairs:
                                              int var
                                              var <
                              Invalid triple: 
                                              int  var  < 
                 ***Note that in the above example of invalid triple it is the case that 
                 (1)     while two consecutive tokens taken as a pair look syntactically correct
(2)      all three tokens taken together as a triple do not look correct and in fact
                           they represent a syntax error in most programming languages.
 
4.    Taking advantage of some characteristics of CFG
               a.            Independent derivation capability of nonterminals.
               b.            Invariant derivation sequences.
               c.             Identification of some useful relations.
               d.            Practicability: Useful in specifying some important language elements 
                                                         such as specifying
                                              Lists of any number of identical or similar items
                                              Operator Precedence
                                              Operator Associativity
                                              Order of Language Elements

 

I.      INTRODUCTION

 

The error checking capability of a typical compiler scanner is limited to finding

illegal tokens which are character strings appearing in the source program that cannot

possibly be recognized as tokens.

 

Hence, given a sequence of legal tokens each of which is legitimate by itself,

the question of whether or not the whole sequence represents a syntactically

correct program is left as a task of a parser.

 

We showed (4) an automatic construction of a simple error checker that tests

every two possibly consecutive tokens for their syntactic legitimacy

to be used by the scanner while recognizing individual tokens at the same time.

 

This paper is to extend above construction by considering three consecutive

tokens instead. While this extension appears to be rather straightforward,

the work involved turns out to be a bit substantial as we now have to

additionally consider some tertiary relations.

 

II.                ASSUMTIONS AND DEFINITIONS

 

Our basic approach is to automatically construct a 3-dimensional boolean

array of size t-by-t-by-t where t is the number of terminal symbols

of the input grammar defining the source language.

 

Each array element, TRIPLE[i, j, k], is true if and only if the i-th terminal,

the j-th terminal and k-th terminal of an input grammar can possibly

appear consecutively in that order in at least one syntactically correct program.

 

Before presenting the automatic and direct construction of this checker,

we need to define some simplifying assumptions and

useful relations on grammar symbols.

 

A. Definitions of Some Special Nonterminals

 

1.  A Single-terminal-deriving nonterminal

 

            A nonterminal is single-terminal-deriving if it can derive a string consisting

            only of a single terminal and no nonterminal at all, possibly among other

            strings.

 

2.  A purely-single nonterminal

 

            A single-terminal-deriving nonterminal is purely-single if it does not

            derive any string consisting of two or more terminals.

            That is, those nonterminals that can derive only one terminal and nothing more.

 

B. Simplifying Assumptions

 

1.  The input CFG has no empty-production-rule.

 

This assumption does not lose any generality for all practical purposes

as it is shown that (1,2,3) any CFG with empty-production-rule(s)

can be converted to an equivalent CFG with no such production rule

unless the grammar derives an empty string.

 

2.  Each nonterminal derives at least two terminals.

 

This assumption will very easily be satisfied if we introduce an additional

production rule to the grammar which we obtain by replacing

a single-terminal-deriving nonterminal appearing on the righthand

side (RHS) of a production rule with the terminal this nonterminal derives.

Furthermore, if this single-terminal-deriving nonterminal is purely-single,

the production rule can be deleted altogether from the grammar essentially

making the newly introduced production rule merely replace it.

For example, suppose that a nonterminal N is single-terminal-deriving

and it derives a string consisting only of a terminal t and consisting

of nothing else.

And suppose that the grammar has the following production rule

that has this nonterminal N appear on its RHS:                      M ->  p N q

So, in this case a newly introduced and added rule is:           M -> p t q

And, if this nonterminal N is purely-single, then above existing rule

of the grammar will have to be deleted from the grammar.

Of course, this introduction of a new rule must be done for every such

one-terminal string that this nonterminal N does derive,

not just the one mentioned above.

 

C.  Definitions of Binary Relations

 

1.   Leftmost-deriving

 

We say that the lefthand side nonterminal of a production rule leftmost-derives

the first symbol on the righthand side of the rule.

This relation certainly is transitive, as if a nonterminal N leftmost-derives

a nonterminal B which, in turn, leftmost-derives C, then the nonterminal N

leftmost-derives C (of course, taking one more step than the nonterminal B does).

 

2   Rightmost-deriving

 

Similarly, we define the relation of "Rightmost-deriving".

And, it is transitive and hence the number of necessary steps is immaterial

just like the above relation "Leftmost-deriving."

 

3   Adjacency

 

We say that two grammar symbols are adjacent if they appear

consecutively somewhere on the righthand side of at least

one production rule.

And, this relation is not transitive, unlike the other two.

 

D.   Definitions of Tertiary Relations

 

1.  Leftmost2-deriving

 

This tertiary relation is a natural extension of the Leftmost-deriving relation

of C(1) above as we say that the nonterminal on the lefthand side (LHS)

of a production rule leftmost2-derives the first two symbols

(and hence the two leftmost symbols) of the RHS string of the rule.

 

2.  Rightmost2-deriving

 

We similarly define this tertiary relation as the natural extension of

Rightmost-deriving relation of C(2) above     .

 

3.  Adjacency3

 

We say that three grammar symbols are adjacent3 if they appear

consecutively in that order at least once somewhere on the RHS

of any production rule of the grammar.

And, this tertiary relation is a natural extension of the Adjacency

relation of C(3) above.

 

III.     SOME NOTATIONS FOR RELATIONS

 

In efficiently representing various relations, we use following boolean matrices

and arrays with two suitable subscripts. 

These subscripts are:

            t:  representing the number of terminal symbols (and equivalently their set).

            n: representing the number of nonterminal symbols (and equivalently their set).

            v: representing the number of all grammar symbols, v=t+n,

    (and equivalently their set).

           

A.  Notations for Binary Relations.

 

For the aforementioned three binary relations, we use the following notations.

           

Lnv:  This matrix represents whether a certain symbol (any symbol)

        can be leftmost-derived by a certain nonterminal in exactly one step.

        We further decompose this matrix into two submatrices:

             Lnn,  for that portion for nonterminals deriving nonterminals

                     and

             Lnt ,  for that portion for nonterminals deriving terminals.

 

Rnv:  This matrix represents whether a certain symbol (any symbol)

        can be rightmost-derived by a certain nonterminal in exactly one step.

        We also decompose this matrix just like the one above into:

         Rnn and Rnt, similarly.

 

Avv:  This matrix represents whether any given ordered pair of two symbols are

         adjacent in that order.

         This matrix is decomposed into four submatrices:

                 Ann, Ant, Atn, and Att

 

B.   Notations for Tertiary Relations

      

 Similarly, we use the following notations for the three tertiary relations:

 L2nvv  for the one-step leftmost2-deriving relation that specifies whether a given 

 nonterminal leftmost2 derives two grammar symbols in question or not,

 R2nvv  for the one-step rightmost2-deriving relation that specifies whether a given

 nonterminal rightmost2 derives two grammar symbols in question or not, and

 A3vvv   for the Adjacency3 relation.

The computations of each of the first two are given in what follows

and the third does not need any computation as such as it can be

identified by visual inspection of the right-hand side strings

of all production rules of an input CFG in question.

 
1.   L2+ntt   
This array specifies whether or not a given nonterminal derives two given 
terminal symbols as the first two symbols in one or more steps. 
There are two cases to consider:
a.             the two leftmost derived terminals are derived at the same time (as they are 
               the first two symbols appearing on the RHS of a production rule)
b.            the two leftmost derived terminals are not derived at the same time. Instead, 
               a terminal and a nonterminal are the two leftmost derived symbols in that 
               order. And the second leftmost derived nonterminal derives a terminal as the 
               leftmost symbol in one or more steps.
 
The first case can be computed by:      L*nn  *   L2ntt 
   where L*nn is the reflexive transitive closure of Lnn  and the operator * is a 
   matrix multiplication except that 1+1 equals 1 as matrices are boolean. 
In this case, however, we have a matrix (2-dimensional) and a 
3-dimensional array to multiply, and we note that the intended multiplication 
needs to be performed with the last index of L2ntt  fixed.   
That is, the multiplication will be done on two 2-dimensional arrays 
(matrices) with the last index of  L2ntt  fixed.
On the other hand, the second case can be computed by:     L2ntn * L+nt       
    where the multiplication needs to be done with the first index of L2ntn fixed.
Combining and adding these two cases, we have:
               L2+ntt =  (L*nn  *  L2ntt ) +  (L2ntn * L+nt )  
               
2.  R2+ ntt
This array specifies whether or not a given nonterminal derives two given 
terminal symbols as the last two symbols in one or more steps. 
Just as in the case of L2+ ntt   there are two cases to consider: 
a.    the two rightmost derived terminals are derived at the same time (as they are the last 
       two symbols appearing on the RHS of a production rule) which can be computed by:   
            R* nn   *   R2+ ntt
b.    the two rightmost derived terminals are not derived at the same time. 
       Instead, a nonterminal and a terminal are the two rightmost derived symbols in that order. 
       And the first rightmost derived nonterminal of these two derives a terminal as the 
       rightmost symbol in one or more steps which can be computed by:  
                               transpose(R+nt) * R2nnt     
       where the multiplication needs to be done with the first index of  R2nnt  fixed.
Combining and adding these two cases, we have:  
               R2+ ntt  =    (R* nn   *   R2+ ntt )   +   transpose(R+nt) * R2nnt                                              

 

IV.     COMPUTATION OF A TRIPLE CHECKER

 

We need to consider the following seven distinct contributing cases

which will be added (as Boolean disjunction) to get the final array TRIPLE

which is a 3-dimensional boolean array.  

 

Case-1:    A3ttt  

Clearly, all instances of this array are instances of TRIPLEttt     

Because of our simplifying assumptions, we note that we do not have to

consider all subarrays of the 3-dimensional arrayA3.

 

Case-2:    A3ttn
In this case, A3ttn  * L+nt  contains instances of TRIPLE array where '*' denotes
the Boolean matrix multiplication and the multiplication needs to be done with the
first index of A3ttn  fixed.
 
Case-3:    A3ntt       
In this case, transpose(R+nt) * A3ntt  contains instances of TRIPLE array where
the multiplication needs to be made with the last index of A3ntt  fixed.
 
Case-4:    A3ntn
In this case, transpose(R+nt) * A3ntn * L+nt contains instances of TRIPLE array.
 

 

 

Case-5:            Atn      

In this case,  Atn  *   L2+ntt   contains instances of TRIPLE array where 

'*' denotes the boolean matrix multiplication where 1+1equals 1

and the multiplication needs to be done with the last index of  L2+ntt   fixed.

 

Case-6:            Ant

In this case,  contributing instances are to be obtained from

Transpose(R2+ntt) * Ant where Transpose(R2+ntt) will be obtained

by making the currently first index the last one and the boolean multiplication

will be done with the first index of the resulting transpose fixed.

 

Case-7:            Ann

In this case, we have two subcases to consider:

            Transpose(R+nt) * L2+ntt     and

            Transpose(R2+ntt) * L+nt

In the first subcase, after the transpose is done the multiplication will

be done with the last index of  L2+ntt  fixed.

In the second subcase, the transpose will be obtained by making the first index

the last one and the multiplication is done with the first index of

the resulting transpose array fixed.      

 

These seven contributing instances will be all added (boolean disjunction)

to obtain our final array TRIPLEttt   

 

It is clear to see that the computation is of complexity O(n*v*v) or 

O(n*n*t*t),  or at most O(n*n*t*t).

Clearly, this time complexity is higher than that of a 2-token checker

that checks just two consecutive tokens two at a time.

However, it is also clear that checking three consecutive tokens will

detect some syntax errors when a two-token checker fails.

 

The appendix shows some partial results obtained from an input CFG

that resembles a miniature PASCAL grammar.

For this input grammar, the following are a few examples of pairs of

two tokens that can appear legitimately consecutively in that order:                                    

id     ,

                                    id    +

                                    id     =

                                    var  id

                                    *     id

 

However, by conbining two of some of these, we get a triple of

three illegal consecutive tokens such as:        

var  id  +

                                    var  id  =

                                    *     id   ,

                                    *     id   =

 

And, the partial result of TRIPLE array of Appendix A.6 shows

that these are, in fact, illegal triples. 

We believe that while the time complexity of computing the checker is

relatively high, its application is very efficient as it is linear in the size

of input source program.

 

V.     CONCLUSION

 

We presented an automatic construction of a source program syntax error

checker based upon compositions of some useful binary and tertiary

relations we define that can be directly computed from the production rules

of an input Context-Free grammar

of the source language.

It is known that although not every feature of modern

programming languages can be specified by a Context-Free grammar

most features including the complete specification of tokens can be.

 

We believe that the benefits of such an error checker include

(1)   detecting relatively simple syntax errors at an earlier stage

of a compiling and consequently informing the programmers

of the presence of errors equally earlier and

      (2) reducing the burden of excessive error checking for the parser.

 

Our approach assumes that an input source grammar does not have an

Empty-Production-Rule merely for the sake of simplicity.

On the other hand, most practical programming languages do not

permit an empty program and hence even when an

input grammar includes an Empty-Rule we will be able to convert such grammars

into equivalent grammars without an Empty-Rule.

And, therefore, our simplification assumption does not lose any generality.

 

 

REFERENCES

 

(1)               Aho, Alfred V., Sethi, Ravi, and Ullman, Jeffrey D.,

Compilers: Principles, Techniques, and Tools, Addison Wesley, 1988.

(2)               Fischer, Charles N. and LeBlanc, Richard J.,

Crafting a Compiler with C, Benjamin/Cummings Publishing, 1991.

(3)               Hopcroft, John E. and Ullman, Jeffrey D.,

Introduction to Automata Theory, Languages, and Computation,

2nd edition, Addison Wesley, 2001.

(4)               Koh, Hikyoo,  A Simple Syntax Error Checker by Binary Relations,

Proceedings of Korea-US Science and Technology Symposium,

Chicago, Illinois, 1999.

 

 

APPENDIX:  SOME COMPUTATION RESULTS.

 

A.1.   INPUT CONTEXT-FREE GRAMMAR RULES

 

                  PASCAL :=  Program DECL ; BLOCK .                

                  DECL   :=  Var IDLIST : TYPE                     

                  DECL   :=  Var IDLIST : TYPE ; DECL              

                  IDLIST :=  Id                                    

                  IDLIST :=  Id , IDLIST                            

                  TYPE   :=  Integer                               

                  TYPE   :=  Real                                  

                  BLOCK  :=  Begin BODY End                        

                  BODY   :=  BODY ; S                               

                  BODY   :=  S                                     

                  S      :=  id = E                               

                  S      :=  BLOCK                                 

                  E      :=  E + T                                  

                  E      :=  T                                     

                  T      :=  T * F                                 

                  T      :=  F                                     

                  F      :=  ( E )                                  

                  F      :=  id                                    

 

 

 

A.2.   GRAMMAR SYMBOLS (as numbered by our program)

 

                           1:::  PASCAL                                   

                           2:::  DECL      

                           3:::  IDLIST   

                           4:::  TYPE     

                           5:::  BLOCK    

                           6:::  BODY     

                           7:::  S        

                           8:::  E        

                           9:::  T        

                          10:::  F        

                          11:::  Program  

                          12:::  ;        

                          13:::  .        

                          14:::  var      

                          15:::  :        

                          16:::  id       

                          17:::  ,        

                          18:::  Integer  

                          19:::  Real     

                          20:::  Begin    

                          21:::  End      

                          22:::  =                                    

                          23:::  +        

                          24:::  *        

                          25:::  (        

                          26:::  )        

 

 

A.3.   LEFTMOST-DERIVATION MATRIX (L+nt)

 

 TERMS:  11  13  15  17  19  21  23  25

                     12  14  16  18  20  22  24  26

 --------------------------------------------------

 PASCAL   1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 DECL        0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

 IDLIST      0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

 TYPE         0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0

 BLOCK     0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

 BODY       0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0

 S                0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0

 E                0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0

 T                0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0

 F                0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0

 

 

A.4.   RIGHTMOST-DERIVATION MATRIX ( R+nt) 

 

 TERMS:  11  13  15  17  19  21  23  25

                     12  14  16  18  20  22  24  26

 --------------------------------------------------

 PASCAL   0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

 DECL        0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0

 IDLIST      0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

 TYPE         0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0

 BLOCK     0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

 BODY       0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

 S                0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

 E                0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

 T                0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

 F                0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

 

 

A.5.    ADJACENCY MATRIX (Avv matrix)

 

 SYMBOLS:  1   3   5   7   9  11  13  15  17  19  21  23  25

                          2   4   6   8  10  12  14  16  18  20  22  24  26

 -------------------------------------------------------------------------

 PASCAL      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 DECL           0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 IDLIST         0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

 TYPE            0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 BLOCK         0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

 BODY           0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0

 S                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 E                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1

 T                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

 F                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Program        0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 ;                     0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 .                     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 var                0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 :                     0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 id                   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0

 ,                     0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Integer           0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Real               0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Begin             0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 End                0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 =                    0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 +                    0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 *                    0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 (                     0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 )                     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 


 

A.6.   Part of  TRIPLE ARRAY (with the first token of triples being fixed at ‘*’) 

 

 TERMS:  11  13  15  17  19  21  23  25

                     12  14  16  18  20  22  24  26

 -------------------------------------------------

 Program     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 ;                  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 .                  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Var             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 :                  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 id                0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1

 ,                  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Integer        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Real            0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 Begin          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 End             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 =                 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 +                 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 *                 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 (                  0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0

 )                  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0