LEXICAL ELEMENTS OF PROGRAMMING LANGUAGES
-
The lexical elements of a programming language consist of the following:
-
the character set;
-
the rules for grouping characters into words (or lexems);
-
the use of reserved words or keywords;
-
the manner in which blanks and line termination characters are handled;
-
How comments are written.
-
Character set (vocabulary /alphabet):
Two approaches are taken when deciding on the character set of a programming language:
-
Choose all the characters deemed necessary
Examples: APL, ALGOL 90
Drawbacks:
- use special I/O equipments or
- make changes to the published language when it is used on a computer.
-
Choose only the characters commonly available with current I/O devices
Examples: FORTRAN, COBOL, PASCAL
- The most commonly used character sets are:the ASCII character set and the EBCDIC character set.
-
Nonprintable characters may be represented by their decimal equivalent or some other form
Examples: escape sequences in C, and EOF for end of file character.
-
In the definition of programming languages, characters that are treated identically by the compiler are often grouped into classes.
Examples:
Digits = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
Letters = (A | B | C | ... | Z)
White spaces = blanks, tabs, carriage return . . . etc
-
A class may consist of a single character.
Examples:
PLUS is +
EQUAL is =
-
Any character not contained in a class is ignored by the scanner.
-
It is also possible to define a class of illegal characters to force the scanner to produce an error message when an illegal character is encountered.
-
White Spaces:
- The class name given to spaces (blanks), tabs, carriage return, new line, and form feed.
- The ways White spaces are handled by programming languages vary.
- Some languages (C language) use them only to indicate where lexems start and end.
-
: any sequence of characters used to clarify a program code.
-
Tokens:
- Are the classes (sets) of words (lexems) that are allowed in the text of a program written in a programming language.
- The following are the tokens in most programming languages:
Identifiers: are names given to the programming language objects by the programmer.
Examples: num1, _sum in the C/C++ language.
Keywords: are words with an assigned meaning in a language
Examples: main, auto, if, while in the C/C++ language
Reserved word: words that should not be used as a user’s identifier.
In C/C++, every keyword is a reserved word; but in languages such as FORTRAN, keywords REAL and INTEGER are not reserved words.
Constants: are words representing fixed numeric or character values:
Examples: integer constants,
character constants,
real constants.
String-literals: Examples: quoted strings in the C/C++ language.
Operators: Examples: + (plus), - (minus), * (multiplication), / (division), % (modulus),
= (Assignment) in the C/C++ language
Punctuators: also known as separators.
Examples: { } [ ] , : ; = ( ) * # in the C/C++ language.
-
The rules for grouping characters (in a program source code) into words (lexems).
These rules are usually specified using
regular expressions or
context-free grammars
Strings
-
Given an alphabet V, a string over V is a finite sequence of characters of V.
Examples
-
01101 , 1010001 , 1111, 0 are strings over the alphabet {0 , 1}
-
Jo , sum , number are strings over the roman alphabet
-
The string with no symbol is called the empty string or null string
It is denoted by: or
-
The concatenation of strings s and t (over the same alphabet) is denoted by:
s.t or st
It is the string consisting of the characters of string s followed by those of string t.
Examples: 011.001 = 011001;
jo.ann = joann
-
Given a string w and a natural number n ≥ 0,
w0 = and
wn = w.w.w.w . . . w (n times) for (n >0)
Examples:
do0 =
do1 = do
do3 = dododo
Operations on Sets of Strings (Languages)
L , L1, and L2 are languages over the same alphabet.
-
The union of L1 and L2 Denoted by L1 L2 or L1 | L2
is the language that contains all the strings that belong in either L1 or L2
is the language that contains all the strings that belong to both L1 and L2
-
The concatenation of L1 and L2 Denoted by L1.L2 or L1L2
is the language that contains all the strings of the form uv where u belongs to L1 and v belongs to L2 .
-
The complement of L denoted by Not(L)
is the language that contains all the strings that do not belong to L.
-
The closure or Kleene star of L denoted by L*
is the language obtained by concatenation of zero, one or more strings of L.
L* = { } L L2 L3 . . . .
Note L+ = LL* = L L2 L3 . . . .
Examples
Let S = {a, aa, aaa } ,
T = {bb, bbb} ,
P = { ab, bb, ba}
S T = {a, aa, aaa, bb, bbb}
T P = { bb }
S.T= {abb, abbb, aabb, aabbb, aaabb, aaabbb}
P*= {, ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . }
P+ = {ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . }
Regular Expressions
-
Regular expressions are formal systems that can be used to specify a class of languages (set of strings) called regular sets
-
They are defined as follows:
is a regular expression denoting the empty set
is a regular expression denoting the set { }
Every character a in the alphabet is a regular expression denoting the set {a}
If A and B are regular expressions, then:
AB is a regular expression denoting the set L(A).L(B)
A | B is a regular expression denoting the set L(A) | L(B)
A* is a regular expression denoting the set L(A)*
Examples: for the following examples, the alphabet is { 0 , 1 }
-
( 0 | 1)* defines the set of all strings of 0's and 1's
-
(0 | 1)*0(0 | 1)* defines the set of all strings of 0's and 1's with at least one 0.
-
1*01*01* defines the set of all strings with exactly two 0's
-
0*1* = defines the set of all strings consisting of the NULL string and any string of 0's followed by a string of 1's.
Notes:
-
Parentheses may be used to clarify the meaning of a regular expression
-
( , ) , | , * , and are called meta-symbols of the language. If one of these symbols is in the alphabet of a language, it must be quoted.
-
Any finite set of string { s1 , s2 , s3 , s4 , . . . , sk } Can be represented by the regular expression s1 | s2 | s3 | s4 | . . . |sk
Examples:
The set of digits { 0 , 1, 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 } is represented by D = O|1|2|3|4|5|6|7|8|9
The set { 00 , 11 } is represented by P = 00|11
Notations
If A is a regular expression, then:
- A+ = A A*
- Not(A) is the regular expression denoting the language Not(L(A))
- Ak is the regular expression representing the language L(A)k
Exercises
-
Give regular expressions for the following languages:
-
Binary strings ending in 01.
-
Decimal integers divisible by 5
-
Binary strings consisting of either an odd number of 1s or an odd number of 0s.
-
Binary strings containing the string 010
-
binary strings that do not contain the string 010
2. Given the alphabet = { a, b}. Give the regular expressions for the following languages:
-
all strings with no more that three a’s
-
all strings with the number of a’s divisible by three.
-
all strings with exactly one occurrence of the substring aaa
3. Which of the following are true? Explain.
-
baa a* b* a* b*
-
(b* a*) (a* b*) = a* | b*
-
(a* b*) (b* a*) =
-
abcd (a (cd)* b)*
4. What is the language represented by each of the following regular expressions?
-
((a* a) b) | b
-
(1 | 01)+
-
1 (0 |1)* 0
-
(0 | 1)* 011(0 | 1)*
-
Which of the following strings belong to the language described by the regular expression 0*|(0*10*10*10*)*?
-
1001101 f) 1001001
-
000 g) 111
-
110101 h) 11101101
-
10001000 i) 11000101
-
011001000111
Using regular Expressions to Specify Tokens and Other Words of a Programming Language
Let D = 0|1|2|3|4|5|6|7|8|9 and L = A|B|C|D|E|F| . . . |Z
-
A positive or negative integer. Int = ( | + | -)D+
-
A fixed point decimal constant that requires a digit on both side of the decimal point:
Lit = D+.D+
-
A Pascal language real constant is either a string of digits or fixed point decimal constant:
Pascal-Real = D+ | ( D+ . D+)
= D+( | .D+)
-
A C language floating point constant is a string of digits or requires digits on either side of the decimal point:
C-Float = D+ | (D+ .) | (. D+) | (D+ . D+)
= (D+(| .)) | (D* . D+)
-
A Pascal language identifier is composed of letters and digits, with the first character a letter:
Pascal-ID = L+(L | D)*
= L (L | D)*
-
A C language identifier is composed of letters, digits and the underscore, with the first character a letter or the underscore:
C-ID = (L | _ )(L | _ | D)*
-
A C language comment C-Comment = /’*’(Not(‘*’/))*’*’/
-
A comment that begins with // and ends with the end of the line (EOL)
Comment = //(Not(EOL)*EOL
-
An identifier composed of letters, digits and the underscore, begin with a letter, end with a letter or a digit, and contain no consecutive underscores
ID = L (L | D | _ (L | D))*
= L(L | D)* ( _ (L | D)+)*)
Note:
The set { [i ]i | i 1 } = { [ ] , [[ ]] , [[[ ]]] , . . . }
of balanced brackets is not regular: It cannot be specified using a regular expression.
Exercise
Note: You may use the following regular expressions D, Z, and L to write your answers.
D = (0 | 1 | 2 | 3 | 4 | . . . | 9), Z = (1 | 2 | 3 | 4 | . . . | 9), and L = (A | B | C | . . . | Z).
-
Give a regular expression that describes the set of identifiers consisting of the letters and digits, and that start with a letter, and end with a digit.
-
Give a regular expression that describes the set of identifiers consisting of the letters, the decimal digits, and the underscore characters, and that start with a letter, and with no consecutive underscore.
-
Give a regular expression that describes the multiples of 100.
-
Give a regular expression that describes the set of identifiers composed of letters, digits, and the underscore, that begins with a letter or the underscore, and end with a letter.
-
Give a regular expression that describes C-like fixed-decimal constants with no superfluous leading or trailing zeros. (Note that a digit is not required on either side of the decimal point). That is 0.0, .25, 30. , 123.01, and 123005.0 are legal, but 00.0, 001.000 and 0002345.100 are illegal.
-
Describe in English the language generated by the regular expression: L+ ( D | L | _(D | L) )* D.
Language Recognition Devices
-
A language recognition device for a language (set of strings) described using a regular expression is an algorithm such that given the characters of a string s (one at a time from left to right) as input, it outputs "yes" if the string s belongs to the language, and "no" otherwise.
-
It is in general specified using Finite state automata, or implemented in the code of a program.
-
It is sometime necessary to add to a language recognition device the possibility to reconstruct the string from the characters read from the input.
C++ functions that you may need to write language recognition functions
int isalnum( int ch): returns TRUE if the character ch is a letter or digit and FALSE otherwise.
int isalpha( int ch): returns TRUE if the character ch is a letter and FALSE otherwise.
int isdigit( int ch): returns TRUE if the character ch is a digit and FALSE otherwise.
int isspace(int ch): returns TRUE if the character ch is a white space, and FALSE otherwise.
These functions are declared in the header file ctype.h.
Other Functions: variable ch is defined as follows: char ch; or int ch;
Function Call Description
cin.get(ch); or ch = cin.get( ); Reads the next character from the input stream into variable ch.
cin.putback(ch); Places the character in variable ch back into the input stream so that it becomes the next input character.
cin.peek (ch); Reads the next character from the input stream and assigns it into variable ch; but this character is not removed from the input stream: it will be again the next input character.
cout.put(ch); Output the character in variable ch to the output stream.
These functions are defined in the header file iostream.
int strcmp(char *strg1, char *strg2) Returns a positive value if strg1 > strg2
Returns a negative value if strg1 < strg2
Returns 0 if strg1 = strg2.
This function is defined in the header file string.h
BUFFER:
You need a buffer in your program to hold the lexeme currently being recognized by the scanner. Assume that 79 characters is the maximum length of lexemes in your programming language. The buffer must be a global variable since it must be accessed by all the language recognition devices.
For your C++ program, define the buffer as follows: char tokenBuffer[80];
REMARKS
-
The fact that reserved words look like identifiers in most programming languages complicates the specification of the tokens of a language using regular expressions.
-
A simple solution is to treat reserved words as identifiers in the specification by regular expressions, and have a table of reserved words that is searched every time an identifier is recognized. If the identifier is in the table, then it is a reserved word; otherwise it is an identifier.
Application
A language recognition device for a token repeatedly reads a character from the input stream (into an integer variable), and if the character read is part of the lexeme that it is building, it copies it into the token buffer; otherwise it does the following:
-
Use the putback(char ch) function to put that character back into the input stream.
-
Returns the code of that token if the word is recognized or the invalid code otherwise.
Example 1: FA for the regular expression: L(L|D)* _ D*
tokenType recognizer1 ( )
{
tokenType code;
int ch, i = 0;
ch = cin.get( );
if( isalpha( ch ))
{
tokenBuffer[ i++ ] = ch;
ch = cin.get( );
while( isalnum( ch ))
{
tokenBuffer[ i++ ] = ch;
ch = cin.get( );
}
if( ch == ‘_’ )
{
tokenBuffer[ i++ ] = ch;
ch = cin.get( );
while( isdigit( ch ))
{
tokenBuffer[ i++ ] = ch;
ch = cin.get( );
}
cin.putback( ch );
code = valid;
}
else // it is not an underscore
{
cin.putback( ch );
code = invalid;
}
}
else // it is not a letter
{
cin.putback( ch );
code = invalid;
}
return( code );
}
Example 2: FA for the regular expression: %+Not($#)* $#
tokenType recognizer2 ( )
{
tokenType code;
int ch, i = 0;
ch = cin.get( );
if( ch == ‘\%’ && cin.peek( ) == ‘+’)
{
tokenBuffer[ i++ ] = ch; // store the first symbol into the buffer
// read the second symbol and store it into the buffer
ch = cin.get( );
tokenBuffer[ i++ ] = ch;
ch = cin.get( );
while( !( ch == ‘$’ && cin.peek( ) == ‘#’ ) && cin.peek( ) != EOF)
{
tokenBuffer[ i++ ] = ch; // store the first symbol into the buffer
// read the second symbol and store it into the buffer
ch = cin.get( );
}
if(cin.peek( ) == EOF)
code = invalid;
else
{
tokenBuffer[ i++ ] = ch; // store the first symbol into the buffer
// read the second symbol and store it into the buffer
ch = cin.get( );
tokenBuffer[ i++ ] = ch;
code = valid;
}
}
else // it is not “%+”
{
cin.putback( ch );
code = invalid;
}
return( code );
}
Exercise
Provide the FA for each of the following regular expressions:
-
Identifiers: L(L | D)*
-
Comments: (C language comments) /’*’ Not(‘*’/)* ‘*’/
-
Real constants: D+ . D+
-
String constants: “ Not(“)* “
Write the C++ function that corresponds to the FA in question 1.
Share with your friends: |