Lexical elements of programming languages

Download 71.89 Kb.

Date	29.01.2017
Size	71.89 Kb.
	#11293

LEXICAL ELEMENTS OF PROGRAMMING LANGUAGES

The lexical elements of a programming language consist of the following:

Character set (vocabulary /alphabet):

Two approaches are taken when deciding on the character set of a programming language:

Choose all the characters deemed necessary

Examples: APL, ALGOL 90

Drawbacks:

- use special I/O equipments or

- make changes to the published language when it is used on a computer.

Choose only the characters commonly available with current I/O devices

Examples: FORTRAN, COBOL, PASCAL

- The most commonly used character sets are:the ASCII character set and the EBCDIC character set.

Nonprintable characters may be represented by their decimal equivalent or some other form

Examples: escape sequences in C, and EOF for end of file character.

In the definition of programming languages, characters that are treated identically by the compiler are often grouped into classes.

Examples:

Digits = (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)

Letters = (A | B | C | ... | Z)

White spaces = blanks, tabs, carriage return . . . etc

A class may consist of a single character.

Examples:

PLUS is +

EQUAL is =

Any character not contained in a class is ignored by the scanner.
It is also possible to define a class of illegal characters to force the scanner to produce an error message when an illegal character is encountered.

White Spaces:

- The class name given to spaces (blanks), tabs, carriage return, new line, and form feed.

- The ways White spaces are handled by programming languages vary.

- Some languages (C language) use them only to indicate where lexems start and end.

Comments: any sequence of characters used to clarify a program code.
Tokens:

- Are the classes (sets) of words (lexems) that are allowed in the text of a program written in a programming language.

- The following are the tokens in most programming languages:

Identifiers: are names given to the programming language objects by the programmer.

Examples: num1, _sum in the C/C++ language.

Keywords: are words with an assigned meaning in a language

Examples: main, auto, if, while in the C/C++ language

Reserved word: words that should not be used as a user’s identifier.

In C/C++, every keyword is a reserved word; but in languages such as FORTRAN, keywords REAL and INTEGER are not reserved words.

Constants: are words representing fixed numeric or character values:

Examples: integer constants,

character constants,

real constants.

String-literals: Examples: quoted strings in the C/C++ language.

Operators: Examples: + (plus), - (minus), * (multiplication), / (division), % (modulus),

= (Assignment) in the C/C++ language

Punctuators: also known as separators.

Examples: { } [ ] , : ; = ( ) * # in the C/C++ language.

The rules for grouping characters (in a program source code) into words (lexems).

These rules are usually specified using

regular expressions or

context-free grammars

Strings

Given an alphabet V, a string over V is a finite sequence of characters of V.

Examples

01101 , 1010001 , 1111, 0 are strings over the alphabet {0 , 1}
Jo , sum , number are strings over the roman alphabet

The string with no symbol is called the empty string or null string

It is denoted by:  or 

The concatenation of strings s and t (over the same alphabet) is denoted by:

s.t or st

It is the string consisting of the characters of string s followed by those of string t.

Examples: 011.001 = 011001;

jo.ann = joann

Given a string w and a natural number n ≥ 0,

w⁰ =  and

wⁿ = w.w.w.w . . . w (n times) for (n >0)

Examples:

do⁰ = 

do¹ = do

do³ = dododo

Operations on Sets of Strings (Languages)

L , L₁, and L₂ are languages over the same alphabet.

The union of L₁ and L₂ Denoted by L₁  L₂or L₁ | L₂

is the language that contains all the strings that belong in either L₁ or L₂

The intersection of L₁ and L₂ Denoted by L₁ L₂

is the language that contains all the strings that belong to both L₁ and L₂

The concatenation of L₁ and L₂ Denoted by L₁.L₂or L₁L₂

is the language that contains all the strings of the form uv where u belongs to L₁ and v belongs to L₂ .

The complement of L denoted by Not(L)

is the language that contains all the strings that do not belong to L.

The closure or Kleene star of L denoted by L^*

is the language obtained by concatenation of zero, one or more strings of L.

L^* = {  }  L  L²  L³  . . . .

Note L⁺ = LL^* = L  L²  L³  . . . .

Examples

Let S = {a, aa, aaa } ,

T = {bb, bbb} ,

P = { ab, bb, ba}

S  T = {a, aa, aaa, bb, bbb}

T  P = { bb }

S.T= {abb, abbb, aabb, aabbb, aaabb, aaabbb}

P*= {, ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . }

P+ = {ab, bb, ba, abab, abbb, abba, bbab, bbbb, bbba, baab, babb, baba, . . . }

Regular Expressions

Regular expressions are formal systems that can be used to specify a class of languages (set of strings) called regular sets
They are defined as follows:

 is a regular expression denoting the empty set

 is a regular expression denoting the set {  }

Every character a in the alphabet is a regular expression denoting the set {a}

If A and B are regular expressions, then:

AB is a regular expression denoting the set L(A).L(B)

A | B is a regular expression denoting the set L(A) | L(B)

A^* is a regular expression denoting the set L(A)^*

Examples: for the following examples, the alphabet is { 0 , 1 }

( 0 | 1)* defines the set of all strings of 0's and 1's
(0 | 1)*0(0 | 1)* defines the set of all strings of 0's and 1's with at least one 0.
1*01*01* defines the set of all strings with exactly two 0's
0*1* = defines the set of all strings consisting of the NULL string  and any string of 0's followed by a string of 1's.

Notes:

Parentheses may be used to clarify the meaning of a regular expression
( , ) , | , * , and  are called meta-symbols of the language. If one of these symbols is in the alphabet of a language, it must be quoted.
Any finite set of string { s₁ , s₂ , s₃ , s₄ , . . . , s_k } Can be represented by the regular expression s₁ | s₂ | s₃ | s₄ | . . . |s_k

Examples:

The set of digits { 0 , 1, 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 } is represented by D = O|1|2|3|4|5|6|7|8|9

The set { 00 , 11 } is represented by P = 00|11

Notations

If A is a regular expression, then:

- A⁺= A A^*

- Not(A) is the regular expression denoting the language Not(L(A))

- A^k is the regular expression representing the language L(A)^k

Exercises

Give regular expressions for the following languages:

Binary strings ending in 01.
Decimal integers divisible by 5
Binary strings consisting of either an odd number of 1s or an odd number of 0s.
Binary strings containing the string 010
binary strings that do not contain the string 010

2. Given the alphabet  = { a, b}. Give the regular expressions for the following languages:

all strings with no more that three a’s
all strings with the number of a’s divisible by three.
all strings with exactly one occurrence of the substring aaa

3. Which of the following are true? Explain.

baa  a* b* a* b*
(b* a*)  (a* b*) = a* | b*
(a* b*)  (b* a*) = 
abcd  (a (cd)* b)*

4. What is the language represented by each of the following regular expressions?

((a* a) b) | b
(1 | 01)+
1 (0 |1)* 0
(0 | 1)* 011(0 | 1)*

Which of the following strings belong to the language described by the regular expression 0*|(0*10*10*10*)*?

1001101 f) 1001001
000 g) 111
110101 h) 11101101
10001000 i) 11000101
011001000111

Using regular Expressions to Specify Tokens and Other Words of a Programming Language

Let D = 0|1|2|3|4|5|6|7|8|9 and L = A|B|C|D|E|F| . . . |Z

A positive or negative integer. Int = ( | + | -)D⁺
A fixed point decimal constant that requires a digit on both side of the decimal point:

Lit = D⁺.D⁺

A Pascal language real constant is either a string of digits or fixed point decimal constant:

Pascal-Real = D⁺| ( D⁺. D⁺)

= D⁺( | .D⁺)

A C language floating point constant is a string of digits or requires digits on either side of the decimal point:

C-Float = D⁺| (D⁺.) | (. D⁺)| (D⁺. D⁺)

= (D⁺(| .)) | (D^*. D⁺)

A Pascal language identifier is composed of letters and digits, with the first character a letter:

Pascal-ID = L⁺(L | D)^*

= L (L | D)^*

A C language identifier is composed of letters, digits and the underscore, with the first character a letter or the underscore:

C-ID = (L | _ )(L | _ | D)^*

A C language comment C-Comment = /’*’(Not(‘*’/))^*’*’/
A comment that begins with // and ends with the end of the line (EOL)

Comment = //(Not(EOL)^*EOL

An identifier composed of letters, digits and the underscore, begin with a letter, end with a letter or a digit, and contain no consecutive underscores

ID = L (L | D | _ (L | D))*

= L(L | D)* ( _ (L | D)+)*)

Note:

The set { [ⁱ ]ⁱ | i 1 } = { [ ] , [[ ]] , [[[ ]]] , . . . }

of balanced brackets is not regular: It cannot be specified using a regular expression.

Exercise

Note: You may use the following regular expressions D, Z, and L to write your answers.

D = (0 | 1 | 2 | 3 | 4 | . . . | 9), Z = (1 | 2 | 3 | 4 | . . . | 9), and L = (A | B | C | . . . | Z).

Give a regular expression that describes the set of identifiers consisting of the letters and digits, and that start with a letter, and end with a digit.
Give a regular expression that describes the set of identifiers consisting of the letters, the decimal digits, and the underscore characters, and that start with a letter, and with no consecutive underscore.
Give a regular expression that describes the multiples of 100.
Give a regular expression that describes the set of identifiers composed of letters, digits, and the underscore, that begins with a letter or the underscore, and end with a letter.
Give a regular expression that describes C-like fixed-decimal constants with no superfluous leading or trailing zeros. (Note that a digit is not required on either side of the decimal point). That is 0.0, .25, 30. , 123.01, and 123005.0 are legal, but 00.0, 001.000 and 0002345.100 are illegal.
Describe in English the language generated by the regular expression: L⁺ ( D | L | _(D | L) )* D.

Language Recognition Devices

A language recognition device for a language (set of strings) described using a regular expression is an algorithm such that given the characters of a string s (one at a time from left to right) as input, it outputs "yes" if the string s belongs to the language, and "no" otherwise.
It is in general specified using Finite state automata, or implemented in the code of a program.
It is sometime necessary to add to a language recognition device the possibility to reconstruct the string from the characters read from the input.

C++ functions that you may need to write language recognition functions

int isalnum( int ch): returns TRUE if the character ch is a letter or digit and FALSE otherwise.

int isalpha( int ch): returns TRUE if the character ch is a letter and FALSE otherwise.

int isdigit( int ch): returns TRUE if the character ch is a digit and FALSE otherwise.

int isspace(int ch): returns TRUE if the character ch is a white space, and FALSE otherwise.

These functions are declared in the header file ctype.h.

Other Functions: variable ch is defined as follows: char ch; or int ch;

Function Call Description

cin.get(ch); or ch = cin.get( ); Reads the next character from the input stream into variable ch.

cin.putback(ch); Places the character in variable ch back into the input stream so that it becomes the next input character.

cin.peek (ch); Reads the next character from the input stream and assigns it into variable ch; but this character is not removed from the input stream: it will be again the next input character.

cout.put(ch); Output the character in variable ch to the output stream.

These functions are defined in the header file iostream.

int strcmp(char *strg1, char *strg2) Returns a positive value if strg1 > strg2

Returns a negative value if strg1 < strg2

Returns 0 if strg1 = strg2.

This function is defined in the header file string.h

BUFFER:

You need a buffer in your program to hold the lexeme currently being recognized by the scanner. Assume that 79 characters is the maximum length of lexemes in your programming language. The buffer must be a global variable since it must be accessed by all the language recognition devices.

For your C++ program, define the buffer as follows: char tokenBuffer[80];

REMARKS

The fact that reserved words look like identifiers in most programming languages complicates the specification of the tokens of a language using regular expressions.
A simple solution is to treat reserved words as identifiers in the specification by regular expressions, and have a table of reserved words that is searched every time an identifier is recognized. If the identifier is in the table, then it is a reserved word; otherwise it is an identifier.

Application

A language recognition device for a token repeatedly reads a character from the input stream (into an integer variable), and if the character read is part of the lexeme that it is building, it copies it into the token buffer; otherwise it does the following:

Use the putback(char ch) function to put that character back into the input stream.
Returns the code of that token if the word is recognized or the invalid code otherwise.

Example 1: FA for the regular expression: L(L|D)* _ D*
tokenType recognizer1 ( )

{

tokenType code;

int ch, i = 0;

ch = cin.get( );

if( isalpha( ch ))

{

tokenBuffer[ i++ ] = ch;

ch = cin.get( );

while( isalnum( ch ))

{

tokenBuffer[ i++ ] = ch;

ch = cin.get( );

}

if( ch == ‘_’ )

{

tokenBuffer[ i++ ] = ch;

ch = cin.get( );

while( isdigit( ch ))

{

tokenBuffer[ i++ ] = ch;

ch = cin.get( );

}

cin.putback( ch );

code = valid;

}

else // it is not an underscore

{

cin.putback( ch );

code = invalid;

}

else // it is not a letter

{

cin.putback( ch );

code = invalid;

}

return( code );

}

Example 2: FA for the regular expression: %+Not($#)* $#
tokenType recognizer2 ( )

{

tokenType code;

int ch, i = 0;

ch = cin.get( );

if( ch == ‘\%’ && cin.peek( ) == ‘+’)

{

tokenBuffer[ i++ ] = ch; // store the first symbol into the buffer

// read the second symbol and store it into the buffer

ch = cin.get( );

tokenBuffer[ i++ ] = ch;
ch = cin.get( );

while( !( ch == ‘$’ && cin.peek( ) == ‘#’ ) && cin.peek( ) != EOF)

{

tokenBuffer[ i++ ] = ch; // store the first symbol into the buffer

// read the second symbol and store it into the buffer

ch = cin.get( );

}

if(cin.peek( ) == EOF)

code = invalid;

else

{

tokenBuffer[ i++ ] = ch; // store the first symbol into the buffer

// read the second symbol and store it into the buffer

ch = cin.get( );

tokenBuffer[ i++ ] = ch;

code = valid;

}

else // it is not “%+”

{

cin.putback( ch );

code = invalid;

}

return( code );

}
Exercise

Provide the FA for each of the following regular expressions:

Identifiers: L(L | D)*
Comments: (C language comments) /’*’ Not(‘*’/)* ‘*’/
Real constants: D+ . D+
String constants: “ Not(“)* “

Write the C++ function that corresponds to the FA in question 1.

Directory: ~ndjatou
~ndjatou -> Introduction I reasons for Studying Concepts of Programming Languages

Download 71.89 Kb.

Share with your friends: