Compilers
Introduction
A compiler is a tool:
It translates notations from one system into another, usually from source code (high-level code) to machine code (target code or low-level code):
a). from a source language (e.g. Java, Scheme, LATEX)
b). into a target language (e.g. JVM, Intel x86, PDF)
c). while preserving its meaning in the process
• Compiler design has a long history (FORTRAN 1958)
a). lots of experience on how to structure compilers
b). lots of existing designs to study (many freely available)
c). take CS 152: Compiler Design for some of the details...
We generally use natural language to communicate with each other and programming languages to communicate with computers.
Components of a Compiler:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
The Scanner
a.) Performs Lexical Analysis
b.) Collects character sequences into tokens.
Example:
a[index]=4+2
Tokens::
a identifier
[ left bracket
index identifier
] right bracket
= assignment
4 number
+ plus sign
2 number
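As a rough illustration (the type and field names below are assumptions for the sketch, not part of any particular scanner), each token the scanner produces could be represented by a small C structure:

/* A minimal sketch of a token: its kind plus the matched characters. */
enum token_kind {
    TOK_IDENTIFIER,   /* a, index */
    TOK_LBRACKET,     /* [ */
    TOK_RBRACKET,     /* ] */
    TOK_ASSIGN,       /* = */
    TOK_NUMBER,       /* 4, 2 */
    TOK_PLUS          /* + */
};

struct token {
    enum token_kind kind;
    const char     *lexeme;   /* the character sequence that was matched */
};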
The Scanner may also ::
1. Enter Identifiers into the Symbol Table
2. Enter Literals (numeric constants and strings ) into the Literal Table
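A symbol-table entry might be sketched roughly as follows (names and fields are illustrative only; later phases fill in the type and storage information):

/* Illustrative symbol-table entry; the scanner records the name,
   and later phases add type, length, storage class, and so on. */
struct symbol_entry {
    const char *name;            /* e.g. "a" or "index" */
    int         type;
    int         length;
    int         storage_class;
};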
Illustration:
1. Lexical Analysis::
She kicks the Ball
Token Generation:
- She
- kicks
- the
- Ball
Who performs this?
LEXICAL ANALYSIS
2. Syntax Analysis::
The Parser
1. It performs syntax analysis
2. Build a parse tree or syntax tree
Parse Tree for a[index]=4+2
An abstract syntax tree is a more concise version of a parse tree.
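For instance (a minimal sketch with made-up names, not a fixed interface), a syntax-tree node for a[index]=4+2 might be declared in C as:

enum node_kind { N_ASSIGN, N_SUBSCRIPT, N_ADD, N_IDENT, N_NUMBER };

struct node {
    enum node_kind kind;
    struct node   *left;    /* e.g. the a[index] part of an assignment */
    struct node   *right;   /* e.g. the 4+2 part of an assignment      */
    const char    *name;    /* used when kind == N_IDENT               */
    int            value;   /* used when kind == N_NUMBER              */
};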
3. Semantic Analyzer::
This attaches meaning to the tokens and checks the program for semantic consistency. A typical example is type checking, for instance verifying that the expression assigned to a[index] has a numeric type.
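One such check, sketched very loosely against the node structure above (illustrative only; a real analyzer would consult the symbol table for declared types), might look like:

/* Rough sketch: report whether an expression subtree is numeric. */
int is_numeric(const struct node *n)
{
    if (n->kind == N_NUMBER)
        return 1;
    if (n->kind == N_ADD)
        return is_numeric(n->left) && is_numeric(n->right);
    return 0;   /* identifiers would need a symbol-table lookup */
}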
4. Intermediate Code Generation::
5. Code Optimization::
6. Code Generation::
Error Handling Requirements::
- A compiler must handle syntax and semantic errors, but not run-time errors (whether a run-time error will occur is the million-dollar question).
- Sometimes a compiler is required to generate code that catches run-time errors and deals with them in some graceful way (with or without recovering). This, too, is often difficult.
Major Data Structures::
Phases of a Compiler::
1. LEXICAL PHASE:
Separating the source program into its proper lexical classes is the task of the lexical analyzer. The program is scanned and separated into basic elements known as tokens. Basic elements such as identifiers and literals are placed into tables; as other phases recognize the use and meaning of these elements, further information is entered into the tables, such as data type, length, and storage class.
2. SYNTAX PHASE:
Syntax analysis is also known as parsing; in this phase the compiler interprets the structure of the program's constructs. The basic function of the syntax phase is to recognize the major constructs of the language and to call the appropriate action routines that will generate the intermediate form, or matrix, for these constructs.
3. INTERPRETATION PHASE:
Once the program is correct according to the syntax, the compiler generates an intermediate code in this phase. This intermediate code may be in the form of a parse tree, postfix notation, or three-address code.
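For the running example a[index]=4+2, the three-address form might look roughly like this (an illustrative sketch, not the output of any particular compiler):

t1 = 4 + 2
a[index] = t1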
4. OPTIMISATION:
A program is said to be optimized if it consumes less memory and executes faster. The code optimizer receives the intermediate code and optimizes it with respect to factors such as memory use and execution time. This phase also removes redundant code.
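Continuing the illustrative three-address sketch above, a simple optimization such as constant folding would compute 4 + 2 at compile time and remove the temporary:

a[index] = 6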
5. STORAGE ASSIGNMENT:
It is the process of allocating memory to various resources. Memory may be allocated to literals, symbols, machine operations, and so on; the processing environment may also need some memory. All of these memory allocations are performed by the compiler in this phase.
6. CODE GENERATION:
It is the last phase of the compiler. This phase receives the optimized intermediate code and generates the code for execution; the generated code will usually be relocatable machine code or assembly code. It receives its input from the intermediate code generator, or from the code optimizer in the case of optimizing compilers. The input to the code generator must be error free; it must have passed through lexical analysis, syntax analysis, and the other intermediate phases. The target program may be in the form of
1. Assembly language
2. Relocatable machine code
3. Absolute machine code
Absolute machine code is the form that is ready to be executed. Relocatable code must first be linked by the linker and loaded by the loader, thus making it ready for execution.
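Purely as an illustration (generic, assembly-like pseudo-instructions rather than any real instruction set), the optimized statement above might be emitted as:

MOV  R1, 6          ; load the folded constant
MOV  a[R2], R1      ; store into a[index], index assumed to be in R2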
Lex Theory
During the first phase the compiler reads the input and converts strings in the source to tokens. With regular expressions we can specify patterns to lex so it can generate code that will allow it to scan and match strings in the input. Each pattern in the input to lex has an associated action. Typically an action returns a token that represents the matched string for subsequent use by the parser. Initially we will simply print the matched string rather than return a token value.
The following represents a simple pattern, composed of a regular expression, that scans for identifiers. Lex will read this pattern and produce C code for a lexical analyzer that scans for identifiers.
letter(letter|digit)*
This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits. This example nicely illustrates operations allowed in regular expressions:
- repetition, expressed by the "*" operator
- alternation, expressed by the "|" operator
- concatenation
Any regular expression may be expressed as a finite state automaton (FSA). We can represent an FSA using states, and transitions between states. There is one start state and one or more final or accepting states.
Figure 3: Finite State Automaton
In Figure 3 state 0 is the start state and state 2 is the accepting state. As characters are read we make a transition from one state to another. When the first letter is read we transition to state 1. We remain in state 1 as more letters or digits are read. When we read a character other than a letter or digit we transition to accepting state 2. Any FSA may be expressed as a computer program. For example, our 3-state machine is easily programmed:
start: goto state0
state0: read c
if c = letter goto state1
goto state0
state1: read c
if c = letter goto state1
if c = digit goto state1
goto state2
state2: accept string
This is the technique used by lex. Regular expressions are translated by lex to a computer program that mimics an FSA. Using the next input character and current state the next state is easily determined by indexing into a computer-generated state table.
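A hand-written, table-driven version of the same 3-state machine might look roughly like this C sketch (illustrative only; the tables lex generates are considerably more elaborate):

#include <ctype.h>

/* next_state[state][class]: class 0 = letter, 1 = digit, 2 = other */
static const int next_state[3][3] = {
    /* state 0 */ { 1, 0, 0 },
    /* state 1 */ { 1, 1, 2 },
    /* state 2 */ { 2, 2, 2 }   /* accepting state */
};

static int char_class(int c)
{
    if (isalpha(c)) return 0;
    if (isdigit(c)) return 1;
    return 2;
}

/* Returns nonzero if the machine accepts: it reaches state 2, or the
   input ends while still in state 1 (an identifier at end of input). */
int scan_identifier(const char *input)
{
    int state = 0;
    while (*input && state != 2)
        state = next_state[state][char_class((unsigned char)*input++)];
    return state == 1 || state == 2;
}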
Now we can easily understand some of lex's limitations. For example, lex cannot be used to recognize nested structures such as parentheses. Nested structures are handled by incorporating a stack. Whenever we encounter a "(" we push it on the stack. When a ")" is encountered we match it with the top of the stack and pop the stack. However lex only has states and transitions between states. Since it has no stack it is not well suited for parsing nested structures. Yacc augments an FSA with a stack and can process constructs such as parentheses with ease. The important thing is to use the right tool for the job. Lex is good at pattern matching. Yacc is appropriate for more challenging tasks.
Yacc Theory
Grammars for yacc are described using a variant of Backus Naur Form (BNF). This technique, pioneered by John Backus and Peter Naur, was used to describe ALGOL60. A BNF grammar can be used to express context-free languages. Most constructs in modern programming languages can be represented in BNF. For example, the grammar for an expression that multiplies and adds numbers is
1    E -> E + E
2    E -> E * E
3    E -> id
Three productions have been specified. Terms that appear on the left-hand side (lhs) of a production, such as
E, are nonterminals. Terms such as id (identifier) are terminals (tokens returned by lex) and only appear on the right-hand side (rhs) of a production. This grammar specifies that an expression may be the sum of two expressions, the product of two expressions, or an identifier. We can use this grammar to generate expressions:

E -> E * E          (r2)
  -> E * z          (r3)
  -> E + E * z      (r1)
  -> E + y * z      (r3)
  -> x + y * z      (r3)
At each step we expanded a term, replacing the lhs of a production with the corresponding rhs. The numbers on the right indicate which rule was applied. To parse an expression we need to do the reverse operation. Instead of starting with a single nonterminal (the start symbol) and generating an expression from a grammar, we need to reduce an expression to a single nonterminal. This is known as bottom-up or shift-reduce parsing and uses a stack for storing terms. Here is the same derivation but in reverse order:
 1    . x + y * z        shift
 2    x . + y * z        reduce(r3)
 3    E . + y * z        shift
 4    E + . y * z        shift
 5    E + y . * z        reduce(r3)
 6    E + E . * z        shift
 7    E + E * . z        shift
 8    E + E * z .        reduce(r3)
 9    E + E * E .        reduce(r2)    emit multiply
10    E + E .            reduce(r1)    emit add
11    E .                accept
Terms to the left of the dot are on the stack while the remaining input is to the right of the dot. We start by shifting tokens onto the stack. When the top of the stack matches the rhs of a production we replace the matched tokens on the stack with the lhs of the production. In other words the matched tokens of the rhs are popped off the stack, and the lhs of the production is pushed onto the stack. The matched tokens are known as a handle, and we are reducing the handle to the lhs of the production. This process continues until we have shifted all input to the stack and only the starting nonterminal remains on the stack. In step 1 we shift the x onto the stack. Step 2 applies rule r3 to the stack to change x to E. We continue shifting and reducing until a single nonterminal, the start symbol, remains on the stack. In step 9, when we reduce by rule r2, we emit the multiply instruction. Similarly the add instruction is emitted in step 10. Consequently multiplication has a higher precedence than addition.
Consider the shift at step 6. Instead of shifting we could have reduced, applying rule r1. This would result in addition having a higher precedence than multiplication. This is known as a shift-reduce conflict. Our grammar is ambiguous because there is more than one possible derivation that will yield the expression. In this case operator precedence is affected. As another example, association in the rule
E -> E + E
is ambiguous, for we may recurse on the left or the right. To remedy the situation, we could rewrite the grammar or supply Yacc with directives that indicate which operator has precedence. The latter method is simpler and will be demonstrated in the practice section.
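A sketch of how such precedence directives might look in a yacc grammar (illustrative only; complete specifications appear in the practice section):

%token ID              /* identifier token returned by lex */
%left '+'              /* declared first: lower precedence  */
%left '*'              /* declared later: higher precedence */
%%
expr : expr '+' expr
     | expr '*' expr
     | ID
     ;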
The following grammar has a reduce-reduce conflict. With an
id on the stack we may reduce to T or E.

E -> T
E -> id
T -> id
Yacc takes a default action when there is a conflict. For shift-reduce conflicts Yacc will shift. For reduce-reduce conflicts it will use the first rule in the listing. It also issues a warning message whenever a conflict exists. The warnings may be suppressed by making the grammar unambiguous. Several methods for removing ambiguity will be presented in subsequent sections.
