01-02:Structure of a Compiler

1.Lexical Analysis

Recognize words:Blank/Punctuation/Capital letter

Example: ist his ase nte nce -> is this a sentence

Lexical analysis divides program text into “words” or “tokens”

Example: if x == y then z = 1;else z = 2;

keywords、variable names、constants、operators

Once words are understood,the next step is to understand sentence structure

2.Parsing == Diagramming Sentences --> tree

Actual work: Group words together into higher level constructs

Understand sentence structure --> Understand “meaning”

3.Semantic Analysis

Compiler perform limited semantic analysis to catch inconsistencies

Programming languages define strict rules to avoid such ambiguities

Compilers perform many semantic checks besides variable bindings


Optimization has no strong counterpart in English

Automatically modify programs so that they can run faster or use less memory

Example: X = Y * 0 is the same as X = 0

valid for integers and invalid for FP(NaN * 0 = NaN)

5.Code Generation

Produces assembly code

A translation into another language

The overall structure of almost every compiler adheres to our outline

Optimization phase is very large in all modern compilers

The proportions have changed since FORTRAN

01-03:The Economy of Programming Language

Application domains have distinctive/conflicting needs.

Programmer training is the dominant cost for a programming language.

New languages tend to look like old languages.

01-04:COOL Overview

COOL:Classroom Object Oriented Language

Modern implementation:

Abstraction/Static typing/Reuse(inheritance)/Memory management…

5 programming assignments(PAs)

  • Write a COOL program
  • Lexical analysis
  • Parsing
  • Semantic analysis
  • Code generation
  • optimization(optional)
class Main{
	i : IO	<- new IO;
	//initialize to a new IO object
	main():IO{	i.out_string("Hello world!\n")	};
	//main():Object{ (new IO).out_string("Hello world!\n")	};

class Main inherits IO{
	main():Object {	self.out_string("Hello world!\n")	};

self: name of the current object when the main method runs

(In other languages self is called this)

02-02:Cool Example II

fact(i : Int) : Int {
	let fact: Int <- 1 in {
		while (not (i = 0)) loop
			fact <- fact * i;
			i <- i - 1;

02-03: Cool Example III

class List{
	item: Objects;
	init(i: Object,n: List) : List{
		item <- i;
		next <- n;
	flatten() : String {
		let string: String <-
			case item of
				i: Int => i2a(i);
				s: String => s; 
				o: Object => { abort(); "";}
		if(isvoid next) then

03-01: Lexical Analysis Part 1

lexical analysis(词法分析)

关于lexical analysis中的token

Tokens are defined often by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is termed tokenizing.


  • Identifier(标识符):以字母开头、由字母或数字组合的字符串
  • Integer(整数):一个或一串数字
  • Keyword(关键字):又称为预留字(reserved word)
  • Whitespace(留白处):空白(blank)、换行(newline)、tab

lexical analysis需要根据规定来对子字符串(substrings)进行归类,向parser传递token

token = <class ,string>

03-02:Lexical Analysis Examples

1.The goal is to partition the string.This is implemented by reading left-to-right,recognizing one token at a time.

2.“Lookahead" may be required to decide where one token ends and the next token begins

PL/I Programming Language One PL/I keywords are not reserved


K	V	 K	  V		 V	  K		V	  V



  1. Partition the input string into lexemes
  2. ldentify the token of each lexeme

03-03: Regular Languages Part 1

Lexical structure = token classes

We must say what set of strings is in a token class

​ 一Use regular languages

two base regular expressions(基本正则表达式):

  • regular language
  • epsilon(空字符)

three compound regular expressions(复合正则表达式):

  • Union(合并)

    Union of the sets of strings

    A + B = {a|a∈A} ∪ {b|b∈B}
  • Concatenation(连结)

    AB = {ab|a∈A ∧ b∈B}
  • Iteration

    A* =  ∪Aⁿ 				
    //A concatenates with itself By n times

    So the empty string is always an element of a star

The regular expressions over ∑ are the smallest set of expressions including

∑ = {0,1}

1* = "" + 1 + 11 + 111 + ... = all strings of 1's

(1+0)1 = {ab|a∈1+0 ∧ b∈1} = {11,01}

0* + 1* = {0ⁿ|n≥0} ∪ {1ⁿ|n≥0}

(0 + 1)* = "" + (0+1) + (0+1)(0+1) + ... = ∑*

03-03: Regular Languages Part 2


Regular expressions specify regular languages

Five constructs

  • Two base cases: empty and 1-character strings
  • Three compound expressions: union, concatenation, iteration

03-04: Formal Languages

Def. Let ∑ be a set of characters (an alphabet).

A language over ∑ is a set of strings of characters drawn from ∑.

Alphabet = ASCII Language = C programs

L:exp -> set of Strings

把compound expressions分解为几种基本的expressions,然后compute

Why use a meaning function?

  • Makes clear what is syntax, what is semantics

  • Allows us to consider notation as a separate issue

  • Because expressions and meanings are not 1-1


Meaning is many to one!

03-05: Lexical Specifications


digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'


digitdigit*		// equal to digit⁺	用windows自带的charmap


letter = [a-zA-Z]

letter(letter + string)*


' ' + '\n' + '\t'	//blank , newline , tab

(' ' + '\n' + '\t')⁺




digit  			= '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits 			= digit⁺
opt_fraction 	= ('.'digits) + ε	= ('.'digits)?	//问号表示可能为空
opt_exponent 	= ('E'('+' + '-' + ε) digits) + ε
//('E'('+' + '-')? digits)?
num 			= digits opt_fraction opt_exponent

//opt_fraction == optional fraction(小数)
//opt_exponent == optional exponent(指数)

04-01: Lexical Specification


At least one:	A⁺		≡	AA*
Union:			A|B 	≡	A + B
Option:			A?		≡	A + ε
Range:	'a'+'b'+...+'z' ≡	[a-z]
Excluded range:
	complement of [a-z] ≡	[^a-z]

为每个token class制定rexp

Number		=	digit⁺
Keyword 	=	'if' + 'else' + ...
Identifier	=	letter(letter + digit)*
OpenPar		=	'('

关于Lexical Specification

  • 总是优先查找Maximal Munch(Match as long as possible)

  • 总是选择优先度最高的(Highest priority match)


if ∈ L(Keywords)
if ∈ L(Identifiers)
  • 如果没有匹配的rexp,就写一个error rexp,包含error strings,并将其置于最底下(Put it last in priority)


  • Regular expressions provide a concise notation for string patterns

  • Use in lexical analysis requires small extensions

    – To resolve ambiguities

    – To handle errors

  • Good algorithms known

    – Require only single pass over the input

    – Few operations per character (table lookup)

04-02: Finite Automata Part 1

Regular expressions = specification

Finite automata = implementation

A finite automaton consists of

  • An input alphabet Σ
  • A set of states S
  • A start state n
  • A set of accepting states F ⊆ S
  • A set of transitions state →(input) state

关于transition,从start state读取input然后转至另一个state,如果input结束(enter final state)并且进入accepting stateaccept,否则就reject(S∉F or automata gets stuck)

视频中所举的三个例子,一是start stateAaccept state1input1state A通过transitionstate B,然后input结束并accept;二是accept state2input1但没有transition,所以get stuck然后reject结束;三是input10state A先读取input中的1通过transition进入state B,然后再读取input中的0,由于没有transition所以reject结束

Language of a FA ≡ set of accepted strings

例题:A finite automaton accepting any number of 1’s followed by a single 0 (Alphabet: {0,1})

input为110:start stateAinput pointerinput中的1开始,通过transition返回state A,然后读取1返回state A,读取0进入state Binput结束并进入accepting stateaccept结束

input100start stateAinput pointerinput中的1开始,通过transition返回state A,然后读取0进入state B,再读取0,没有transition就会get stuckreject结束

04-02: Finite Automata Part 2

Another kind of transition: ε-moves

Machine can move from state A to state B without reading input

Epsilon transition definition from wikipedia

An epsilon transition (also epsilon move or lambda transition) allows an automaton to change its state spontaneously(自发地), i.e. without consuming an input symbol. It may appear in almost all kinds of nondeterministic automaton in formal language theory, in particular:

  • Nondeterministic Turing machine
  • Nondeterministic pushdown automaton
  • Nondeterministic finite automaton

Deterministic Finite Automata (DFA:确定有限状态机)

  • Exactly one transition per input per state(针对唯一的input仅有唯一的transition)
  • No ε-moves

Nondeterministic Finite Automata (NFA:不确定有限状态机)

  • Can have zero, one, or multiple transitions for one input in a given state
  • Can have ε-moves

A DFA can take only one path through the state graph per input

A NFA accepts if some choices lead to an accepting state

NFA当input结束进入final state时,如果accepting state存在于set of possible states中,就会accept


04-03: Regular Expressions into NFAs

Lexical Specification -> Regular Expressions -> NFA -> DFA -> Table-driven Implementation of DFA


A、C因为1*处transition不足直接排除,B为正确答案,D错在接受input为1后通过epsilon move返回的是start state,这样在start state可以通过epsilon move进入state 0,与题意不符

04-04: NFA to DFA

Epsilon Closure

某个state的Epsilon Closure是当前state能通过epsilon move到达的所有state的set(包括自身)

Epsilon closure for a given state X is a set of states which can be reached from the states X with only (null) or E moves including the state X itself.




首先写出start state(A)的Epsilon Closure,为ABCGHI,然后有两个transition

  1. input为1,进入state D,EC为DE,继续读取input 0,进入state F,EC为FGHBC,又重复读取input 1
  2. input为0,进入state J,EC为JH



04-05: Implementing Finite Automata

A DFA can be implemented by a 2D table T

– One dimension is “states”

– Other dimension is “input symbol”

– For every transition Si →a Sk define T[i,a] = k

横坐标为input,纵坐标为state,表格中的值k为每个state在读取对应的input以后进入的next state

i = 0;
state = 0;
	state = A[state,input[i++]];

如果重复的state过多,可以创建两个一维数组,对每个state用pointer指向对应input下的next state

NFA -> DFA conversion is key