Lexical Analysis

Lexical analysis is about decompose source into a stream of tokens. A token is a class of substrings that describes all items of interest.

In order to design a lexical analyser, we may follow several steps:

define a finite set of tokens
describe which string belong to each token

This tells us that the implementation must do 2 things:

classify each substring as a token
return the value (lexeme) of the token

Regular Languages

To handle ambiguity when specifying tokens, we can use regular languages. The standard notation for regular languages is regular expressions.

Definition. Let alphabet $\Sigma$ be a set of characters. A language over $\Sigma$ is a of strings of characters drawn from $\Sigma$ .

Regular Expressions

regex can be formulated by basic elements and combonation rules.

Atomic regex.
- $L(\epsilon)=\{\texttt{""}\}$
- Single character. $L(\texttt{'c'})=\{ \texttt{"c"} \}$
Compound regex.
- Union.
- Intersection
- Iteration/Repetation. $L(A^\ast)=\cup_{i\ge 0} A^i$

Examples.
Keywords $=\texttt{"else"+"if"+"begin"}$
Digits $=\texttt{"0"+"1"+}\dots\texttt{+"9"}$
Integers $=\text{digit}\,\text{digit}^\ast$ , abbr. $A^+$

Lexical Analysis

Regular Languages

Non-/Deterministic Finite Automaton