Lexical analysis is about decompose source into a stream of tokens. A token is a class of substrings that describes all items of interest.

In order to design a lexical analyser, we may follow several steps:

  1. define a finite set of tokens
  2. describe which string belong to each token

This tells us that the implementation must do 2 things:

  • classify each substring as a token
  • return the value (lexeme) of the token

Regular Languages

To handle ambiguity when specifying tokens, we can use regular languages. The standard notation for regular languages is regular expressions.

Definition. Let alphabet Σ\Sigma be a set of characters. A language over Σ\Sigma is a of strings of characters drawn from Σ\Sigma.

Regular Expressions

regex can be formulated by basic elements and combonation rules.

  • Atomic regex.
    • L(ϵ)={""}L(\epsilon)=\{\texttt{""}\}
    • Single character. L(’c’)={"c"}L(\texttt{'c'})=\{ \texttt{"c"} \}
  • Compound regex.
    • Union.
    • Intersection
    • Iteration/Repetation. L(A)=i0AiL(A^\ast)=\cup_{i\ge 0} A^i

Examples.

  • Keywords ="else"+"if"+"begin"=\texttt{"else"+"if"+"begin"}
  • Digits ="0"+"1"++"9"=\texttt{"0"+"1"+}\dots\texttt{+"9"}
    • Integers =digitdigit=\text{digit}\,\text{digit}^\ast, abbr. A+A^+

Non-/Deterministic Finite Automaton