Lexical analysis decomposes source text into a stream of tokens. A token is a class of substrings, such as identifiers, keywords, or integers, that groups together the items of interest.
To design a lexical analyser, we follow two steps:
- define a finite set of tokens
- describe which strings belong to each token
This tells us that the implementation must do two things:
- classify each substring as a token
- return the value (lexeme) of the token
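The two tasks above can be sketched in a few lines of Python. This is only a minimal illustration, not a production lexer: the token classes, their names, and their patterns in `TOKEN_SPEC` are assumptions chosen for the example, and the sketch relies on Python's standard `re` module.

```python
import re

# Hypothetical token classes for illustration; names and patterns are assumptions.
# Order matters: earlier entries win when several patterns match at one position.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INTEGER", r"[0-9]+"),
    ("OPERATOR", r"[=+\-*/<>]"),
    ("WHITESPACE", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token_class, lexeme) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        # Classify the substring (match.lastgroup) and keep its lexeme (match.group()).
        if match.lastgroup != "WHITESPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("if x1 = 42")` classifies `if` as a keyword, `x1` as an identifier, `=` as an operator, and `42` as an integer, and skips the whitespace between them.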
Regular Languages
To specify without ambiguity which strings belong to each token, we can use regular languages. The standard notation for regular languages is regular expressions.
Definition. Let the alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
Regular Expressions
A regular expression (regex) is built from basic elements and combination rules.
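- Atomic regex.
  - Single character: 'c' denotes the language {"c"}.
- Compound regex.
  - Union: A | B contains every string that is in A or in B.
  - Concatenation: AB contains every string formed by a string from A followed by a string from B.
  - Iteration (repetition): A* contains every string formed by concatenating zero or more strings from A.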
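Because a regular expression denotes a set of strings, its building blocks can be modelled as operations on Python sets. The sketch below is an assumption-laden illustration: the function names are hypothetical, it covers a single-character atom, union, concatenation, and iteration, and iteration is cut off at `max_repeats` because the full language A* is infinite.

```python
from itertools import product


def single(c):
    """Atomic regex: the language containing only the one-character string c."""
    return {c}


def union(a, b):
    """Strings that belong to language a or to language b."""
    return a | b


def concat(a, b):
    """Every string from a followed by every string from b."""
    return {x + y for x, y in product(a, b)}


def iterate(a, max_repeats):
    """Finite approximation of A*: zero up to max_repeats concatenations of a."""
    result = {""}   # zero repetitions yields the empty string
    current = {""}
    for _ in range(max_repeats):
        current = concat(current, a)
        result |= current
    return result
```

For instance, `iterate(union(single("0"), single("1")), 2)` yields every binary string of length at most two, including the empty string.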
Examples.
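- Keywords: a union of literal strings, e.g. 'if' | 'else' | 'while'
- Digits: digit = '0' | '1' | ... | '9'
- Integers: integer = digit digit*, abbr. digit+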
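These token classes can be tried out with Python's `re` module. The sketch is illustrative only: the keyword list is an assumption, and `belongs` is a hypothetical helper that uses `fullmatch` so that the whole string, not just a prefix, must be in the language.

```python
import re

KEYWORD = re.compile(r"if|else|while")      # assumption: a sample set of keywords
DIGIT = re.compile(r"[0-9]")                # any single decimal digit
INTEGER_LONG = re.compile(r"[0-9][0-9]*")   # one digit followed by zero or more digits
INTEGER = re.compile(r"[0-9]+")             # the same language, written with the + abbreviation


def belongs(pattern, text):
    """Return True when the entire text is in the language of the pattern."""
    return pattern.fullmatch(text) is not None
```

Here `belongs(KEYWORD, "else")` holds while `belongs(KEYWORD, "elsewhere")` does not, and `INTEGER` and `INTEGER_LONG` accept exactly the same strings.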