Lexing

Agenda

High level goal: processing text
Symbols vs. meaning
Formal language definitions
Parallels between formal languages and computation
Finite state automata and regular languages
Regular expressions
Deterministics vs. non-deterministic finite automata (DFAs vs. NFAs)
Compiling regular expressions to NFAs
NFA to DFA conversion

Big picture view of a compiler

Input: program source code
Output: machine code

Compiler vs. interpreter

Interpreter

Compiler

Interpreter (with source)

Compiler (with source)

Compiler phases

Front-end
"Middle-end"
Back-end
LLVM Compiler Infrastructure

Front-end

Language-processing
- Remember regexes, grammar from discrete? (If you took it)
Type checking
- Did I assign an array to a float?

Back-end

Code generation
- Equivalent machine code for any valid input program
Optimization
- Choose equivalent machine code that's faster, smaller, etc.

"Middle-end"

Intermediate code
- Like machine code
- Without machine specifics
- Lots of different designs

Symbols vs. meaning

1 + 2 * 3

6 / 2(1 + 2)

一加二乘三

What gives symbols meaning?

Moon vs. finger pointing at the moon
"The map is not the territory"

the compiler itself defines meaning
defines input language in terms of output language
(why does output language have meaning?)

ASCII vs. machine code

od
gcc -S
ASCII table

Formal languages

Definition

A (potentially) infinite set of
- finite strings
- from a finite alphabet of symbols

Defining languages

Concatenating strings
Joining sets

A toy example

Alphabet: "a" and "b"
Language: one a followed by any number of b's

a's and b's

Matched parentheses

(^n )^n

Binary numbers

Hex numbers

Identifiers

Automata

Kleene closure

Regular expressions

String pattern matching

phone numbers
email addresses
programming language tokens

How would you write this program?

Search for an email address in a text file?

Example

Sequence

Area code is three digits

Alternation

A digit is either a 0, 1, 2, .., 9

Parentheses are just used to make order of operations explicit, just like in arithmetic

Repetition

Any number of characters before the @ sign

Optional elements

E.g., country code

Wild cards

Allow any character (or some subset of characters)

More examples

Regular expression language

concatenation, e.g., ab
alternation, e.g., a|b
Kleene closure, e.g., a*

The order of operations, from highest to lowest, is Kleene closure, concatenation, and alternation.

One way to remember order of operations, is that alternation is like addition or logical or, concatenation is like multiplication or logical and, and Kleene closure is like exponentiation.

Regular expressions in practice

Lots of syntactic sugar available
- python library (perl syntax)

Matching against regular expressions

Given an expression (ab|d)*
Can hand-write a program for specific regular expression
But we can also automate their generation

Finite automata

States and transitions between states
Example: turnstile

By N/D - Link CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=131513928

Turnstile automaton

Turnstile states: locked and unlocked
Turnstile transitions: push and coin
Transition move from state to state
- E.g., coin causes unlock, push reutrns to locked

640px-Turnstile_state_machine_colored.svg.png

By Chetvorno - Link - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=20269475

Finite state automata are a model of computation

Corresponds to regular languages
- Any regular expression can be recognized by a finite automaton and vice-versa
AKA
- Finite state machines, Finite automaton, State machine

Here's our first abstract machine model

Takeaway here: if we can frame the problem we want to solve as a finite statement machine, then we already have a language, regular expressions, to write a solution in.

Instead of trying to create a de novo algorithm, we can map the problem into the abstract problem of finite state transitions or regular language pattern matching.

Chomsky hierarchy

640px-Chomsky-hierarchy.svg.png

By J. Finkelstein - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9405226

Chomsky hierarchy table

Equivalence between language and computation

Language	Computation
Regular	Finite automata, regex
Context-free	Parsers, push-down automata (PDA)
Context-sensitive	Parsers, non-deterministic PDAs
Recursively enumerable	Turing machine

Regular languages
- finite automata, regular expressions
Context-free languages
- Syntactic structures Parsers, push-down automata
Context-sensitive
- Syntax depends on surrounding contet
- (Fancier) parsers, non-deterministic push-down automata
Recursively enumerable
- Recognized by a Turing machine

Benefits of abstract computational models

Matching regular expressions is solved
Map your problem to a regular expression
Use an existing tool to compute solutions for you

Researchers have spent time automation the solution to this abstract problem, so that you don't have to repeat it.

Just like a compiler takes a programming language and translates it to assembly, a regular expression.

Then you can frame your problem as a regular expression, write it, and a regular expression compiler will turn this into code for you.

Applications of finite state machines

Traffic lights
Turnstiles
Vending machines
Lexers

String pattern matching is equivalent to many automation tasks

Examples from gaming

Abstract machine model

Formal language: potentially infinite set of strings
Each string drawn from a finite alphabet
Each string element itself is finite

Definition of automata

Finite alphabet of symbols (like regular languages)
Finite set of states
State transition function
Initial state and final state(s)

https://en.wikipedia.org/wiki/Finite-state_machine#Mathematical_model

Graphical representation with state diagrams

States: circles
Transitions: labeled arrows
Starting state: in-arrow
Accepting states: doubled circles

Example: identifiers

Letters and digits are character classes
- Makes drawing easier
"other" is all characters besides lettersdigits
- Represents teh decision based on a lookahead
- Matches longest identifier sequence

More complex automaton: numbers

What kinds of numbers can be represented here?
What is an equivalent regular expression?

Nondeterministic vs. Deterministic

Deterministic: one state at a time
Nondeterministic: mulitples states at once
- Same symbol, multiple transitions
- Epsilon (empty string) transitions

Examples

Demo: diagrams

L(L|D)*
(a|b)*abb
aa*|bb*

(a|b)*abb

motivate using NFAs and $\epsilon$ to make it easier to try multiple possibilities

attempt to construct intuitively, then see systematic method next

Transition tables

Can use table instead of diagram
Convenient for implementation
DFAs vs. NFAs

Demo: transition tables

L(L|D)*
(a|b)*abb
aa*|bb*

do a table driven version of (a|b)(a|b)*

one column per letter of the alphabet

one row per state

In our compiler

Use finite state machines to automate lexing
One regular expression per token
flex compiles regular expressions to a C program
Output recognizes tokens matching the regular expressions

Automate lexing by generating automata

Specific tokens via regular expressions
Existing algorithm to convert regx to automata
- This is what flex does
Two methods
- Simulate an NFA
- Convert an NFA to a DFA (subset construction)

Convert via subset construction

NFA can be multiple states at once
Finite automata means finite number
Key observation: there are finite number of combinations of NFA states
- How many subsets of states are there in a finite set?

The powerset tell us how many combinations of finite states that are possible.

Translate each regex in order of operations

Convert each subexpression to an NFA
Combine NFAs for each subexpression
Each regex operation corresponds to an NFA template
- Concatenation
- Alternation
- Kleene closure

Concatenation

Alternation

Kleene closure

NFA example

Demo: converting regex to NFA

(a|b)*abb
aa*|bb*
((a|bc)b)*

(a | b)*abb

much easier with NFAs

Construct DFA from an NFA systematically

Each DFA state created from subset of NFA states
- NFA can be multiple states at once
"Simulate" being in multiple states using a single state
The multiple states are a subset of the NFA states
Create the DFA by calling each subset a single DFA state

Sketch of the subset construction algorithm

Start at the NFA's starting state
Group all states reachable by $\epsilon$ (epsilon)
- This is the $ε$-closure/
- Call this group of the states the initial state
For each symbol s in the alphabet (which is finite)
- Get all state that s transitions to
- Find the $ε$-closure of those states
- Call this group of states one DFA state
Repeat for all combinations states and symbols
Stop when all are covered

Demo: converting NFA to DFA

(a|b)*abb
aa*|bb*
((a|bc)b)*

Conclusion

We can automatically generate lexers
Regular expressions correspond to automata
- Implemented with transitions tables, if statements, etc.
Simpler to generate NFAs from regular expressions
Subset construction to convert NFA to DFA
- Alternative: simulate an NFA

Lexing COP-5621

Agenda

Big picture view of a compiler

Compiler vs. interpreter

Interpreter

Compiler

Interpreter (with source)

Compiler (with source)

Compiler phases

Front-end

Back-end

"Middle-end"

Symbols vs. meaning

1 + 2 * 3

6 / 2(1 + 2)

一加二乘三

What gives symbols meaning?

ASCII vs. machine code

Formal languages

Definition

Defining languages

A toy example

Matched parentheses

Binary numbers

Hex numbers

Identifiers

Automata

Kleene closure

Regular expressions

How would you write this program?

Example

Sequence

Alternation

Repetition

Optional elements

Wild cards

More examples

Regular expression language

Regular expressions in practice

Matching against regular expressions

Finite automata

Turnstile automaton

Finite state automata are a model of computation

Chomsky hierarchy

Chomsky hierarchy table

Benefits of abstract computational models

Applications of finite state machines

Examples from gaming

Abstract machine model

Definition of automata

Graphical representation with state diagrams

Example: identifiers

More complex automaton: numbers

Nondeterministic vs. Deterministic

Examples

Demo: diagrams

Transition tables

Demo: transition tables

In our compiler

Automate lexing by generating automata

Convert via subset construction

Translate each regex in order of operations

Concatenation

Alternation

Kleene closure

NFA example

Demo: converting regex to NFA

Construct DFA from an NFA systematically

Sketch of the subset construction algorithm

Demo: converting NFA to DFA

Conclusion

Lexing
COP-5621