Unit 4: Scanning and Lexical Analysis
Lexical analysis is performed by a scanner, one of the front-end components of a compiler. The foundation for lexical analysis is provided by regular grammars and finite state automata. This unit studies scanners and lexical analysis in terms of the products of the development process: requirements, functions, design, construction, and test. The verification of a scanner is done through testing. Validation is based on the programming language specification and on the operation of the scanner as a component of the compiler or application system that uses it.
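For concreteness, the C sketch below shows the kind of output a scanner produces: a stream of tokens, where each token pairs a token class with its lexeme, the exact characters matched in the source. The type and token names (Token, TK_IDENT, and so on) are invented for illustration and are not taken from the course readings.

    /* A hypothetical token representation a scanner might produce for
       the source line "count = count + 1;". Names are illustrative. */
    #include <stdio.h>

    typedef enum { TK_IDENT, TK_NUMBER, TK_ASSIGN, TK_PLUS, TK_SEMI, TK_EOF } TokenKind;

    typedef struct {
        TokenKind kind;      /* the token class, e.g. identifier or number */
        const char *lexeme;  /* the exact characters matched in the source */
    } Token;

    int main(void) {
        Token stream[] = {
            { TK_IDENT,  "count" }, { TK_ASSIGN, "=" },
            { TK_IDENT,  "count" }, { TK_PLUS,   "+" },
            { TK_NUMBER, "1"     }, { TK_SEMI,   ";" },
            { TK_EOF,    ""      },
        };
        for (int i = 0; stream[i].kind != TK_EOF; i++)
            printf("kind=%d lexeme=%s\n", stream[i].kind, stream[i].lexeme);
        return 0;
    }

Notice that blanks produce no tokens at all, and a keyword would typically receive its own token class rather than being treated as an ordinary identifier.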
Completing this unit should take you approximately 12 hours.
Upon successful completion of this unit, you will be able to:
- Explain scanning and lexical analysis in the context of the compilation process.
- Define token and lexeme.
- Specify the functions of a scanner in performing lexical analysis.
- Describe the handling of blanks, keywords, and comments.
- Summarize the design and construction of a scanner.
4.1: Lexical Analysis Introduction and Overview
Watch from the 4-minute mark to the 30-minute mark, and from the 37-minute mark to the end. This video gives you a glimpse into lexical analysis, which will be studied in more depth in the remaining sections of this unit.
4.2: Requirements for a Scanner
Read the Basics on pages 1-2.
Read through slide 45.
4.3: Review of Regular Expressions, FSAs, and Regular Languages
Look over Chapter 2 on Lexical Analysis. If you are comfortable with this material, just review it. If you are not yet comfortable with it, study it carefully until you are; it is the foundation for much of our work on compilers.
Every regular expression has an associated non-deterministic finite automaton (NFA) that accepts exactly the language the expression denotes.
The recognition of a regular expression in a regular language can be done in several ways: by operating on the expression directly, by using an NFA, or by using a DFA. An automaton can also be transformed into an equivalent automaton with fewer states. The size (that is, the number of states) and the speed of the automaton determine the most efficient way to recognize a regular expression. Note in section 2.7 of the reading the time estimates for processing an expression with a DFA and with an NFA.
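As a concrete illustration, the C sketch below simulates a table-driven DFA for the textbook regular expression (a|b)*abb. Each input character triggers exactly one table lookup, so recognition time is linear in the length of the input, which is the point of the DFA time estimate in section 2.7. The table layout and function names are chosen here for illustration.

    /* Table-driven DFA recognition of (a|b)*abb.
       States 0..3; state 3 is the only accepting state. */
    #include <stdio.h>

    /* delta[state][symbol]: symbol 0 stands for 'a', symbol 1 for 'b' */
    static const int delta[4][2] = {
        {1, 0},   /* state 0 */
        {1, 2},   /* state 1 */
        {1, 3},   /* state 2 */
        {1, 0},   /* state 3 (accepting) */
    };

    int matches(const char *s) {
        int state = 0;
        for (; *s; s++) {
            if (*s != 'a' && *s != 'b') return 0;   /* reject other symbols */
            state = delta[state][*s == 'b'];        /* one lookup per character */
        }
        return state == 3;
    }

    int main(void) {
        printf("%d %d\n", matches("aababb"), matches("abba"));  /* prints: 1 0 */
        return 0;
    }

Simulating an NFA directly would instead track a set of possible states per input character, which is why the per-character cost quoted for NFAs is higher.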
Regular languages include many languages and can be used for many applications, including scanners. Further, given regular languages, their union, concatenation, repetition, set difference, and intersection are also regular languages; we say that regular languages are closed under these operations. Regular languages can be expressed, equivalently, using regular expressions, NFAs, or DFAs. A key limitation of regular languages can be seen from DFAs. A DFA is a finite automaton and thus can remember only a finite amount of information. Hence, no DFA can recognize the language of strings of the form a^n b c^n for arbitrary n, because counting the a's so that they can be matched against the c's would require an unbounded number of states. Most programming languages are not regular, so we will need a larger class of formal languages to parse them.
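As one way to see closure under intersection, two DFAs can be run in lockstep; this is the product construction, whose states are pairs of the component states. The C sketch below uses two small DFAs invented for illustration (strings over {a, b} with an even number of a's, and strings ending in b) and accepts exactly when both components accept.

    /* Lockstep simulation of two DFAs: the product DFA for the
       intersection of their languages. The example DFAs are invented. */
    #include <stdio.h>

    int in_intersection(const char *s) {
        int parity = 0;   /* DFA A: 0 = even number of a's seen so far */
        int ends_b = 0;   /* DFA B: 1 = last symbol read was 'b' */
        for (; *s; s++) {
            if (*s == 'a')      { parity ^= 1; ends_b = 0; }
            else if (*s == 'b') { ends_b = 1; }
            else return 0;      /* symbols outside {a, b} are rejected */
        }
        return parity == 0 && ends_b == 1;   /* accept iff both DFAs accept */
    }

    int main(void) {
        printf("%d %d\n", in_intersection("aab"), in_intersection("ab"));  /* prints: 1 0 */
        return 0;
    }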
4.4: Design of a Scanner
Read from page 3 to the end, including Scanner Implementation 1 and 2, and the FORTRAN I Case Study.
Read slides 46-216.
4.5: Construction of a Scanner
Read these slides.
Watch the rest of this lecture.
Read these notes. FLEX is a scanner generator: it produces a scanner from a description of the patterns to be recognized and the action to take for each token.
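The sketch below is a minimal Flex specification in that style: a pattern on the left of each rule and a C action on the right. The token codes and the particular rules are invented for illustration and do not come from the linked notes; a real compiler would share its token codes with the parser.

    %{
    #include <stdio.h>
    /* Hypothetical token codes, invented for this sketch. */
    enum { T_NUMBER = 256, T_IDENT, T_IF, T_PLUS };
    %}

    %option noyywrap

    %%
    [ \t\n]+                { /* skip blanks: no token is returned */ }
    "//".*                  { /* skip a one-line comment */ }
    "if"                    { return T_IF;     /* keyword rule listed before identifiers */ }
    [a-zA-Z_][a-zA-Z0-9_]*  { return T_IDENT;  }
    [0-9]+                  { return T_NUMBER; }
    "+"                     { return T_PLUS;   }
    .                       { printf("unexpected character: %s\n", yytext); }
    %%

    int main(void) {
        int tok;
        while ((tok = yylex()) != 0)              /* yylex returns 0 at end of input */
            printf("token %d, lexeme \"%s\"\n", tok, yytext);
        return 0;
    }

Processing such a file with flex yields a C source file (conventionally lex.yy.c) containing the yylex function, which is then compiled like any other C code.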
Read these slides.
Read section 2.9.1. Scanners are usually not written from scratch by hand; they are produced by tools called scanner generators (just as parsers are produced by parser generators).
Review this manual, which contains details on the Flex scanner generator. The link will take you to the introduction section of the manual. Use the links at the top of each page to navigate to the next page of the resource, reading up through section "24 Limitations". For further information, feel free to browse other parts of the manual, including the "Additional Readings" and "FAQ" sections.
4.6: Verification and Validation of a Scanner
Scanners for production compilers or software systems must be verified (shown, by proof, demonstration, review, or test, to meet their requirements, design, and performance specifications) and validated (shown, also by proof, demonstration, review, or test, to satisfy the needs of their users and their role in the larger containing system, whether a compiler or an application). The effort spent on verifying and validating a scanner depends on the scanner's purpose and intended use. For both verification and validation, the description of the input language must be shown to be correct. If the scanner is generated, its quality also depends on the reliability and effectiveness of the generator.
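One common verification technique is token-level testing: run the scanner over inputs whose correct token sequence is known from the language specification and compare the two. The C sketch below does this for a tiny hand-written scanner; the scanner, its interface (next_token), and the token names are all invented for illustration.

    /* Verifying a toy scanner by checking its token sequence against
       the expected one. Everything here is a sketch, not production code. */
    #include <assert.h>
    #include <ctype.h>
    #include <stdio.h>

    typedef enum { TK_IDENT, TK_NUMBER, TK_EOF, TK_ERROR } TokenKind;

    /* Return the kind of the next token in *p, advancing *p past it. */
    TokenKind next_token(const char **p) {
        while (isspace((unsigned char)**p)) (*p)++;          /* skip blanks */
        if (**p == '\0') return TK_EOF;
        if (isalpha((unsigned char)**p)) {
            while (isalnum((unsigned char)**p)) (*p)++;
            return TK_IDENT;
        }
        if (isdigit((unsigned char)**p)) {
            while (isdigit((unsigned char)**p)) (*p)++;
            return TK_NUMBER;
        }
        (*p)++;
        return TK_ERROR;
    }

    int main(void) {
        const char *input = "x1 42  y";
        TokenKind expected[] = { TK_IDENT, TK_NUMBER, TK_IDENT, TK_EOF };
        for (int i = 0; i < 4; i++)
            assert(next_token(&input) == expected[i]);       /* verification check */
        puts("scanner test passed");
        return 0;
    }

A production test suite would, of course, also cover error cases, keywords, comments, and boundary conditions drawn from the language specification.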
In this exercise, you will be introduced to LEX and YACC, whose names come from "lexical analyzer" and "Yet Another Compiler-Compiler". They are generators, that is, programs that generate other programs: in this case, a scanner and a parser, respectively. Our focus in this Unit 4 Assessment will be on LEX, or more precisely FLEX, the Fast Lexical Analyzer. Read this tutorial on LEX; it provides a concise description of LEX and YACC. Then complete the exercises on this page.