Lex and YACC primer/HOWTO

3. Lex

The program Lex generates a so-called 'Lexer'. This is a function that takes a stream of characters as its input and, whenever it sees a group of characters that matches a key, takes a certain action. A very simple example:

%{
#include <stdio.h>
%}

%%
stop    printf("Stop command received\n");
start   printf("Start command received\n");
%%

The first section, in between the %{ and %} pair, is included directly in the output program. We need this because we use printf later on, which is declared in stdio.h.

Sections are separated using '%%', so the first line of the second section starts with the 'stop' key. Whenever the 'stop' key is encountered in the input, the rest of the line (a printf() call) is executed.

Besides 'stop', we've also defined 'start', which works the same way but prints a different message.

We terminate the code section with '%%' again.

To compile Example 1, do this:

lex example1.l
cc lex.yy.c -o example1 -ll

NOTE: If you are using flex, instead of lex, you may have to change '-ll' to '-lfl' in the compilation scripts. RedHat 6.x and SuSE need this, even when you invoke 'flex' as 'lex'!

This will generate the file 'example1'. If you run it, it waits for you to type some input. Whenever you type something that is not matched by any of the defined keys (i.e., 'stop' and 'start'), it is echoed back unchanged. If you enter 'stop', it will output 'Stop command received'.

Terminate with an EOF (^D).
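
To make the behaviour described above concrete, here is a rough hand-written C sketch of what the Example 1 lexer does with its input. It is not the code Lex generates: the function name scan_commands and its buffer-based interface are made up for illustration, and a real lexer reads from a stream rather than a string.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hand-written approximation of the Example 1 lexer (hypothetical helper,
 * not what lex emits). Scans `input`, writes the result into `out`
 * (at most outsz - 1 characters) and returns the number of commands seen.
 */
int scan_commands(const char *input, char *out, size_t outsz)
{
    size_t used = 0;
    int commands = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    while (*input != '\0') {
        const char *msg = NULL;
        size_t len = 0;

        /* Try each key at the current position. */
        if (strncmp(input, "stop", 4) == 0) {
            msg = "Stop command received\n";
            len = 4;
        } else if (strncmp(input, "start", 5) == 0) {
            msg = "Start command received\n";
            len = 5;
        }

        if (msg != NULL) {
            /* A key matched: run its action and skip the matched text. */
            int n = snprintf(out + used, outsz - used, "%s", msg);
            if (n < 0 || (size_t)n >= outsz - used) {
                out[used] = '\0';   /* drop the partial message */
                break;              /* output buffer full */
            }
            used += (size_t)n;
            input += len;
            commands++;
        } else {
            /* Nothing matched: copy one character through unchanged. */
            if (used + 1 >= outsz)
                break;
            out[used++] = *input++;
            out[used] = '\0';
        }
    }
    return commands;
}
```

Calling scan_commands("stop\n", buf, sizeof buf) fills buf with the message followed by the newline that was typed after 'stop', because that newline is not matched by any key and is therefore copied through.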

You may wonder how the program runs, as we didn't define a main() function. This function is defined for you in libl (liblex), which we linked in with the -ll option.

3.1 Regular expressions in matches

This example wasn't very useful in itself, and our next one won't be either. It will however show how to use regular expressions in Lex, which will be massively useful later on.

Example 2:

%{
#include <stdio.h>
%}

%%
[0123456789]+           printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("WORD\n");
%%

This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs. Regular expressions can be pretty daunting, but with only a little work they are easy to understand. Let's examine the NUMBER match:

[0123456789]+

This says: a sequence of one or more characters from the group 0123456789. We could also have written it more briefly as:

[0-9]+

Now, the WORD match is somewhat more involved:

[a-zA-Z][a-zA-Z0-9]*

The first part matches 1 and only 1 character that is between 'a' and 'z', or between 'A' and 'Z'. In other words, a letter. This initial letter then needs to be followed by zero or more characters which are either a letter or a digit. Why use an asterisk here? The '+' signifies 1 or more matches, but a WORD might very well consist of only one character, which we've already matched. So the second part may have zero matches, so we write a '*'.

This way, we've mimicked the behaviour of many programming languages which demand that a variable name *must* start with a letter, but can contain digits afterwards. In other words, 'temperature1' is a valid name, but '1temperature' is not.
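
The same rule can be spelled out by hand in C, which may help if the regular expression still looks opaque. This is only a sketch: is_word is a hypothetical helper, it tests a whole string rather than scanning a stream, and it assumes the default "C" locale, where isalpha() and isalnum() accept exactly the ASCII letters and digits.

```c
#include <ctype.h>

/* Returns 1 if the whole string s matches [a-zA-Z][a-zA-Z0-9]*, else 0. */
int is_word(const char *s)
{
    /* First part: exactly one letter. */
    if (!isalpha((unsigned char)*s))
        return 0;
    s++;

    /* Second part: zero or more letters or digits (the '*'). */
    while (isalnum((unsigned char)*s))
        s++;

    /* The string is a WORD only if nothing is left over. */
    return *s == '\0';
}
```

With this definition is_word("temperature1") returns 1, is_word("x") returns 1 because the second part is allowed to match nothing, and is_word("1temperature") returns 0.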

Try compiling Example 2, just like Example 1, and feed it some text. Here is a sample session:

$ ./example2
foo
WORD

bar
WORD

123
NUMBER

bar123
WORD

123bar
NUMBER
WORD

You may also be wondering where all this whitespace is coming from in the output. The reason is simple: it was in the input, and we don't match on it anywhere, so it gets output again.
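
The last entry of the session, where '123bar' produces a NUMBER followed by a WORD, follows from the lexer matching at the current position and then resuming right after the match. The C sketch below imitates this by hand. The names number_len, word_len and tokenize are invented for this illustration, the "C" locale is assumed, and, unlike the real lexer, unmatched characters are skipped rather than echoed.

```c
#include <ctype.h>
#include <stdio.h>

/* Length of the [0-9]+ match at the start of s (0 if there is none). */
static size_t number_len(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of the [a-zA-Z][a-zA-Z0-9]* match at the start of s (0 if none). */
static size_t word_len(const char *s)
{
    size_t n;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    n = 1;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/*
 * Writes one token name per line into out and returns the token count.
 * Simplification: characters that match neither rule are skipped.
 */
int tokenize(const char *input, char *out, size_t outsz)
{
    size_t used = 0;
    int tokens = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    while (*input != '\0') {
        size_t n = number_len(input);
        size_t w = word_len(input);
        const char *name;
        size_t len;
        int written;

        if (n > 0) {
            name = "NUMBER"; len = n;
        } else if (w > 0) {
            name = "WORD"; len = w;
        } else {
            input++;            /* no rule matches here */
            continue;
        }

        written = snprintf(out + used, outsz - used, "%s\n", name);
        if (written < 0 || (size_t)written >= outsz - used) {
            out[used] = '\0';   /* drop the partial name */
            break;              /* output buffer full */
        }
        used += (size_t)written;
        input += len;           /* resume right after the match */
        tokens++;
    }
    return tokens;
}
```

For the input "123bar" the WORD rule cannot match at the first character, so the digits are consumed as a NUMBER, and scanning resumes at 'b', where the WORD rule matches "bar". For "bar123" the WORD rule consumes everything in one go.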

The Flex manpage documents its regular expressions in detail. Many people feel that the perl regular expression manpage (perlre) is also very useful, although Flex does not implement everything perl does.

Make sure that you do not create zero length matches like '[0-9]*' - your lexer might get confused and start matching empty strings repeatedly.

3.2 A more complicated example for a C like syntax

Let's say we want to parse a file that looks like this:

logging {
        category lame-servers { null; };
        category cname { null; };
};

zone "." {
        type hint;
        file "/etc/bind/db.root";
};

We clearly see a number of categories (tokens) in this file:

  • WORDs, like 'zone' and 'type'
  • FILENAMEs, like '/etc/bind/db.root'
  • QUOTEs, like those surrounding the filename
  • OBRACEs, {
  • EBRACEs, }
  • SEMICOLONs, ;

The corresponding Lex file is Example 3:

%{
#include <stdio.h>
%}

%%
[a-zA-Z][a-zA-Z0-9]*    printf("WORD ");
[a-zA-Z0-9\/.-]+        printf("FILENAME ");
\"                      printf("QUOTE ");
\{                      printf("OBRACE ");
\}                      printf("EBRACE ");
;                       printf("SEMICOLON ");
\n                      printf("\n");
[ \t]+                  /* ignore whitespace */;
%%

When we feed our file to the program this Lex file generates (using example3.compile), we get:

WORD OBRACE 
WORD FILENAME OBRACE WORD SEMICOLON EBRACE SEMICOLON 
WORD WORD OBRACE WORD SEMICOLON EBRACE SEMICOLON 
EBRACE SEMICOLON 

WORD QUOTE FILENAME QUOTE OBRACE 
WORD WORD SEMICOLON 
WORD QUOTE FILENAME QUOTE SEMICOLON 
EBRACE SEMICOLON 

When compared with the configuration file mentioned above, it is clear that we have neatly 'Tokenized' it. Each part of the configuration file has been matched, and converted into a token.
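
One detail in this output deserves a note: 'lame-servers' is reported as a FILENAME although it starts like a WORD. When several rules match at the same position, Lex picks the one that matches the most text, and only on a tie does it prefer the rule listed first. The C sketch below imitates that choice for just the WORD and FILENAME rules. The function names are made up for illustration, the other rules of Example 3 are left out, and the "C" locale is assumed.

```c
#include <ctype.h>
#include <string.h>

/* Characters accepted by [a-zA-Z0-9]. */
static const char ALNUM_CHARS[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789";

/* Characters accepted by the FILENAME class [a-zA-Z0-9\/.-]. */
static const char FILENAME_CHARS[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789/.-";

/* Length of the WORD match [a-zA-Z][a-zA-Z0-9]* at the start of s. */
size_t word_match_len(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;
    return 1 + strspn(s + 1, ALNUM_CHARS);
}

/* Length of the FILENAME match [a-zA-Z0-9\/.-]+ at the start of s. */
size_t filename_match_len(const char *s)
{
    return strspn(s, FILENAME_CHARS);
}

/*
 * Chooses between the two rules: the longest match wins, and on a tie
 * the rule listed first in the Lex file (WORD) wins.
 */
const char *pick_rule(const char *s)
{
    size_t w = word_match_len(s);
    size_t f = filename_match_len(s);

    if (w == 0 && f == 0)
        return "NONE";
    return (f > w) ? "FILENAME" : "WORD";
}
```

For "lame-servers" the WORD rule stops at the hyphen after 4 characters while FILENAME covers all 12, so FILENAME wins. For "category" both rules match 8 characters, and WORD wins because it comes first in the file.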

And this is exactly what we need to put YACC to good use.

3.3 What we've seen

We've seen that Lex is able to read arbitrary input, and determine what each part of the input is. This is called 'Tokenizing'.

