IceWalkers.com - Linux Software downloads and news
Name : Password :
Linux SoftwareLinux RPMLinux HowtosLink UsAboutAdvertise

Lex and YACC primer/HOWTO

Search Howtos :Match :
Next Previous Contents

3. Lex

The program Lex generates a so called `Lexer'. This is a function that takes a stream of characters as its input, and whenever it sees a group of characters that match a key, takes a certain action. A very simple example:

%{
#include <stdio.h>
%}

%%
stop    printf("Stop command received\n");
start   printf("Start command received\n");
%%

The first section, in between the %{ and %} pair is included directly in the output program. We need this, because we use printf later on, which is defined in stdio.h.

Sections are separated using '%%', so the first line of the second section starts with the 'stop' key. Whenever the 'stop' key is encountered in the input, the rest of the line (a printf() call) is executed.

Besides 'stop', we've also defined 'start', which otherwise does mostly the same.

We terminate the code section with '%%' again.

To compile Example 1, do this:

lex example1.l
cc lex.yy.c -o example1 -ll

NOTE: If you are using flex, instead of lex, you may have to change '-ll' to '-lfl' in the compilation scripts. RedHat 6.x and SuSE need this, even when you invoke 'flex' as 'lex'!

This will generate the file 'example1'. If you run it, it waits for you to type some input. Whenever you type something that is not matched by any of the defined keys (ie, 'stop' and 'start') it's output again. If you enter 'stop' it will output 'Stop command received';

Terminate with a EOF (^D).

You may wonder how the program runs, as we didn't define a main() function. This function is defined for you in libl (liblex) which we compiled in with the -ll command.

3.1 Regular expressions in matches

This example wasn't very useful in itself, and our next one won't be either. It will however show how to use regular expressions in Lex, which are massively useful later on.

Example 2:

%{
#include <stdio.h>
%}

%%
[0123456789]+           printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("WORD\n");
%%

This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs. Regular expressions can be pretty daunting but with only a little work it is easy to understand them. Let's examine the NUMBER match:

[0123456789]+

This says: a sequence of one or more characters from the group 0123456789. We could also have written it shorter as:

[0-9]+

Now, the WORD match is somewhat more involved:

[a-zA-Z][a-zA-Z0-9]*

The first part matches 1 and only 1 character that is between 'a' and 'z', or between 'A' and 'Z'. In other words, a letter. This initial letter then needs to be followed by zero or more characters which are either a letter or a digit. Why use an asterisk here? The '+' signifies 1 or more matches, but a WORD might very well consist of only one character, which we've already matched. So the second part may have zero matches, so we write a '*'.

This way, we've mimicked the behaviour of many programming languages which demand that a variable name *must* start with a letter, but can contain digits afterwards. In other words, 'temperature1' is a valid name, but '1temperature' is not.

Try compiling Example 2, lust like Example 1, and feed it some text. Here is a sample session:

$ ./example2
foo
WORD

bar
WORD

123
NUMBER

bar123
WORD

123bar
NUMBER
WORD

You may also be wondering where all this whitespace is coming from in the output. The reason is simple: it was in the input, and we don't match on it anywhere, so it gets output again.

The Flex manpage documents its regular expressions in detail. Many people feel that the perl regular expression manpage (perlre) is also very useful, although Flex does not implement everything perl does.

Make sure that you do not create zero length matches like '[0-9]*' - your lexer might get confused and start matching empty strings repeatedly.

3.2 A more complicated example for a C like syntax

Let's say we want to parse a file that looks like this:

logging {
        category lame-servers { null; };
        category cname { null; };
};

zone "." {
        type hint;
        file "/etc/bind/db.root";
};

We clearly see a number of categories (tokens) in this file:

  • WORDs, like 'zone' and 'type'
  • FILENAMEs, like '/etc/bind/db.root'
  • QUOTEs, like those surrounding the filename
  • OBRACEs, {
  • EBRACEs, }
  • SEMICOLONs, ;

The corresponding Lex file is Example 3:

%{
#include <stdio.h>
%}

%%
[a-zA-Z][a-zA-Z0-9]*    printf("WORD ");
[a-zA-Z0-9\/.-]+        printf("FILENAME ");
\"                      printf("QUOTE ");
\{                      printf("OBRACE ");
\}                      printf("EBRACE ");
;                       printf("SEMICOLON ");
\n                      printf("\n");
[ \t]+                  /* ignore whitespace */;
%%

When we feed our file to the program this Lex file generates (using example3.compile), we get:

WORD OBRACE 
WORD FILENAME OBRACE WORD SEMICOLON EBRACE SEMICOLON 
WORD WORD OBRACE WORD SEMICOLON EBRACE SEMICOLON 
EBRACE SEMICOLON 

WORD QUOTE FILENAME QUOTE OBRACE 
WORD WORD SEMICOLON 
WORD QUOTE FILENAME QUOTE SEMICOLON 
EBRACE SEMICOLON 

When compared with the configuration file mentioned above, it is clear that we have neatly 'Tokenized' it. Each part of the configuration file has been matched, and converted into a token.

And this is exactly what we need to put YACC to good use.

3.3 What we've seen

We've seen that Lex is able to read arbitrary input, and determine what each part of the input is. This is called 'Tokenizing'.


Next Previous Contents
Search Howtos :Match :
ImageMagick 6.6.0-7
ImageMagick image processing studio
Phorum 5.2.15
Web based discussion software written in PHP.
Gtkmm 2.19.7
C++ interface for popular GUI library GTK+
Sylpheed 3.0.1
Mail User Agent based on GTK+
PhpMyAdmin 3.3.1
Php front-end to MySQL administration
NVidia driver 195.36.15
Linux unified nVidia driver
Monkey HTTP Daemon 0.10.0-rc4
Monkey is a small and fast web server for linux
GTK2 2.18.9
GUI Toolkit
GLib2 2.22.5
Library of useful routines for C programming
WebGUI 7.8.15
A fully featured content management system.
Free IT Magazines, White Papers, eBooks, and more !
Oracle Magazine

Contains technology strategy articles, sample code, tips, Oracle and partner news, how to articles for developers and DBAs, and more.

Vulnerability Management for Dummies

Get all the Facts and See How to Implement a Successful Vulnerability Management Program.

Website Magazine

Has tapped premier talent in the Internet industry for our content and each and every issue will contain practical advice and insights for website owners.

Linux Software Map
Find Linux RPM
Best Rated Linux Software
Most Rated Linux Software
Linux Distributions
Linux Howtos
Quick Survey

Please take our survey and help us improve our website to serve you better.

Thank you.
Linux Software
Linux / IT Resources
Site Resources
Google
Privacy Policy
Contact Us
Submit Software
Advertising info