OmniMark Programming Principles

www.serverside.com.au

Chapter 3
Pattern Matching and General File Processing.




3.1: Using a file as an input stream

OmniMark is a stream processing language. Most often an external file is used as a data stream, although an operating system's standard input stream or the output from some other program can also be used. For example, when OmniMark is used for Common Gateway Interface (CGI) applications, its input stream comes from a web server.

In this chapter we are concerned with how OmniMark uses pattern matching techniques to process the data in a stream, and we will be using input files as the stream source. An input file can be specified explicitly by name within an OmniMark program or can be identified by a filename on the command line. In either case, the file is streamed through the program and submitted to some pattern matching rules. In the following examples I will be using a file called 'timetable.dat' as input. This file contains a fragment of a university's weekly lecture and tutorial timetable and is part of the file which was used as an example in chapter 1 of this booklet. For future reference, the entire contents of this file are presented here:

EEB121 THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L     Mon  1300 - 1350 S15 - 2.05
T1    Wed  1400 - 1450 C02 - 112
T2    Wed  1300 - 1350 C02 - 112
T2    Thu  1300 - 1350 S01 - 102
T1    Thu  0900 - 0950 S01 - 101
T3    Thu  1000 - 1050 S01 - 101
T3    Thu  1400 - 1450 C03 - 403

EEB322 ISSUES IN CARE & EDUCATION
Subject co-ordinator: T. Simpson
L     Tue  0900 - 0950 S01 - 102
T1    Tue  1100 - 1250 C08 - 1.04
T2    Tue  1400 - 1550 C08 - 1.04

3.1.1 Explicitly naming an input file

To explicitly open the file 'timetable.dat' within an OmniMark program we can write a process rule like this:

[Code Sample: C03T01a.xom]

  ; submitting a file

  process
    submit file "timetable.dat"

3.1.2 Using a command line name as an input file

It is often convenient to write a more general OmniMark program which submits a file named on the command line. The following program does this:

[Code Sample: C03T01b.xom]

  ; submitting a file named on the command line

  process
    submit file #command-line-names item 1

Here the identifier '#command-line-names' is a shelf (an OmniMark array, see Chapter 5, Topic 1) and contains all the command line arguments supplied when OmniMark is called. The qualifier 'item 1' refers to the first command line argument. This program is equivalent to the previous one when called with the command:

  omnimark -sb C03T01b.xom timetable.dat
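
A program like this fails if it is called without a file name. The following is only a sketch, not one of the booklet's numbered samples, of how the program might guard against that. It assumes that the 'number of' test on a shelf and the '#error' stream behave as they do in standard OmniMark; the wording of the usage message is my own.

  ; sketch: only submit when a file name was supplied

  process
    do when number of #command-line-names = 0
      ; no arguments: report the problem on the error stream
      put #error "usage: omnimark -sb C03T01b.xom input-file%n"
    else
      submit file #command-line-names item 1
    done

If no argument is supplied, the message goes to the error stream and nothing is submitted.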

If you are using the IDE version of OmniMark you can get the same general purpose effect as the command line version by identifying input file names in your project options. Type the program into the IDE editor, save it and then choose 'Create Project' from the 'File' menu. Once this is done, choose 'Project Options' from the 'Edit' menu. A dialog will be presented. Choose the 'Input' tab on this dialog and use the 'Browse' button to identify your input file.

The following screen shot is taken from the OmniMark IDE project options dialog and shows how the file 'timetable.dat' has been specified as an input file.


3.2: How pattern matching works

OmniMark's pattern matching facility works like a filter. The input stream flows into the program and the pattern matching rules in the program are used to catch any incoming data which matches certain conditions (patterns). It is a principle of OmniMark programming that any data which is found by a pattern matching rule is consumed, that is, removed from the stream. Any data which is not found flows through the program and onto the output stream. This is why, in the previous topic's sample programs, the output is a complete copy of the input: there are no pattern matching rules in the programs so no data is consumed.

There are a number of examples of this kind of processing in the following topics. For now, I present a couple of small samples to demonstrate the above principle. The following program takes the file 'timetable.dat' as input and consumes everything in it. There is no output from the program.

[Code Sample: C03T02a.xom]

001  ; consuming all of the input
002  
003  process
004    submit file "timetable.dat"
005  
006  find any

The statement on line 6 is a pattern matching rule or 'find rule'. In this case the pattern it matches is specified as 'any'. The 'any' pattern means any single character, and as each character from the timetable file comes in, it is immediately consumed by this rule. To get a clear understanding of the principle involved you should compare the results of the program 'C03T01a.xom' with the results of the above program 'C03T02a.xom'. The first one contains no find rules and consumes nothing (all input flows to output) and the second contains a find rule which consumes everything (no output at all).

A final sample is provided here. It contains a single find rule which consumes all of the digits (and only the digits) of the input file.

[Code Sample: C03T02b.xom]

  ; consuming all the digits

  process
    submit file "timetable.dat"

  find digit

The output from this program is all of the input file except the characters which represent digits:

EEB THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L     Mon   -  S - .
T    Wed   -  C - 
T    Wed   -  C - 
T    Thu   -  S - 
T    Thu   -  S - 
T    Thu   -  S - 
T    Thu   -  C - 

EEB ISSUES IN CARE & EDUCATION
Subject co-ordinator: T. Simpson
L     Tue   -  S - 
T    Tue   -  C - .
T    Tue   -  C - .
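
A find rule can also write something in place of the data it consumes. The following sketch is a variation on C03T02b rather than one of the numbered samples. It replaces each run of digits with a single '#', and relies on the '+' occurrence operator which is described in topic 3.4.

  ; sketch: replace every run of digits with '#'

  process
    submit file "timetable.dat"

  find digit+
    output "#"

With this rule the line 'L     Mon  1300 - 1350 S15 - 2.05' should come out as 'L     Mon  # - # S# - #.#'. Everything that is not a digit still flows through untouched because no rule consumes it.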


3.3: Finding and processing information in streams

With the above principle in mind, this topic introduces some further examples of pattern matching and shows how data which matches patterns can be captured for useful purposes.

3.3.1 Capturing subject codes

To start with, we will capture and output all the subject codes from the input file. To do this, it is first necessary to establish what pattern matches these subject codes. By inspection we can see that all subject codes are formed by three letters followed by three digits - the subject 'EEB121' is an example. We can further refine the pattern by noting that each subject code in the file 'timetable.dat' appears at the start of a new line. If we make the assumption that all sequences of three letters and three digits starting on a new line are actually subject codes then we should be able to write an OmniMark find rule to consume them.

In this case we actually want to keep the subject codes and output them rather than consume them, and we want to consume and discard everything else from the input stream. The solution is to match the pattern for subject codes with a find rule, capture the codes into a variable, output the variable and consume all other characters. The following sample program does this:

[Code Sample: C03T03a.xom]

001  ; capture and output all subject codes
002  
003  process
004    submit file "timetable.dat"
005  
006  find line-start (letter{3} digit{3}) => subjectCode
007    output "%x(subjectCode)%n"
008  
009  find any

The crucial pattern appears on line 6. The 'line-start' pattern matches the beginning of all new lines. The expression

(letter{3} digit{3})

matches a sequence of exactly three letters 'letter{3}' followed by exactly three digits 'digit{3}'. Notice that these two patterns are grouped in a pair of brackets. Also on line 6, the action

=> subjectCode

captures the whole bracketed pattern into a pattern variable called 'subjectCode', which is output on line 7. The correct format modifier to use with pattern variables is '%x'. The find rule on line 9 finds and consumes all other characters from the input stream.
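
Without the brackets, a capture applies only to the pattern immediately before the '=>'. The following sketch, which is not one of the numbered samples, uses this to capture the letters and the digits of each subject code separately. The variable names 'prefix' and 'serial' are my own.

  ; sketch: capture the two halves of a subject code separately

  process
    submit file "timetable.dat"

  find line-start letter{3} => prefix digit{3} => serial
    output "%x(prefix)-%x(serial)%n"

  find any

For the timetable file this should print 'EEB-121' and 'EEB-322'.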

3.3.2 Capturing lecturers' names

Another suitable task is to capture all the lecturers' names from the timetable file. The pattern we want to find starts on a new line with the literal sequence 'Subject co-ordinator:', followed by some blank space and then the lecturer's name, which runs to the end of the line. As in the previous program, we want to keep the lecturers' names and consume and discard all other text. The following sample program does the work:

[Code Sample: C03T03b.xom]

001  ; capturing lecturer's names
002  
003  process
004    submit file "timetable.dat"
005  
006  find line-start
007       "Subject co-ordinator:"
008        white-space+
009       any-text+ => lectName
010    output "%x(lectName)%n"
011  
012  find any

Lines 6, 7, 8 and 9 contain the pattern we are looking for. Patterns can be specified over several lines as shown, and literal strings can be matched as shown on line 7. The 'white-space+' pattern on line 8 matches one or more spaces, tabs or newlines, and the pattern 'any-text+' matches one or more characters other than the newline. No brackets have been used in the above rule so it is just the characters matched by the 'any-text+' pattern which are captured into the 'lectName' variable.

3.3.3 Finding multiple patterns.

To capture both the lecturer's name and the subject code we can combine the find rules into one program. When doing this you should keep in mind that the program is responding to data events and these occur in the order the data streams into the program. It is obvious by looking at the timetable file that subject codes appear before lecturers' names, so subject codes will be captured first. Suppose we want to produce output such as

L. Harrison teaches EEB121
T. Simpson teaches EEB322

which contains the lecturer's name before the subject code. In this case we will have to save the subject code into a global variable and output it after finding the corresponding lecturer. The program below actually uses two global variables which are set as the appropriate pattern is matched. Both values are output after each lecturer's name is captured:

[Code Sample: C03T03c.xom]

  ; capturing names and codes

  global stream name
  global stream code

  process
    submit file "timetable.dat"

  find line-start (letter{3} digit{3}) => subjectCode
    set code to subjectCode

  find line-start
       "Subject co-ordinator:"
       white-space+
       any-text+ => lectName
    set name to lectName

    output "%g(name) teaches %g(code)%n"

  find any

The important principle here is that find rules fire in the order the matching data arrives, not in the order the rules appear in the program. If you want to output values in a different order from the input you have to save early values so you can output them later. Rule order does matter, however, when more than one rule can match at the same point in the input: the rules are tried in the order they are written and the first one that matches consumes the data. As an extreme example, if the 'find any' rule is the first rule in the program, it will consume and discard every character and the other, more specific rules will never fire. You should try doing this just to confirm that there is no output. The 'find any' rule is the most general rule of all, so you should always place more specific rules before general ones.
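
The extreme case described above can be tried with a sketch like the following. It is simply C03T03a with the 'find any' rule moved ahead of the subject code rule, and is not one of the numbered samples.

  ; sketch: a general rule placed before a specific one

  process
    submit file "timetable.dat"

  ; tried first at every position, so it consumes every character
  find any

  ; never fires, because 'find any' has already matched
  find line-start (letter{3} digit{3}) => subjectCode
    output "%x(subjectCode)%n"

Swapping the two find rules back should restore the list of subject codes.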


3.4: Character classes and special patterns

You have probably gleaned that words like 'digit', 'letter' and 'any-text' have special meanings as standardised patterns in OmniMark. These are correctly called 'character classes' and in this topic I will summarise several of the most used ones, together with some special patterns such as 'line-start' which match a position in the input rather than a character.

3.4.1 Occurrence operators

When a character class, such as 'letter', is used by itself it matches exactly one single character of the named class. To match one or more characters of a class we can use the '+' symbol, such as

  find letter+  ; match one or more letters

To match zero or more characters in a class, we use the '*' symbol:

  find digit*  ; match no digits, one digit or many

To match zero or one characters of a class use '?', such as

  find lc? ; match either zero or exactly 1 lower case letter

A specific number of characters in a class can be matched by specifying the number inside braces:

  find any{5}  ; match any sequence of 5 chars

and we can match a range of characters in a class like this

  find uc{3 to 7}  ; match 3, 4, 5, 6 or 7 upper case letters
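
As a sketch of how these operators combine in a complete rule (it is not one of the numbered samples), the following program picks out the start and end times from 'timetable.dat'. It uses the 'blank' class, which matches a space or a tab, and the variable names 'startTime' and 'endTime' are my own.

  ; sketch: four digits, a dash, then four more digits

  process
    submit file "timetable.dat"

  find digit{4} => startTime blank+ "-" blank+ digit{4} => endTime
    output "%x(startTime) to %x(endTime)%n"

  find any

The first line of output should be '1300 to 1350'. Subject codes and room numbers are not reported because none of them contains a run of four digits.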

3.4.2 Some character classes

The following table lists several character classes which OmniMark can understand:

Class                  Meaning
any                    any character
letter                 an alphabetic character
uc                     an uppercase letter
lc                     a lowercase letter
digit                  a digit: 0...9
%t                     a tab char (ASCII 9)
%n                     a newline char (ASCII 10: LF)
%r                     a return char (ASCII 13: CR)
space                  a space (ASCII 32)
blank                  a space or tab
white-space            a space or tab or newline
any-text               any char except '%n'
line-start             start of a line
line-end               end of a line
word-start             start of a word
word-end               end of a word
["ab,c"]               any one of 'a' or 'b' or ',' or 'c'
[any except "ab,c"]    any char other than 'a' or 'b' or ',' or 'c'

The occurrence operators '+', '*', or '?' can be used to modify the number of characters matched by any class as can the '{}' modifiers.
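
A custom-made class can be mixed freely with the built-in ones. The following sketch, which is not one of the numbered samples, assumes that every building code in 'timetable.dat' is an 'S' or a 'C' followed by two digits, which is true of the fragment listed in topic 3.1. The variable names 'building' and 'room' are my own.

  ; sketch: split each location into building and room

  process
    submit file "timetable.dat"

  find (["SC"] digit{2}) => building " - " any-text+ => room
    output "building %x(building), room %x(room)%n"

  find any

The first line of output should be 'building S15, room 2.05'. The brackets are needed so that the letter and the two digits are captured together.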

3.4.3 Literal characters

To match a specific known character or sequence of characters a literal string can be used in a find rule:

  find "Hello"  ; match 'Hello'

To match a literal without being case sensitive we can use:

  find ul"Hello" ; match 'Hello', 'HELLO', 'hElLo' etc
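
As a sketch of 'ul' in a complete program (not one of the numbered samples), the lecturer rule from C03T03b can be made insensitive to the case of its label. I am assuming here that 'ul' applies only to the literal string which follows it.

  ; sketch: match the label whatever its case

  process
    submit file "timetable.dat"

  find line-start
       ul "SUBJECT CO-ORDINATOR:"
       white-space+
       any-text+ => lectName
    output "%x(lectName)%n"

  find any

The output should be the same two names that C03T03b produces, even though the label is written in upper case in the rule and in mixed case in the file.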


3.5: Looking ahead

Suppose an input stream contains the string

the bright brown fox

How do we capture all the text up to but not including the word 'brown'? We could try this:

[Code Sample: C03T05a.xom]

  process
    submit "the bright brown fox"

  find [any except "b"]+ => pat "brown"
    output "%x(pat)%n"

  find any

in an attempt to match all characters up to but not including the 'b' of 'brown'. This would output:

right 

Can you see why?

A second attempt might tempt us to try to capture all text up to the entire word 'brown', thus:

[Code Sample: C03T05b.xom]

  process
    submit "the bright brown fox"

  find [any except "brown"]+ => pat "brown"
    output "%x(pat)%n"

  find any

for which the output is:

ight

Can you see why?

The problem here is that the custom-made character class

[any except "brown"]+

matches all characters until any one of the characters in 'brown' is found. What we actually want to do is to match up to the entire word 'brown' and for this we can use the 'lookahead' pattern. Here is a correct solution:

[Code Sample: C03T05c.xom]

  process
    submit "the bright brown fox"

  find ((lookahead not "brown") any)+ => pat "brown"
    output "%x(pat)%n"

  find any

The pattern

((lookahead not "brown") any)+

consumes all characters until the entire sequence 'brown' is located.
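
The same technique works with any multi-character delimiter. The following sketch, which is not one of the numbered samples, splits a subject heading at the sequence ' & '. The variable names 'leftPart' and 'rightPart' are my own.

  ; sketch: split a string at a multi-character delimiter

  process
    submit "EEB322 ISSUES IN CARE & EDUCATION"

  find ((lookahead not " & ") any)+ => leftPart
       " & "
       any+ => rightPart
    output "%x(leftPart) | %x(rightPart)%n"

This should print 'EEB322 ISSUES IN CARE | EDUCATION'. No 'find any' rule is needed here because the single rule consumes the whole input.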


3.6: Successful OmniMark programming

The tasks at the end of this chapter invite you to try devising find rules for yourself and if you understand the principles of pattern matching described in this chapter and refer to the table of character classes above you should be able to solve them. As with all programming languages, successful programming in OmniMark requires plenty of practice. You can only really learn the material if you are prepared to write programs, have them fail, find out why they failed, debug them, and write them again correctly. The following list encapsulates a modest strategy for pattern matching programming with OmniMark:



Tasks

Task 1

Using the file 'timetable.dat' listed in topic 3.1 above as input, write a program to print the names of all the subjects.

Task 2

Write a program to output all the rooms used on Mondays from the timetable.

Task 3

Write a program to output all the days, times and rooms lectures are given. Lectures are identified with the letter 'L' at the start of a line.

Task 4

Write a program to count how many tutorials occur on Thursdays. Tutorials start with a 'T' and on Thursdays are followed by the string "Thu". Note that you are not required to output any timetable information, just the number of times Thursday tutorials occur.

Task 5

The audio-visual officer needs to contact all the lecturers who use room 'S01 - 101' to inform them that the projector is away for repair. Write a program to output a report listing all the names of all the lecturers who use this room with the day and the time the room is used by them. Place a heading over the report.


Sample Solutions

Solution 1

A subject name occurs after the subject code and extends to the end of the current line.

[Code Sample: C03S01.xom]

  ; find subject names

  process
    submit file "timetable.dat"

  find line-start
       letter{3} digit{3}
       white-space+
       any-text+ => subName "%n"

    output "%x(subName)%n"

  find any

Solution 2

Rooms used on Mondays appear at the end of the lines which contain "Mon".

[Code Sample: C03S02.xom]

  ; Monday's rooms

  process
    submit file "timetable.dat"

  find line-start
       any{2}  ; 'L ', 'T2' etc
       white-space+
       "Mon"
       [any except letter]+  ; up to the first letter
                             ; of a room number
       any-text+ => room
       "%n"

    output "%x(room)%n"

  find any

Solution 3

[Code Sample: C03S03.xom]

  ; find lecture details

  process
    submit file "timetable.dat"

  find line-start
       "L"
       white-space+
       any-text+ => lectDetails
       "%n"

    output "%x(lectDetails)%n"

  find any

Solution 4

[Code Sample: C03S04.xom]

  ; count Thursday tutes

  global counter thTutes initial {0}

  process
    submit file "timetable.dat"

  process-end
    output "There are %d(thTutes) tutorials on Thursdays%n"

  find line-start
       "T" digit
       white-space+
       "Thu"

    increment thTutes

  find any

Solution 5

[Code Sample: C03S05.xom]

  ; Lecturers who use S01 - 101

  global stream lectName

  process-start
    output "Lecturers and times for use with S01 - 101%n"

  process
    submit file "timetable.dat"

  find line-start
       "Subject co-ordinator:"
       white-space+
       any-text+ => aName "%n"
    set lectName to aName

  find line-start
       any{2}
       white-space+
       letter{3} => day
       white-space+
       (digit+ white-space+ "-" white-space+ digit+) => time
       white-space+
       "S01 - 101"
    output "%g(lectName) on %x(day) at %x(time)%n"

  find any