OmniMark is a stream processing language. Most often an external file is used as a data stream, although an operating system's standard input stream or the output from some other program can also be used. For example, when OmniMark is used for Common Gateway Interface (CGI) applications, its input stream comes from a web server.
In this chapter we are concerned with how OmniMark uses pattern matching techniques to process the data in a stream, and we will be using input files as the stream source. An input file can be specified explicitly by name within an OmniMark program or can be identified by a filename on the command line. In either case, the file is streamed through the program and submitted to some pattern matching rules. In the following examples I will be using a file called 'timetable.dat' as input. This file contains a fragment of a university's weekly lecture and tutorial timetable and is part of the file which was used as an example in chapter 1 of this booklet. For future reference, the entire contents of this file are presented here:
EEB121 THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L Mon 1300 - 1350 S15 - 2.05
T1 Wed 1400 - 1450 C02 - 112
T2 Wed 1300 - 1350 C02 - 112
T2 Thu 1300 - 1350 S01 - 102
T1 Thu 0900 - 0950 S01 - 101
T3 Thu 1000 - 1050 S01 - 101
T3 Thu 1400 - 1450 C03 - 403
EEB322 ISSUES IN CARE & EDUCATION
Subject co-ordinator: T. Simpson
L Tue 0900 - 0950 S01 - 102
T1 Tue 1100 - 1250 C08 - 1.04
T2 Tue 1400 - 1550 C08 - 1.04
To explicitly open the file 'timetable.dat' within an OmniMark program we can write a process rule like this:
[Code Sample: C03T01a.xom]
001 ; submitting a file
002
003 process
004 submit file "timetable.dat"
It is often convenient to write a more general OmniMark program which submits a file named on the command line. The following program does this:
[Code Sample: C03T01b.xom]
001 ; submitting a file named on the command line
002
003 process
004 submit file #command-line-names item 1
Here the identifier '#command-line-names' is a shelf (an OmniMark array, see Chapter 5, Topic 1) that contains all the command line arguments supplied when OmniMark is called. The qualifier 'item 1' refers to the first command line argument. This program is equivalent to the previous one when called with the command:
omnimark -sb C03T01b.xom timetable.dat
If you are using the IDE version of OmniMark you can get the same general purpose effect as the command line version by identifying input file names in your project options. Type the program into the IDE editor, save it and then choose 'Create Project' from the 'File' menu. Once this is done, choose 'Project Options' from the 'Edit' menu. A dialog will be presented. Choose the 'Input' tab on this dialog and use the 'Browse' button to identify your input file.
The following screen shot is taken from the OmniMark IDE project options dialog and shows how the file 'timetable.dat' has been specified as an input file.
OmniMark's pattern matching facility works like a filter. The input stream flows into the program and the pattern matching rules in the program are used to catch any incoming data which matches certain conditions (patterns). It is a principle of OmniMark programming that any data which is found by any pattern matching rule is consumed, that is, removed from the stream. Any data which is not found flows through the program and onto the output stream. This is why, in the previous topic's sample programs, the output is a complete copy of the input: there are no pattern matching rules in the programs, so no data is consumed.
There are a number of examples of this kind of processing in the following topics. For now, I present a couple of small samples to demonstrate the above principle. The following program takes the file 'timetable.dat' as input and consumes everything in it. There is no output from the program.
[Code Sample: C03T02a.xom]
001 ; consuming all of the input
002
003 process
004 submit file "timetable.dat"
005
006 find any
The statement on line 6 is a pattern matching rule or 'find rule'. In this case the pattern it matches is specified as 'any'. The 'any' pattern means any single character, and as each character from the timetable file comes in, it is immediately consumed by this rule. To get a clear understanding of the principle involved you should compare the results of the program 'C03T01a.xom' with the results of the above program 'C03T02a.xom'. The first one contains no find rules and consumes nothing (all input flows to output) and the second contains a find rule which consumes everything (no output at all).
A final sample is provided here. It contains a single find rule which consumes all of the digits (and only the digits) of the input file.
[Code Sample: C03T02b.xom]
001 ; consuming all the digits
002
003 process
004 submit file "timetable.dat"
005
006 find digit
The output from this program is all of the input file except the characters which represent digits:
EEB THE E/C PROFESSION: AN INTRO
Subject co-ordinator: L. Harrison
L Mon - S - .
T Wed - C -
T Wed - C -
T Thu - S -
T Thu - S -
T Thu - S -
T Thu - C -
EEB ISSUES IN CARE & EDUCATION
Subject co-ordinator: T. Simpson
L Tue - S -
T Tue - C - .
T Tue - C - .
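For readers who are more at home in a general-purpose language, the consume-or-pass-through principle can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not OmniMark code and not a description of how OmniMark is implemented. The function name `submit`, and the use of Python regular expressions to stand in for the 'any' and 'digit' patterns, are assumptions made for illustration.

```python
import re

def submit(stream, find_rules):
    """Toy model: text matched by a rule is consumed, the rest is output."""
    output = []
    pos = 0
    while pos < len(stream):
        for rule in find_rules:
            match = rule.match(stream, pos)
            if match and match.end() > pos:
                pos = match.end()        # matched text is consumed
                break
        else:
            output.append(stream[pos])   # no rule matched: pass through
            pos += 1
    return "".join(output)

# Hypothetical stand-ins for the OmniMark patterns 'any' and 'digit'
find_any = re.compile(r".", re.DOTALL)
find_digit = re.compile(r"[0-9]")

line = "L Mon 1300 - 1350 S15 - 2.05\n"

print(repr(submit(line, [])))            # no rules: output is a copy of the input
print(repr(submit(line, [find_any])))    # everything consumed: empty output
print(repr(submit(line, [find_digit])))  # only the digits are consumed
```

With no rules the line comes back unchanged, with the 'any' stand-in nothing comes back, and with the 'digit' stand-in only the digits disappear, which mirrors the three sample programs above.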
With the above principle in mind, this topic introduces some further examples of pattern matching and shows how data which matches patterns can be captured for useful purposes.
To start with, we will capture and output all the subject codes from the input file. To do this, it is first necessary to establish what pattern matches these subject codes. By inspection we can see that all subject codes are formed by three letters followed by three digits - the subject 'EEB121' is an example. We can further refine the pattern by noting that each subject code in the file 'timetable.dat' appears at the start of a new line. If we make the assumption that all sequences of three letters and three digits starting on a new line are actually subject codes then we should be able to write an OmniMark find rule to consume them.
In this case we actually want to keep the subject codes and output them rather than consume them, and we want to consume and discard everything else from the input stream. The solution is to match the pattern for subject codes with a find rule, capture the codes into a variable, output the variable and consume all other characters. The following sample program does this:
[Code Sample: C03T03a.xom]
001 ; capture and output all subject codes
002
003 process
004 submit file "timetable.dat"
005
006 find line-start (letter{3} digit{3}) => subjectCode
007 output "%x(subjectCode)%n"
008
009 find any
The crucial pattern appears on line 6. The 'line-start' pattern matches the beginning of all new lines. The expression
(letter{3} digit{3})
matches a sequence of exactly three letters 'letter{3}' followed by exactly three digits 'digit{3}'. Notice that these two patterns are grouped in a pair of brackets. Also on line 6, the action
=> subjectCode
captures the whole bracketed pattern into a pattern variable called 'subjectCode' which is output in line 7. The correct format modifier to use with pattern variables is '%x'. The find rule on line 9 finds and consumes all other characters from the input stream.
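If you already know regular expressions, it may help to see a rough Python counterpart of the subject code pattern. This is an analogy only: `re.MULTILINE` with `^` stands in for 'line-start', and the ASCII ranges standing in for 'letter' and 'digit' are an assumption. The sample text is a shortened copy of 'timetable.dat'.

```python
import re

timetable = (
    "EEB121 THE E/C PROFESSION: AN INTRO\n"
    "Subject co-ordinator: L. Harrison\n"
    "L Mon 1300 - 1350 S15 - 2.05\n"
    "EEB322 ISSUES IN CARE & EDUCATION\n"
    "Subject co-ordinator: T. Simpson\n"
)

# Rough counterpart of: line-start (letter{3} digit{3}) => subjectCode
subject_code = re.compile(r"^([A-Za-z]{3}[0-9]{3})", re.MULTILINE)

# Keep only the captured codes; everything else is discarded
codes = [match.group(1) for match in subject_code.finditer(timetable)]

for code in codes:
    print(code)
```

The sketch prints one subject code per line, EEB121 and then EEB322, which is the result the find rule on line 6 is designed to produce.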
Another suitable task is to capture all the lecturers' names from the timetable file. The pattern we want starts on a new line with the literal sequence 'Subject co-ordinator:', followed by some blank space and then the lecturer's name, which runs to the end of the line. As in the previous program, we want to keep the lecturers' names and consume and discard all other text. The following sample program does the work:
[Code Sample: C03T03b.xom]
001 ; capturing lecturer's names
002
003 process
004 submit file "timetable.dat"
005
006 find line-start
007 "Subject co-ordinator:"
008 white-space+
009 any-text+ => lectName
010 output "%x(lectName)%n"
011
012 find any
Lines 6, 7, 8 and 9 contain the pattern we are looking for. Patterns can be specified over several lines as shown, and literal strings can be matched as on line 7. The 'white-space+' pattern on line 8 matches one or more white-space characters (here, the spaces or tabs after the colon) and the pattern 'any-text+' matches one or more characters other than the newline. No brackets have been used in the above rule, so only the characters matched by the 'any-text+' pattern are captured into the 'lectName' variable.
To capture both the lecturer's name and the subject code we can combine the find rules into one program. When doing this you should keep in mind that the program is responding to data events and these occur in the order the data streams into the program. Looking at the timetable file, subject codes appear before lecturers' names, so subject codes will be captured first. Suppose we want to produce output such as
L. Harrison teaches EEB121
T. Simpson teaches EEB322
which contains the lecturer's name before the subject code. In this case we will have to save the subject code into a global variable and output it after finding the corresponding lecturer. The program below actually uses two global variables, each of which is set when the appropriate pattern is matched. Both values are output after each lecturer's name is captured:
[Code Sample: C03T03c.xom]
001 ; capturing names and codes
002
003 global stream name
004 global stream code
005
006 process
007 submit file "timetable.dat"
008
009 find line-start (letter{3} digit{3}) => subjectCode
010 set code to subjectCode
011
012 find line-start
013 "Subject co-ordinator:"
014 white-space+
015 any-text+ => lectName
016 set name to lectName
017
018 output "%g(name) teaches %g(code)%n"
019
020 find any
The important principle here is that although you can place find rules in any order in your OmniMark programs, the order in which they fire is determined solely by the ordering of the incoming data. If you want to output values in a different order to the input, you have to save early values so you can output them later. As an extreme example of the principle, if the 'find any' rule is the first rule in the program, it will consume and discard all characters and the other, more specific rules will never fire. You should try doing this just to confirm that there is no output. The 'find any' rule is the most general rule of all. You should always place more specific rules before general ones.
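The idea of saving an early value for later output can be sketched in Python as well. This is an analogy rather than OmniMark: each alternative of the regular expression stands in for one find rule, `finditer` reports matches in the order they occur in the input, and the variable `saved_code` plays the part of the global 'code'. All names here are hypothetical, and the sample text is a shortened copy of 'timetable.dat'.

```python
import re

timetable = (
    "EEB121 THE E/C PROFESSION: AN INTRO\n"
    "Subject co-ordinator: L. Harrison\n"
    "L Mon 1300 - 1350 S15 - 2.05\n"
    "EEB322 ISSUES IN CARE & EDUCATION\n"
    "Subject co-ordinator: T. Simpson\n"
    "L Tue 0900 - 0950 S01 - 102\n"
)

# One alternative per find rule; matches arrive in input order
events = re.compile(
    r"^(?P<code>[A-Za-z]{3}[0-9]{3})"
    r"|^Subject co-ordinator:[ \t]+(?P<name>[^\n]+)",
    re.MULTILINE,
)

saved_code = None   # remembered until the lecturer's name turns up
report = []
for match in events.finditer(timetable):
    if match.group("code") is not None:
        saved_code = match.group("code")          # seen first, output later
    else:
        report.append(f"{match.group('name')} teaches {saved_code}")

print("\n".join(report))
```

The code is seen first but printed second, because it is held in a variable until the name that follows it has been matched.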
You have probably gleaned that words like 'digit', 'letter', 'any-text' etc have special meanings as standardised patterns in OmniMark. These are correctly called 'character classes' and in this topic I will summarise several of the most used ones.
When a character class, such as 'letter', is used by itself it matches exactly one single character of the named class. To match one or more characters of a class we can use the '+' symbol, such as
find letter+ ; match one or more letters
To match zero or more characters in a class, we use the '*' symbol:
find digit* ; match no digits, one digit or many
To match zero or one characters of a class use '?', such as
find lc? ; match either zero or exactly 1 lower case letter
A specific number of characters in a class can be matched by specifying the number inside braces:
find any{5} ; match any sequence of 5 chars
and we can match a range of characters in a class like this
find uc{3 to 7} ; match 3, 4, 5, 6 or 7 upper case letters
The following table lists several character classes which OmniMark can understand:
| Class | Meaning |
| --- | --- |
| any | any character |
| letter | an alphabetic character |
| uc | an uppercase letter |
| lc | a lowercase letter |
| digit | a digit: 0...9 |
| %t | a tab char (ASCII 9) |
| %n | a newline char (ASCII 10: LF) |
| %r | a return char (ASCII 13: CR) |
| space | a space (ASCII 32) |
| blank | a space or tab |
| white-space | a space or tab or newline |
| any-text | any char except '%n' |
| line-start | start of a line |
| line-end | end of a line |
| word-start | start of a word |
| word-end | end of a word |
| ["ab,c"] | any one of 'a' or 'b' or ',' or 'c' |
| [any except "ab,c"] | any char other than 'a' or 'b' or ',' or 'c' |
The occurrence operators '+', '*', or '?' can be used to modify the number of characters matched by any class as can the '{}' modifiers.
To match a specific known character or sequence of characters a literal string can be used in a find rule:
find "Hello" ; match 'Hello'
To match a literal without being case sensitive we can use:
find ul"Hello" ; match 'Hello', 'HELLO', 'hElLo' etc
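Readers who know regular expressions may find it useful to see rough counterparts of the occurrence operators and literals above. The mapping below is an approximation written for this comparison, not an official equivalence: the ASCII ranges chosen for 'letter', 'lc', 'uc' and 'digit' are an assumption, and the helper `matched` is hypothetical.

```python
import re

# OmniMark pattern -> rough regular-expression counterpart (ASCII assumed)
counterparts = {
    'letter+':    r"[A-Za-z]+",    # one or more letters
    'digit*':     r"[0-9]*",       # zero or more digits
    'lc?':        r"[a-z]?",       # zero or one lower case letter
    'any{5}':     r"(?s).{5}",     # exactly five characters of any kind
    'uc{3 to 7}': r"[A-Z]{3,7}",   # between three and seven upper case letters
    '"Hello"':    r"Hello",        # case-sensitive literal
    'ul"Hello"':  r"(?i)Hello",    # literal matched without regard to case
}

def matched(omnimark_pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(counterparts[omnimark_pattern], text)
    return m.group() if m else None

print(matched('letter+', "Mon 1300"))         # Mon
print(repr(matched('digit*', "Mon 1300")))    # '' (zero digits is still a match)
print(matched('uc{3 to 7}', "ABCDEFGHIJ"))    # ABCDEFG
print(matched('"Hello"', "hELLO"))            # None
print(matched('ul"Hello"', "hELLO"))          # hELLO
```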
Suppose an input stream contains the string
the bright brown fox
How do we capture all the text up to but not including the word 'brown'? We could try this:
[Code Sample: C03T05a.xom]
001
002 process
003 submit "the bright brown fox"
004
005 find [any except "b"]+ => pat "brown"
006 output "%x(pat)%n"
007
008 find any
in an attempt to match all characters up to but not including the 'b' of 'brown'. This would output:
right
Can you see why?
A second attempt might tempt us to try to capture all text up to the entire word 'brown', thus:
[Code Sample: C03T05b.xom]
001 process
002 submit "the bright brown fox"
003
004 find [any except "brown"]+ => pat "brown"
005 output "%x(pat)%n"
006
007 find any
for which the output is:
ight
Can you see why?
The problem here is that the custom-made character class
[any except "brown"]+
matches all characters until any one of the characters in 'brown' is found. What we actually want to do is to match up to the entire word 'brown' and for this we can use the 'lookahead' pattern. Here is a correct solution:
[Code Sample: C03T05c.xom]
001 process
002 submit "the bright brown fox"
003
004 find ((lookahead not "brown") any)+ => pat "brown"
005 output "%x(pat)%n"
006
007 find any
The pattern
((lookahead not "brown") any)+
consumes all characters until the entire sequence 'brown' is located.
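The three attempts can be compared side by side using Python regular expressions, where `(?!brown)` is the negative lookahead. This is an analogy, not OmniMark: Python's matching engine is not OmniMark's, but for this input the captured text agrees with the outputs described above. The labels in the dictionary are just descriptive names.

```python
import re

stream = "the bright brown fox"

# Rough counterparts of the three find rules discussed above
attempts = {
    "any except 'b'":     r"([^b]+)brown",
    "any except 'brown'": r"([^brown]+)brown",
    "lookahead not":      r"((?:(?!brown).)+)brown",
}

results = {}
for label, pattern in attempts.items():
    results[label] = re.search(pattern, stream).group(1)
    print(f"{label}: {results[label]!r}")
```

The first two patterns capture 'right ' and 'ight ' respectively, because the character class stops at the first excluded character. Only the lookahead version captures 'the bright ', since it tests for the whole word 'brown' before accepting each character.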
The tasks at the end of this chapter invite you to try devising find rules for yourself and if you understand the principles of pattern matching described in this chapter and refer to the table of character classes above you should be able to solve them. As with all programming languages, successful programming in OmniMark requires plenty of practice. You can only really learn the material if you are prepared to write programs, have them fail, find out why they failed, debug them, and write them again correctly. The following list encapsulates a modest strategy for pattern matching programming with OmniMark:
Using the file 'timetable.dat' listed in topic 3.1 above as input, write a program to print the names of all the subjects.
Write a program to output all the rooms used on Mondays from the timetable.
Write a program to output all the days, times and rooms at which lectures are given. Lectures are identified with the letter 'L' at the start of a line.
Write a program to count how many tutorials occur on Thursdays. Tutorials start with a 'T' and on Thursdays are followed by the string "Thu". Note that you are not required to output any timetable information, just the number of times Thursday tutorials occur.
The audio-visual officer needs to contact all the lecturers who use room 'S01 - 101' to inform them that the projector is away for repair. Write a program to output a report listing all the names of all the lecturers who use this room with the day and the time the room is used by them. Place a heading over the report.
A subject name occurs after the subject code and extends to the end of the current line.
[Code Sample: C03S01.xom]
001 ; find subject names
002
003 process
004 submit file "timetable.dat"
005
006 find line-start
007 letter{3} digit{3}
008 white-space+
009 any-text+ => subName "%n"
010
011 output "%x(subName)%n"
012
013 find any
Rooms used on Mondays appear at the end of the lines which contain "Mon":
[Code Sample: C03S02.xom]
001 ; Monday's rooms
002
003 process
004 submit file "timetable.dat"
005
006 find line-start
007 any{2} ;; 'L ', 'T2' etc
008 white-space+
009 "Mon"
010 [any except letter]+ ; up to the first letter
011 ; of a room number
012 any-text+ => room
013 "%n"
014
015 output "%x(room)%n"
016
017 find any
001 ; find lecture details
002
003 process
004 submit file "timetable.dat"
005
006 find line-start
007 "L"
008 white-space+
009 any-text+ => lectDetails
010 "%n"
011
012 output "%x(lectDetails)%n"
013
014 find any
001 ; count Thursday tutes
002
003 global counter thTutes initial {0}
004
005 process
006 submit file "timetable.dat"
007
008 process-end
009 output "There are %d(thTutes) tutorials on Thursdays%n"
010
011 find line-start
012 "T" digit
013 white-space+
014 "Thu"
015
016 increment thTutes
017
018 find any
001 ; Lecturers who use S01 - 101
002
003 global stream lectName
004
005 process-start
006 output "Lecturers and times for use with S01 - 101%n"
007
008 process
009 submit file "timetable.dat"
010
011 find line-start
012 "Subject co-ordinator:"
013 white-space+
014 any-text+ => aName "%n"
015 set lectName to aName
016
017 find line-start
018 any{2}
019 white-space+
020 letter{3} => day
021 white-space+
022 (digit+ white-space+ "-" white-space+ digit+) => time
023 white-space+
024 "S01 - 101"
025 output "%g(lectName) on %x(day) at %x(time)%n"
026
027
028 find any