REBOL for COBOL programmers

Parsing by examples

Date written: 14-SEP-2018
Date revised: 14-SEP-2018

The "parse" function in REBOL can be a bit hard to understand.
This document is not an attempt to explain it. Instead, it is a
collection of examples that we hope will become large enough so
that if you have a question about parsing you will be able to
find another person's solution and modify that for your own use.

Contents:

1. Target audience and references
2. Introduction
3. Examples
3.1 Finding input names on an html form
3.2 Script documentation a la powershell
3.3 Dissecting a CSV file
3.4 Parsing on something besides the usual delimiters
3.5 Counting leading spaces
3.6 Dividing text on the linefeed
3.7 Character type testing
3.8 COBOL word validation
3.9 Delimited substring
3.10 Log parsing example 1
3.11 Scanning the parts of a time value
3.12 Generating html from easier markup
3.13 Log parsing example 2
3.14 Picking off a comment block
3.15 Parsing on non-printable characters
3.16 Finding strings that start with something

1. Target audience and references

The target audience is the REBOL beginner who is having a terrible time using the "parse" function. It is assumed that the reader knows how to write and run REBOL scripts.

References

REBOL documentation about parsing

Nick Antonaccio's definitive guide about doing useful stuff with REBOL

http://www.codeconscious.com/rebol/parse-tutorial.html

2. Introduction

Since this document is written for beginners by a beginner, some notes on parsing would be in order.

Parsing refers to taking apart text. To a computer, text in memory or in a file is just a big string of characters, one after the other. It has no meaning. A good example is a computer program. How does a language interpreter make "sense" of the big string of characters that is a computer program? Consider a line like this:

COUNTER = COUNTER + 1

What might a computer have to do to make sense of that?

Basically, it would have to go through that text one character at a time. It would ignore the leading blanks, and when it found the first "C" it would store that somewhere. Then it would pick off characters, adding them one after the other to that first "C", until it hit the first blank. Then it would have a "token" with the value "COUNTER". It might then check a symbol table to see if that token had been encountered before, and make an entry in the table if "COUNTER" could not be found.

Then it would skip over blanks to the next non-blank which would be the equals sign. That might send the program off to some area where it would scan the input looking for items that constituted an "expression" which would be more tokens that are words, or operators, or numbers, and so on.

That operation of taking apart the text is referred to as parsing. You could do it in any language, but in some languages it would be hard. REBOL tries to make it easier by providing the "parse" function. But how do you do that in a general-purpose way that is useful? REBOL uses an embedded mini-language to describe what you want done with the input.
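As a rough illustration of the token idea described above, here is a minimal sketch that splits the sample line into tokens on blanks. The names LINE, TOKENS, TOKEN, BLANK, and NONBLANK are made up for this sketch, and a real interpreter would do far more than this (symbol tables, expressions, and so on).

```rebol
REBOL [Title: "Token sketch"]
;; -- Hypothetical sample line with some leading blanks.
LINE: "    COUNTER = COUNTER + 1"
TOKENS: copy []
BLANK: charset " "
NONBLANK: complement BLANK
parse/all LINE [
    any [
        some BLANK                                       ;; skip over blanks
        | copy TOKEN some NONBLANK (append TOKENS TOKEN) ;; pick off one token
    ]
]
probe TOKENS
```

With the sample line above, the block of tokens should come out as ["COUNTER" "=" "COUNTER" "+" "1"].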

This document is a collection of parsing examples compiled with the hope that with enough examples you will either be able to understand the parsing function or be able to find an example you can modify to solve your own parsing problem.

Remember this key concept as you read this: By a beginner for beginners.

3. Examples

These examples are harvested from wherever they could be found. If someone else wrote them, credit is given.

3.1 Finding input names on an html form

Let's say you have an html page with a form, and you want to write a program to process the form. The processing for one input item of the form is going to be similar to the processing for all items. You will have to check if the item has been filled in, check its length, check its value for validity, and so on. You could, in theory, generate code to do all that if only you could get your hands on the names of the input items. Let's say the form looked like this:

<html>
<head><title></title></head>
<body>
<form action="http://website/cgi-bin/testprogram.py" method="post">
Data-name-1: <input type="text" size="30" name="DATA-NAME-1"><br>
Data-name-2: <input type="text" size="30" name="DATA-NAME-2"><br>
Data-name-3: <input type="text" size="30" name="DATA-NAME-3"><br>
<input type="submit" name="SUBMITBUTTON" value="Process">
</form>
</body>
</html>

How could you get your hands on the "name" attributes? Look at the text for a pattern. Each name is preceded by the text name=" and terminated by the next quote. So if you could scan through the name=" and then pick off characters to the next quote, and repeat that to the end of the text, you would have found all the names. The rule that makes that happen is as follows, assuming the html text is called HTMLTEXT.

NAMES: copy []
parse HTMLTEXT [
    any [thru {name="} copy NM to {"} (append NAMES to-string NM)] to end 
]
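To try the rule without an html file, here is a small self-contained sketch with a shortened, hard-coded version of the form. The shortened HTMLTEXT is an assumption made for the test; the rule is the same one shown above.

```rebol
REBOL [Title: "Input name sketch"]
;; -- Hypothetical cut-down form, hard-coded so the sketch runs by itself.
HTMLTEXT: {<form action="http://website/cgi-bin/testprogram.py" method="post">
Data-name-1: <input type="text" size="30" name="DATA-NAME-1"><br>
<input type="submit" name="SUBMITBUTTON" value="Process">
</form>}
NAMES: copy []
parse HTMLTEXT [
    any [thru {name="} copy NM to {"} (append NAMES to-string NM)] to end
]
probe NAMES
```

For this input the result should be ["DATA-NAME-1" "SUBMITBUTTON"].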

This example is tidy enough that you could package it into a useful function. The function could take the name of an html file that contained a form, and return a block of the input names on the form. Like this:

PARSE-INPUT-NAMES: func [
    HTMLFILE
    /local HTMLTEXT NAMES NM ;; NM is local so it does not leak into the global context
] [
    HTMLTEXT: read HTMLFILE
    NAMES: copy []
    parse HTMLTEXT [
        any [thru {name="} copy NM to {"} (append NAMES to-string NM)] to end
    ]
    return NAMES
]

3.2 Script documentation a la powershell

Here is a use of parsing that can aid with documenting scripts. Often, there is more motivation to keep documentation up to date if it is right there in the scripts.

In REBOL, comments can precede the REBOL header. Powershell has a scheme of placing documentation in the front of a script, with section headers. Consider the following idea for REBOL.

TITLE
Test script to make sure the program runs.
SUMMARY
This is a demo script that you can run to make sure
things are working.  It has the minimum code to do something.
DOCUMENTATION
Modify the code as needed to do something useful.
Make sure you have the interpreter installed if you want
to double-click the script to run it.  Otherwise, you
could make a batch file to run the script with the
command-line switches:
-i -s --script (script-name)
SCRIPT
REBOL [
    Title:  "COB global services module"
]
    alert "Script has run"

Notice how the documentation is in front of the REBOL header, divided into sections called TITLE, SUMMARY, and DOCUMENTATION, with the script itself under the section SCRIPT.

Assuming you don't use those words elsewhere in the script, it is possible to pull out those four sections by parsing the script file. Then you could do whatever you want with those sections. One idea would be to make a web page of documentation of all the scripts in some folder.

The code to extract those sections looks like this:

LIST-TITLE: ""
LIST-SUMMARY: ""
LIST-DOCUMENTATION: ""
LIST-SCRIPT: ""
LIST-FILE-DATA: read LIST-FILE-PATH
;;  -- Extract the four parts of the documentation.
parse/case LIST-FILE-DATA [thru "TITLE" copy LIST-TITLE to "SUMMARY"]
parse/case LIST-FILE-DATA [thru "SUMMARY" copy LIST-SUMMARY to "DOCUMENTATION"]
parse/case LIST-FILE-DATA [thru "DOCUMENTATION" copy LIST-DOCUMENTATION to "SCRIPT"]
parse/case LIST-FILE-DATA [thru "SCRIPT" copy LIST-SCRIPT to end]

The above is an example of "brute-force programming." You actually do not need to parse the input four times; one parse will do with the following rules:

parse/case LIST-FILE-DATA [ 
    thru "TITLE" copy LIST-TITLE to "SUMMARY" 
    thru "SUMMARY" copy LIST-SUMMARY to "DOCUMENTATION" 
    thru "DOCUMENTATION" copy LIST-DOCUMENTATION to "SCRIPT" 
    thru "SCRIPT" copy LIST-SCRIPT to end 
]

If you want to account for missing sections, AND you are sure that none of the section titles appears anywhere in the code, then you could do this:

parse/case LIST-FILE-DATA [ 
    any [ 
        thru "TITLE" copy LIST-TITLE to "SUMMARY" 
        | 
        thru "SUMMARY" copy LIST-SUMMARY to "DOCUMENTATION" 
        | 
        thru "DOCUMENTATION" copy LIST-DOCUMENTATION to "SCRIPT" 
        | 
        thru "SCRIPT" copy LIST-SCRIPT to end 
    ] 
]

Thanks to "johnk" on rebolforum.com for the last two variations.

3.3 Dissecting a CSV file

An item that begs to be parsed is a file with data separated by a delimiter. A CSV file is one such example, where the data items on each line are separated by commas, and often there is a header with column headings separated by commas. Something like this:

name,address,birthdate
"John Smith","1800 W Old Shakopee Rd",01-JAN-2000
"Jane Smith","2100 1ST Ave",01-FEB-1995
"Jared Smith","3500 2ND St",01-MAR-1998

This data can be separated with the simple text splitting of "parse," no fancy rules required.

To get the data, you could skip over the known heading line, but here is another idea. Because REBOL is an interpreted language, it can sort of write itself on the fly. So we can do this with the first line of headings:

CSV-LINES: read/lines CSV-FILE ;; Read the file into a block of lines.
CSV-HEADINGS: parse/all first CSV-LINES ","
CSV-WORDS: copy []
foreach CSV-HEADING CSV-HEADINGS [
    if not-equal? "" trim CSV-HEADING [
        append CSV-WORDS to-word trim/all/with CSV-HEADING " #"
    ] 
]

Notice that the text file is a block of lines, and we parse the first line. For each parsed item, if it is not blank, we filter out spaces and other problem characters, and add it to a block of words, AFTER we convert it to a REBOL word (parsing will get it originally as a string). If the heading is a REBOL word, we can assign values to it.

Elsewhere in the program, we can parse a data line like this:

CSV-VALUES: parse/all CSV-RECORD ","
CSV-VAL-COUNTER: 0
foreach CSV-WORD CSV-WORDS [
    CSV-VAL-COUNTER: CSV-VAL-COUNTER + 1
    TEMP-VAL: pick CSV-VALUES CSV-VAL-COUNTER
    either TEMP-VAL [
        set CSV-WORD trim TEMP-VAL ;; can only trim if it exists   
    ] [
        set CSV-WORD TEMP-VAL
    ]
]

We break apart the data on commas, and then match the data items we parse, one-for-one, with the words we parsed from the heading line. For each heading word, we set its value to the matching data item parsed from the data line.

For the sake of clarity, other factors are left out. For example, what if a data field itself contains commas? As they say in math class, the rest is left as an exercise.

3.4 Parsing on something besides the usual delimiters

From "Ingo" on rebolforum.com.

Parse was designed to be powerful, and so the less you have to specify the more powerful it is. But sometimes you want to have a little more control for the price of having to do a little more work. In this example, the string to be parsed contains a lot of the characters that parse splits on automatically, but in this case we don't want that. We want to split on the pipe character only.

REBOL [Title: "Parse test"]
tmp: {a|bc|"d,e"|""something"more"|g} 
out: copy []
parse/all tmp [any [copy val to "|" (append out val) skip ] copy val to end (append out val)] 
probe out
halt

Running the above gives the desired result:

["a" "bc" {"d,e"} {""something"more"} "g"]
>>

3.5 Counting leading spaces

From "Chris" on rebolforum.com

Parsing can be used to find and count leading spaces in a string, by parsing off the leading spaces and finding the length of the resulting string, as shown in this example:

REBOL [Title: "Parse test"]
STR: "        XXXXXXXX YYYYY" 
BLANKS: charset " " 
NONBLANKS: complement BLANKS
parse/all STR [                          
    copy LEADING-SPACES any BLANKS 
    copy REST-OF-STRING to end 
]
LEADING-SPACE-COUNT: length? LEADING-SPACES
probe LEADING-SPACES
probe REST-OF-STRING
probe LEADING-SPACE-COUNT
halt

The result:

"        "
"XXXXXXXX YYYYY"
8
>>
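One case that may be worth checking: if the string has no leading spaces at all, the "copy" may leave LEADING-SPACES set to none in REBOL 2, and "length?" would then fail. We have not tested that on every version. The sketch below avoids the question by marking the position after the blanks and computing the count from the index. The function name COUNT-LEADING-SPACES and the word MARK are made up for this sketch.

```rebol
REBOL [Title: "Leading space count sketch"]
COUNT-LEADING-SPACES: func [
    STR [string!]
    /local MARK
] [
    ;; -- Skip any blanks, remember where we stopped, then run to the end.
    parse/all STR [any #" " MARK: to end]
    ;; -- The index is 1-based, so subtract 1 to get the number of blanks.
    (index? MARK) - 1
]
probe COUNT-LEADING-SPACES "        XXXXXXXX YYYYY"
probe COUNT-LEADING-SPACES "XXXXXXXX YYYYY"
```

The first call should give 8 and the second should give 0.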

3.6 Dividing text on the linefeed

Thanks to "sqlab" on rebolforum.com for help with this.

To read text as lines, you would use the read/lines function. But it can happen that text comes from somewhere else such that you can't read/lines, but the text still is lines. You can parse it apart on the linefeed in the following way. This example assumes you have some text on the clipboard.

RAW-LINES: read clipboard://
TEMP-LINES: parse/all RAW-LINES "^/"

In the above example, TEMP-LINES would be a block of text lines.
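If you want to try this without the clipboard, here is a small sketch with a hard-coded string standing in for the clipboard text. The sample text is made up for the test.

```rebol
REBOL [Title: "Linefeed split sketch"]
;; -- "^/" is the REBOL escape for the linefeed character.
RAW-LINES: "line one^/line two^/line three"
TEMP-LINES: parse/all RAW-LINES "^/"
probe TEMP-LINES
```

Because of the /all refinement, the blanks inside each line are not treated as delimiters, so the result should be ["line one" "line two" "line three"].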

3.7 Character type testing

Normally you would want to use parsing to take apart text. But the parse function also returns a true/false value: true if it gets to the end of the input without finding any data that makes the parsing fail. You can take advantage of that by using parse to answer a yes-or-no kind of question, as in this example. A common task when checking user input is to ask if a numeric item is indeed numeric, or if an alphanumeric item is indeed alphanumeric. In this example, the parsing is successful if the scan of the input indicates that every character is indeed "some" number, or "some" letter, or "some" alphanumeric character.

REBOL [Title: "Character type tests"]
NUMERIC: charset [#"0" - #"9"]
ALPHABETIC: charset [#"A" - #"Z" #"a" - #"z"]
ALPHANUMERIC: union ALPHABETIC NUMERIC
STR: "12345" 
print [STR ":"]
print ["Numeric: " parse STR [some NUMERIC]] 
print ["Alphabetic: " parse STR [some ALPHABETIC]] 
print ["Alphanumeric: " parse STR [some ALPHANUMERIC]] 
print "------------------------------"
STR: "ABCde" 
print [STR ":"]
print ["Numeric: " parse STR [some NUMERIC]] 
print ["Alphabetic: " parse STR [some ALPHABETIC]] 
print ["Alphanumeric: " parse STR [some ALPHANUMERIC]] 
print "------------------------------"
STR: "123ab" 
print [STR ":"]
print ["Numeric: " parse STR [some NUMERIC]] 
print ["Alphabetic: " parse STR [some ALPHABETIC]] 
print ["Alphanumeric: " parse STR [some ALPHANUMERIC]] 
print "------------------------------"
STR: " a 1@" 
print [STR ":"]
print ["Numeric: " parse STR [some NUMERIC]] 
print ["Alphabetic: " parse STR [some ALPHABETIC]] 
print ["Alphanumeric: " parse STR [some ALPHANUMERIC]] 
print "------------------------------"
halt

The result:

12345 :
Numeric:  true
Alphabetic:  false
Alphanumeric:  true
------------------------------
ABCde :
Numeric:  false
Alphabetic:  true
Alphanumeric:  true
------------------------------
123ab :
Numeric:  false
Alphabetic:  false
Alphanumeric:  true
------------------------------
 a 1@ :
Numeric:  false
Alphabetic:  false
Alphanumeric:  false
------------------------------
>>
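A test like this can be packaged as a small function. Because plain parse (without /all) gives spaces special treatment in REBOL 2, a string with embedded blanks might be judged differently than you expect; we have not tested every case. The sketch below uses parse/all so that a blank is just another character. The function name IS-NUMERIC? is made up for this sketch.

```rebol
REBOL [Title: "Numeric test sketch"]
NUMERIC: charset [#"0" - #"9"]
IS-NUMERIC?: func [
    STR [string!]
] [
    ;; -- True only if every character, blanks included, is a digit.
    parse/all STR [some NUMERIC]
]
print IS-NUMERIC? "12345"
print IS-NUMERIC? "12 345"
print IS-NUMERIC? ""
```

The first call should print true. The second should print false because of the embedded blank, and the third should print false because "some" requires at least one character.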

3.8 COBOL word validation

As a variant of the previous example, we can refine the type checking by testing for a more specific arrangement of characters, the COBOL word. In this case, the first character must be a letter, but after that anything goes as long as the remaining characters are letters, numbers, or the hyphen separator. If this understanding of the COBOL word format is outdated, the valid character set could be adjusted.

REBOL [Title: "Test for COBOL word"]
LETTER: charset [#"A" - #"Z"] 
DIGIT: charset [#"0" - #"9"]
COBOLWORD: [ 
    1 LETTER 0 29 [LETTER | DIGIT | "-"] 
]
;            0        1         2         3  
;            123456789012345678901234567890
print parse "123456"          COBOLWORD ;; should be false; starts with number
print parse "ABCDEF"          COBOLWORD ;; should be true; all letters
print parse "A-1-STEAK-SAUCE" COBOLWORD ;; should be true; starts with letter
print parse "4RUNNER"         COBOLWORD ;; should be false; starts with number
print parse "AVERAGE$"        COBOLWORD ;; should be false; invalid character
print parse "A----BCDE"       COBOLWORD ;; should be true; multiple - allowed
print parse "X"               COBOLWORD ;; should be true; single character allowed
print parse "A-VERY-LONG-WORD-WITH-VALID-CHAR" COBOLWORD ;; should be false; too long
halt
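The COBOLWORD rule relies on repetition counts: "1 LETTER" means exactly one letter, and "0 29 [...]" means anywhere from zero to 29 of the items in the block, which together cap the word at 30 characters. Here is a tiny sketch of how a count range behaves by itself, using a made-up rule on the letter "a".

```rebol
REBOL [Title: "Repetition count sketch"]
print parse/all "aaa"  [1 3 "a"] ;; within the range of 1 to 3
print parse/all "aaaa" [1 3 "a"] ;; a fourth "a" is left over, so parse fails
print parse/all ""     [0 3 "a"] ;; zero repetitions are allowed
```

These should print true, false, and true, in that order.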

3.9 Delimited substring

A common programming operation is to extract from a string of characters some substring. Usually it is specified by a starting position and a length or a starting position and an ending position, like the fifth through the tenth characters.

Here is a different substring operation that extracts a substring starting with a specified character and ending with a different specified character. In the example below, the input string is a file name that contains an address but also has other characters, specifically a leading number that is not part of the address and a page number marked by a hyphen that also is not part of the address. We want to extract the part that is the address and not those other parts.

Looking at the pattern of the input, we can see that if we could extract from the first blank up to the hyphen, that would give us what we want. To make the solution more general, we will write a function that will let us specify any starting and ending delimiters.

DELIMITED-SUBSTRING: func [
    STR
    START 
    STOP
    /local RESULT
] [
    RESULT: copy ""
    parse/all STR [any [thru START copy RESULT to STOP] to end]
    return RESULT
]
;; Test calls.  The third string has no blank or hyphen, so it prints an empty line.
print DELIMITED-SUBSTRING "00510 1800 W Old Shakopee RD-P1.tif" " " "-"
print DELIMITED-SUBSTRING "00510 9800 PENN AVE S-P1.tif" " " "-"
print DELIMITED-SUBSTRING "00510_9800_PENN_AVE_S_P1.tif" " " "-"
halt
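When the delimiters are not found, the function above returns an empty string, which cannot be told apart from a substring that really is empty. If that matters, one possible variation is to return none instead. This is only a sketch; the name DELIMITED-SUBSTRING-OR-NONE is made up, and unlike the original it takes the first delimited piece and does not look for later ones.

```rebol
REBOL [Title: "Delimited substring sketch"]
DELIMITED-SUBSTRING-OR-NONE: func [
    STR
    START
    STOP
    /local RESULT
] [
    RESULT: none
    ;; -- RESULT is set only if both delimiters are found, in order.
    parse/all STR [thru START copy RESULT to STOP to end]
    RESULT
]
probe DELIMITED-SUBSTRING-OR-NONE "00510 9800 PENN AVE S-P1.tif" " " "-"
probe DELIMITED-SUBSTRING-OR-NONE "00510_9800_PENN_AVE_S_P1.tif" " " "-"
```

The first call should give "9800 PENN AVE S" and the second should give none.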

3.10 Log parsing example 1

Consider a log of configuration changes with lines like this:

5/11/2017 1:29 PM|10.1.223.15|10.1.223.15|May 11 13:29:19 CDT: %SYS-5-CONFIG_I: Configured from console by mrsmith on vty0 (10.1.250.78)

A line like this represents some action taken by someone (mrsmith) and we want to report when this action was taken and by whom.

The lines are all alike, so look for the pattern. The line is divided into parts by the pipe character. Part 1 is a date and time, part 2 is some IP address, part 3 is an IP address, and part 4 is a message.

We can take apart each line on the pipe character, and then when we get that fourth part of a line, the message, we can look for the string " by " that is right before the user ID, and we can look for the IP address of the computer where the change was made by finding what is between the parentheses.

In the example below, the whole log file has been read into a block of lines with the read/lines function.

ID-CHAR: charset "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
foreach LINE CONFIG-LINES [
    WS-DATE: copy ""
    WS-SWITCH: copy ""
    WS-IP: copy ""
    WS-MSG: copy ""
    WS-USERID: copy ""
    WS-FROM: copy ""
    set [WS-DATE WS-SWITCH WS-IP WS-MSG ] parse/all LINE "|"
    parse/all/case WS-MSG [
        thru " by " copy WS-USERID some ID-CHAR
        thru "(" copy WS-FROM to ")"
    ]
;; -- Report on WS-DATE, WS-USERID, WS-FROM, as appropriate. 
]
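To see the pieces come out, here is a sketch that runs the same two parsing steps on the single sample line shown earlier, hard-coded so no log file is needed. The "to end" in the rule is an addition for this sketch so that parse itself returns true when both items are found.

```rebol
REBOL [Title: "Log line sketch"]
;; -- The sample log line from the text, hard-coded for the test.
LINE: {5/11/2017 1:29 PM|10.1.223.15|10.1.223.15|May 11 13:29:19 CDT: %SYS-5-CONFIG_I: Configured from console by mrsmith on vty0 (10.1.250.78)}
ID-CHAR: charset [#"a" - #"z" #"A" - #"Z"]
;; -- Step 1: split the line into its four parts on the pipe.
set [WS-DATE WS-SWITCH WS-IP WS-MSG] parse/all LINE "|"
;; -- Step 2: pick the user ID and the originating address out of the message.
parse/all/case WS-MSG [
    thru " by " copy WS-USERID some ID-CHAR
    thru "(" copy WS-FROM to ")"
    to end
]
print [WS-DATE WS-USERID WS-FROM]
```

For the sample line this should print the date and time, then mrsmith, then 10.1.250.78.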

3.11 Scanning the parts of a time value

This example obtains a block of three integers from a time value, the hours, minutes, and seconds. Obviously the example is taken from something larger that actually does something with those integers.

TIMESTRING: copy ""
TIMEBLOCK: copy []
TIMEPARTS: copy []
TIMESTRING: to-string TIMEVAL 
TIMEPARTS: parse/all TIMESTRING ":"
append TIMEBLOCK to-integer TIMEPARTS/1
append TIMEBLOCK to-integer TIMEPARTS/2
either TIMEPARTS/3 [
    append TIMEBLOCK to-integer TIMEPARTS/3
] [
    append TIMEBLOCK 0
]
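The fragment above assumes TIMEVAL has been set somewhere else. As a sketch, here is the same logic with a made-up TIMEVAL that has no seconds, which shows why the check on the third part is there.

```rebol
REBOL [Title: "Time parts sketch"]
;; -- Hypothetical time value with hours and minutes only.
TIMEVAL: 13:29
TIMEBLOCK: copy []
TIMEPARTS: parse/all to-string TIMEVAL ":"
append TIMEBLOCK to-integer TIMEPARTS/1
append TIMEBLOCK to-integer TIMEPARTS/2
;; -- TIMEPARTS/3 is none when the time has no seconds part.
either TIMEPARTS/3 [
    append TIMEBLOCK to-integer TIMEPARTS/3
] [
    append TIMEBLOCK 0
]
probe TIMEBLOCK
```

With this value the result should be [13 29 0].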

3.12 Generating html from easier markup

If you have seen the makedoc2 program from the rebol-dot-org script library, you have seen a major parse-fest. This example is a little piece of that idea, simplified for instructional purposes. The explanation of what is going on is in the script comments.

The marked up input text is hard-coded into the program to make it self-contained. The operation of parsing the text and creating html is put into a function to which we pass the marked-up text.

Thanks to "Chris" on rebolforum-dot-com for guidance.

REBOL []

;; [---------------------------------------------------------------------------]
;; [ Demo for the purpose of trying to understand parsing.                     ]
;; [                                                                           ]
;; [ This demo will transform simple text with one markup item into html.      ]
;; [ The one markup item is the === on one line that indicates a heading       ]
;; [ at the h1 level.  The text on that line should be trimmed and surrounded  ]
;; [ by the h1 tags.  The other lines of text should be divided on the         ]
;; [ blank line and be surrounded by the "p" tags.                             ]
;; [                                                                           ]
;; [ So text like this:                                                        ]
;; [                                                                           ]
;; [ ===Heading 1                                                              ]
;; [                                                                           ]
;; [ Paragraph 1-1                                                             ]
;; [                                                                           ]
;; [ ===Heading 2                                                              ]
;; [                                                                           ]
;; [ Paragraph 2-1                                                             ]
;; [                                                                           ]
;; [ Should be transformed to this:                                            ]
;; [                                                                           ]
;; [ <h1>Heading 1</h1>                                                        ]
;; [ <p>Paragraph 1-1</p>                                                      ]
;; [ <h1>Heading 2</h1>                                                        ]
;; [ <p>Paragraph 2-1</p>                                                      ]
;; [                                                                           ]
;; [ ...or something equivalent.                                               ]
;; [---------------------------------------------------------------------------]

;; -- This is the sample input data that we will parse. 
IN-TEXT: {
===Heading one

This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.

This is a second paragraph
that should have its own set of "p" tags.

===A second heading

The above heading would be emitted with the "h1"
tags.

And here is a second paragraph under the second 
heading just to show things are working
}

;; -- This will be the parsed input data with its html tags.
HTML-OUT: copy ""

; anything but newlines 
content: complement charset "^/" 

scan-doc: func [text [string!]][ 
     ; parse/all for rebol 2 
    parse/all text [ 
        any [ 
            newline 
            | "===" opt " " copy part some content ( 
                append HTML-OUT rejoin [
                    "<h1>"
                    part
                    "</h1>"
                    newline
                ]
            ) 
            | copy part [some content any [newline some content]] ( 
                append HTML-OUT rejoin [
                    "<p>"
                    part
                    "</p>"
                    newline
                ]
            ) 
        ] 
    ] 
] 

;; -- Parse IN-TEXT, mark it up, and append it to HTML-OUT.

scan-doc IN-TEXT

;; -- Display the output and halt for probing.
probe HTML-OUT
halt

Running the above produces this:

{<h1>Heading one</h1>
<p>This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.</p>
<p>This is a second paragraph
that should have its own set of "p" tags.</p>
<h1>A second heading</h1>
<p>The above heading would be emitted with the "h1"
tags.</p>
<p>And here is a second paragraph under the second
heading just to show things are working</p>
}
>>

The above example went straight from the input text to some html output. But taking some inspiration from the makedoc2 program let's try something different. Parse the input text, but instead of going to html, store the parsed data in an intermediate block. The block will be repeating pairs of two things, an identifying word and a string of text identified by the word.

In our case, when we parse a heading (with the three equal signs), we will add two items to our intermediate block. The first will be the word 'heading and the second will be the heading text.

For other non-marked text, we will generate, for each string delimited by the blank line, the word 'para and the text itself.

After we have parsed the input and made the temporary block, we will go through the temporary block and generate the html from that. Theoretically, structuring the code like this could allow us to have one module for parsing into the intermediate block, and then other modules for translating the intermediate block into several forms of output. This appears to have been the plan behind the makedoc2 program. The part that generates the html could be pulled out and replaced by a module that generates a pdf file, for example.

REBOL []

;; -- This is the sample input data that we will parse. 
IN-TEXT: {
===Heading one

This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.

This is a second paragraph
that should have its own set of "p" tags.

===A second heading

The above heading would be emitted with the "h1"
tags.

And here is a second paragraph under the second 
heading just to show things are working
}

;; -- This will be the parsed input data with its html tags.
HTML-OUT: copy ""

;; -- This is the parsed data in an intermediate form. 
TEMP-STORAGE: copy []

; anything but newlines 
content: complement charset "^/" 

scan-doc: func [text [string!]][ 
     ; parse/all for rebol 2 
     collect [ 
         parse/all text [ 
             any [ 
                 newline 
                 | "===" opt " " copy part some content ( 
                     keep 'heading 
                     keep part 
                 ) 
                 | copy part [some content any [newline some content]] ( 
                     keep 'para 
                     keep part 
                 ) 
             ] 
         ] 
     ] 
] 

;; -- Parse IN-TEXT into TEMP-STORAGE, then build HTML-OUT from that block.

TEMP-STORAGE: scan-doc IN-TEXT
probe TEMP-STORAGE
print "-------------------------------"

foreach [TAG TEXTLINE] TEMP-STORAGE [
    if equal? TAG 'heading [
        append HTML-OUT rejoin [
            "<h1>"
            TEXTLINE
            "</H1>" 
            newline
        ]
    ]
    if equal? TAG 'para [
        append HTML-OUT rejoin [
            "<p>"
            TEXTLINE
            "</p>" 
            newline
        ]
    ]
]

print HTML-OUT 

halt

Running the above script produces this result:

[heading "Heading one" para {This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.} para {This is a second paragraph
that should have its own set of "p" tags.} heading "A second heading" para {The above heading would be emitted with the "h1"
tags.} para {And here is a second paragraph under the second
heading just to show things are working}]
-------------------------------
<h1>Heading one</H1>
<p>This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.</p>
<p>This is a second paragraph
that should have its own set of "p" tags.</p>
<h1>A second heading</H1>
<p>The above heading would be emitted with the "h1"
tags.</p>
<p>And here is a second paragraph under the second
heading just to show things are working</p>
>>

So now, as an exercise that might or might not scale up to something bigger, let's try out that idea of using the intermediate block to generate some other form of output. Since we are testing on Windows, we will make a WORD document.

The trick we will use to make the WORD document is to generate a powershell script to make the WORD document, and then we would run the powershell script as a separate step. Or, the REBOL script could call the powershell script, but sometimes the calling operation does not work exactly as hoped. Here is a script that works for our small example. That is, it worked with WORD 2013 in 2018. You might get different results.

If you copy out this script to run it yourself, you will have to change the file name of the powershell script, and the file name of the word document, in the powershell code.

REBOL []

;; -- This is the sample input data that we will parse. 
IN-TEXT: {
===Heading one

This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.

This is a second paragraph
that should have its own set of "p" tags.

===A second heading

The above heading would be emitted with the "h1"
tags.

And here is a second paragraph under the second 
heading just to show things are working
}

;; -- This is the parsed data in an intermediate form. 
TEMP-STORAGE: copy []

; anything but newlines 
content: complement charset "^/" 

scan-doc: func [text [string!]][ 
     ; parse/all for rebol 2 
     collect [ 
         parse/all text [ 
             any [ 
                 newline 
                 | "===" opt " " copy part some content ( 
                     keep 'heading 
                     keep part 
                 ) 
                 | copy part [some content any [newline some content]] ( 
                     keep 'para 
                     keep part 
                 ) 
             ] 
         ] 
     ] 
] 

;; -- Parse IN-TEXT, put it into the intermediate form,
;; -- then use the intermediate form to generate a WORD document.

TEMP-STORAGE: scan-doc IN-TEXT
probe TEMP-STORAGE
print "-------------------------------"

;; -- Generate the WORD document by generating a powershell script
;; -- that will produce the document.  Cheating a bit.

PS-HEAD: {
$Word = New-Object -ComObject Word.Application
$Word.Visible = $True
$Document = $Word.Documents.Add()
$Selection = $Word.Selection
}

PS-FOOT: {
$Report = 'I:\ADocument.doc'
$Document.SaveAs([ref]$Report,[ref]$SaveFormat::wdFormatDocument)
$word.Quit()
$null = [System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$word)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable word 
}

PS-H1: {
$Selection.Style = 'Title'
$Selection.TypeText("<%WS-H1%>")
$Selection.TypeParagraph()
}

PS-P: {
$Selection.Style = 'Heading 1'
$Selection.TypeText("<%WS-P%>")
$Selection.TypeParagraph()
}

POWERSHELL-SCRIPT: ""
POWERSHELL-SCRIPT-ID: %APowershellScript.ps1
WS-H1: ""
WS-P: ""

append POWERSHELL-SCRIPT rejoin [ 
    PS-HEAD
    newline
]

foreach [TAG TEXTLINE] TEMP-STORAGE [
    replace/all TEXTLINE newline " "
    replace/all TEXTLINE {"} "'"
    if equal? TAG 'heading [
        WS-H1: copy TEXTLINE
        append POWERSHELL-SCRIPT rejoin [
            build-markup PS-H1
            newline
        ]
    ]
    if equal? TAG 'para [
        WS-P: copy TEXTLINE
        append POWERSHELL-SCRIPT rejoin [
            build-markup PS-P
            newline
        ]
    ]
]

append POWERSHELL-SCRIPT rejoin [ 
    PS-FOOT
    newline
]

write POWERSHELL-SCRIPT-ID POWERSHELL-SCRIPT
probe POWERSHELL-SCRIPT 

halt

Running the above script produces a powershell script that you would have to run separately, plus the following console output.

[heading "Heading one" para {This is a paragraph of text under heading one.
We would want it surrounded by the "p" tags.} para {This is a second paragraph
that should have its own set of "p" tags.} heading "A second heading" para {The above heading would be emitted with the "h1"
tags.} para {And here is a second paragraph under the second
heading just to show things are working}]
-------------------------------
{
$Word = New-Object -ComObject Word.Application
$Word.Visible = $True
$Document = $Word.Documents.Add()
$Selection = $Word.Selection


$Selection.Style = 'Title'
$Selection.TypeText("Heading one")
$Selection.TypeParagraph()


$Selection.Style = 'Heading 1'
$Selection.TypeText("This is a paragraph of text under heading one. We would want it surrounded by the 'p' tags.")
$Selection.TypeParagraph()


$Selection.Style = 'Heading 1'
$Selection.TypeText("This is a second paragraph that should have its own set of 'p' tags.")
$Selection.TypeParagraph()


$Selection.Style = 'Title'
$Selection.TypeText("A second heading")
$Selection.TypeParagraph()


$Selection.Style = 'Heading 1'
$Selection.TypeText("The above heading would be emitted with the 'h1' tags.")
$Selection.TypeParagraph()


$Selection.Style = 'Heading 1'
$Selection.TypeText("And here is a second paragraph under the second  heading just to show things are working")
$Selection.TypeParagraph()


$Report = 'I:\ADocument.doc'
$Document.SaveAs([ref]$Report,[ref]$SaveFormat::wdFormatDocument)
$word.Quit()
$null = [System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$word)
[gc]::Collect()
[gc]::WaitForPendingFinalizers()
Remove-Variable word

}
>>

3.13 Log parsing example 2

In a very specific situation that few readers will ever encounter, we have a set of log files kept for personal logging and time reporting. The format was invented in the days of punch cards and has multi-line log entries delimited by lines with a dollar sign in the first position. A need arose to scan several files in one operation, so a way was needed to parse out the individual log entries. The details are explained in the sample script below, which parses some hard-coded sample text.

This example has very little general-purpose value, but it does show how one could parse multiple-line pieces of text if one can find some pattern that marks off the pieces.

REBOL []
;; [---------------------------------------------------------------------------]
;; [ This is a sample program from a very specific situation where a person    ]
;; [ made up a format for personal log files and then after keeping logs for   ]
;; [ many years wanted to go back through them and search all entries for      ]
;; [ some key words.                                                           ]
;; [ A log file is a text file with repetitions of entries that look like      ]
;; [ this:                                                                     ]
;; [     $ TL (service-request-number) mm/dd/yyyy (hours) (activity-code)      ]
;; [      Multiple-line log text                                               ]
;; [     $ ENDTL                                                               ]
;; [ All we want to do is something simple; parse on "$ TL" through "$ ENDTL"  ]
;; [ and pick out the text in between.  With strings of text in hand,          ]
;; [ we can scan each for key words and report those we find.                  ]
;; [---------------------------------------------------------------------------]
LOG-TEXT: {
$ CO Monday
$ TL 9998 09/10/2018 7.5 MA
 Do a little of this and that.
 File a support case.
$ ENDTL
$ CO Tuesday
$ TL 9998 09/11/2018 7.5 MA
 Do a bunch of coding.
 Attend a meeting.
$ ENDTL
}

LOG-ENTRIES: copy []
parse LOG-TEXT [
    any [thru "$ TL" copy ENTRY to "$ ENDTL" (append LOG-ENTRIES ENTRY)] to end
]

probe LOG-ENTRIES 

halt

Running the script produces this result.

[{ 9998 09/10/2018 7.5 MA
 Do a little of this and that.
 File a support case.
} { 9998 09/11/2018 7.5 MA
 Do a bunch of coding.
 Attend a meeting.
}]
>>
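
The comments in the script mention scanning each entry for key words. A minimal sketch of that step follows, assuming the entries have already been parsed into a block as above. The names KEYWORD and MATCHES and the sample data are made up for illustration.

```rebol
REBOL []

;; Assumed sample data, in the shape produced by the parse above.
LOG-ENTRIES: [
    { 9998 09/10/2018 7.5 MA^/ Do a little of this and that.^/ File a support case.^/}
    { 9998 09/11/2018 7.5 MA^/ Do a bunch of coding.^/ Attend a meeting.^/}
]

KEYWORD: "meeting"    ;; hypothetical search word
MATCHES: copy []      ;; hypothetical block to collect the hits

foreach ENTRY LOG-ENTRIES [
    ;; "find" on a string returns none when the key word is absent
    if find ENTRY KEYWORD [
        append MATCHES ENTRY
    ]
]

print [length? MATCHES "entries contain" KEYWORD]
foreach ENTRY MATCHES [print ENTRY]
```

By default "find" ignores case, which is usually what one wants when searching old log text.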

3.14 Picking off a comment block

This is another variant of the previous idea of parsing off multi-line items. The example picks out a comment block from a language that has comment blocks, in this case T-SQL. The front of the script has a comment block delimited by /* and */, and the comments are written in a REBOL-readable format. So, after we parse off the comment block, we can "load" it and work with the data items in it. We could, for example, put a comment block on each script in a script library, and then parse them all to build an index of all scripts.

REBOL []

;; [---------------------------------------------------------------------------]
;; [ This sample parses off a comment block from the front of an sql script.   ]
;; [ If the comment block is formatted in a r-e-b-o-l readable format,         ]
;; [ data in the comment block could be used for indexing.                     ]
;; [---------------------------------------------------------------------------]

SCRIPTCODE: {
/*
AUTHOR: "sww"
DATE-WRITTEN: 01-JAN-1900
DATABASE: "accela"
SEARCH-WORDS: [crlf]
REMARKS: {Replace crlf in comments.
}
*/

SELECT REPLACE(ANY_COMMENTS, CHAR(13)+CHAR(10), ' ')
FROM dbo.TEMPSWW
}

COMMENTBLOCK: copy ""
parse/case SCRIPTCODE [thru "/*" copy COMMENTBLOCK to "*/"]
if greater? (length? COMMENTBLOCK) 0 [
    AUTHOR: none
    DATE-WRITTEN: none
    DATABASE: none
    SEARCH-WORDS: none
    REMARKS: none
    do load COMMENTBLOCK
]

probe AUTHOR
probe DATE-WRITTEN 
probe DATABASE
probe SEARCH-WORDS
probe REMARKS

halt

Running the above produces this:

"sww"
1-Jan-1900
"accela"
[crlf]
"Replace crlf in comments.^/"
>>
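
The indexing idea mentioned at the start of this example might be sketched as follows. This is only an illustration. The file names, the in-memory SCRIPTS block, and the INDEX layout are assumptions, and a real version would read each script from disk.

```rebol
REBOL []

;; Hypothetical in-memory "library": pairs of file name and script text.
SCRIPTS: [
    %fix-comments.sql {/*
AUTHOR: "sww"
SEARCH-WORDS: [crlf comments]
*/
SELECT 1}
    %list-permits.sql {/*
AUTHOR: "sww"
SEARCH-WORDS: [permits]
*/
SELECT 2}
]

INDEX: copy []    ;; hypothetical: pairs of file name and search words

foreach [FILE-ID CODE] SCRIPTS [
    COMMENTBLOCK: none
    parse/all CODE [thru "/*" copy COMMENTBLOCK to "*/" to end]
    if COMMENTBLOCK [
        AUTHOR: none
        SEARCH-WORDS: none
        ;; As in the example above, "do" sets the words named in the block
        do load COMMENTBLOCK
        append INDEX FILE-ID
        append/only INDEX SEARCH-WORDS
    ]
]

probe INDEX
```

Because "do" evaluates whatever is in the comment block, this approach suits only scripts you trust.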

3.15 Parsing on non-printable characters

In the area of simple text splitting, you can split on characters other than those you can type on a line of code. To specify a character by its hexadecimal code, use the "caret" notation as shown in the example below.

In the example below, the clipboard contains the results of a query from SQL Server. To load the clipboard this way, run a query specifying "results to grid," right-click the results, choose "select all," and then choose "copy with headers." Then run the sample program below. It will ask for the base part of a file name, read the clipboard, parse the clipboard on the carriage return and linefeed characters to get lines, parse each line on the horizontal tab character to get fields, and then assemble a CSV file with the name specified.

REBOL [
    Title: "Clipboard to CSV"
    Purpose: {Get a file name from the operator, a string of lines
    from the clipboard, and make the indicated CSV file from the
    clipped lines.}
]

CLIPBOARD-LINES: func [
    /local CLIPSTRING LINEBLOCK
] [
    LINEBLOCK: copy []
    CLIPSTRING: copy ""
    CLIPSTRING: read clipboard://
    LINEBLOCK: parse/all CLIPSTRING "^(0D)^(0A)"
    return LINEBLOCK
]

CSV-FILEID-X: none
CSV-FILEID: none
CSV-FILE: ""
CSV-REC: ""
FIELDCOUNT: 0
COMMACOUNT: 0
CREATE-FILE: does [
    ;; get-face returns the text typed in the field; reject a missing or blank name
    CSV-FILEID-X: get-face MAIN-FILEID
    if any [
        none? CSV-FILEID-X
        empty? trim copy CSV-FILEID-X
    ] [
        alert "No file ID specified"
        exit
    ]
    CSV-FILEID: to-file rejoin [
        trim CSV-FILEID-X
        ".csv"
    ]
    CSV-FILE: copy ""    ;; start fresh so a second click does not repeat earlier lines
    LINES: CLIPBOARD-LINES
    foreach LINE LINES [ 
        CSV-REC: copy "" 
        FIELDS: copy []
        FIELDS: parse/all LINE "^(09)" 
        FIELDCOUNT: length? fields
        COMMACOUNT: 0
        foreach FIELD FIELDS [
            append CSV-REC trim FIELD
            COMMACOUNT: COMMACOUNT + 1
            if lesser? COMMACOUNT FIELDCOUNT [
                append CSV-REC ","
            ]
        ]
        append CSV-REC newline
        append CSV-FILE CSV-REC
    ]
    write CSV-FILEID CSV-FILE
    alert "Done."
]

view center-face layout [
    across
    label "CSV filename (without the .csv)"
    return
    MAIN-FILEID: field 400
    return
    button "Create file" [CREATE-FILE]
    button "Quit" [quit]
]
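
The caret notation used in the script above can be tried on its own without the clipboard. The short sketch below compares the hexadecimal form of three characters with their shorthand escapes, then splits a made-up tab-separated line. The SAMPLE string is an assumption for illustration only.

```rebol
REBOL []

;; Each pair below names the same character in two ways.
print equal? "^(09)" "^-"       ;; horizontal tab
print equal? "^(0A)" "^/"       ;; linefeed
print equal? "^(0D)" "^M"       ;; carriage return

;; Hypothetical one-line sample with tab-separated fields.
SAMPLE: "ID^(09)NAME^(09)CITY"
FIELDS: parse/all SAMPLE "^(09)"
probe FIELDS
```

The hexadecimal form is handy when the character has no shorthand, or when you want the code to show the exact value you are splitting on.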

3.16 Finding strings that start with something

This example, which could be useful in several situations, shows a bit of what one is trying to accomplish with parse rules. In the example, we want to identify file names that start with certain characters. If the file name contains those characters elsewhere besides at the start, we don't care.

The parse rule can be read as saying that the data to be parsed must match "CV_Permits_" at the start and can then contain anything else up to the end.

Notice also that parsing might be a bit of overkill for this particular application since all we want to do is find "CV_Permits_" at the start of the name, which is done easily with the "find" function and its /match refinement.

Thanks to Chris of rebolforum.com for the guidance.

REBOL []

FILENAMES: [
    %CV_Permits_2-1-2018.txt
    %CV_Permits_2-4-2018.txt
    %Log_CV_Permits_2-1-2018.txt
    %Log_CV_Permits_2-4-2018.txt
]

foreach ID FILENAMES [ 
     if find/match ID "CV_Permits_" [ 
         print ["Process:" ID] 
     ] 
] 

print "--------------------------------"

foreach ID FILENAMES [ 
     if parse ID ["CV_Permits_" to end][ 
         print ["Process:" ID] 
     ] 
] 

halt

Here is the result of the above example.

Process: CV_Permits_2-1-2018.txt
Process: CV_Permits_2-4-2018.txt
--------------------------------
Process: CV_Permits_2-1-2018.txt
Process: CV_Permits_2-4-2018.txt
>>
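
To see why the rule needs "to end," it may help to try a few variations at the console. The sketch below reuses two of the sample names from above. The words NAME and LOG-NAME are made up for illustration.

```rebol
REBOL []

NAME: "CV_Permits_2-1-2018.txt"
LOG-NAME: "Log_CV_Permits_2-1-2018.txt"

;; The rule stops after the prefix, so unmatched input remains.
print parse NAME ["CV_Permits_"]

;; "to end" lets the rule reach the end of the input.
print parse NAME ["CV_Permits_" to end]

;; A literal string must match at the current position, here the start.
print parse LOG-NAME ["CV_Permits_" to end]

;; "thru" scans forward, so the characters may appear anywhere.
print parse LOG-NAME [thru "CV_Permits_" to end]
```

In REBOL 2 this should print false, true, false, and true, because "parse" reports success only when the rule consumes the whole input.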