Parsing by examples

    Date written: 14-SEP-2018
    Date revised: 14-SEP-2018

The "parse" function in REBOL can be a bit hard to understand. This document is not an attempt to explain it. Instead, it is a collection of examples that we hope will grow large enough that, if you have a question about parsing, you will be able to find another person's solution and modify it for your own use.

===Target audience and references

The target audience is the REBOL beginner who is having a terrible time using the "parse" function. It is assumed that the reader knows how to write and run REBOL scripts.

References

REBOL documentation about parsing

Nick Antonaccio's definitive guide about doing useful stuff with REBOL

http://www.codeconscious.com/rebol/parse-tutorial.html

===Introduction

Since this document is written for beginners by a beginner, some notes on parsing are in order.

Parsing refers to taking apart text. To a computer, text in memory or in a file is just a big string of characters, one after the other. It has no meaning. A good example is a computer program. How does a language interpreter make "sense" of the big string of characters that is a computer program? Consider a line like this:

    COUNTER = COUNTER + 1

What might a computer have to do to make sense of that? Basically, it would have to go through that text one character at a time. It would ignore the leading blanks, and when it found the first "C" it would store that somewhere. Then it would pick off characters, adding them one after the other to that first "C", until it hit the first blank. Then it would have a "token" with the value "COUNTER". It might then check a symbol table to see if that token had been encountered before, and make an entry in the table if "COUNTER" could not be found. Then it would skip over blanks to the next non-blank, which would be the equals sign.
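As a small taste of what is coming, here is a minimal sketch of that pick-off-the-tokens idea using parse. It assumes REBOL 2 (for parse/all), and the names LINE, TOKENS, TOKEN, WHITE, and NON-WHITE are made up for this illustration. A real interpreter would do far more than split on blanks.

```rebol
REBOL []

;; -- Hypothetical sample line with leading blanks.
LINE: "   COUNTER = COUNTER + 1"
TOKENS: copy []

;; -- WHITE matches a blank; NON-WHITE matches anything else.
WHITE: charset " "
NON-WHITE: complement WHITE

;; -- Skip runs of blanks; copy each run of non-blanks as one token.
parse/all LINE [
    any [
        some WHITE
        | copy TOKEN some NON-WHITE (append TOKENS TOKEN)
    ]
]

probe TOKENS
```

If this behaves as expected, it should show a block of five strings: COUNTER, =, COUNTER, +, and 1.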
Finding the equals sign might send the program off to some area where it would scan the input looking for items that constitute an "expression," which would be more tokens that are words, operators, numbers, and so on.

That operation of taking apart the text is referred to as parsing. You could do it in any language, but in some languages it would be hard. REBOL tries to make it easier by providing the "parse" function. But how do you do that in a general-purpose way that is useful? REBOL uses an embedded mini-language to describe what you want done with the input.

This document is a collection of parsing examples compiled with the hope that, with enough examples, you will either be able to understand the parse function or be able to find an example you can modify to solve your own parsing problem.

Remember this key concept as you read this: By a beginner for beginners.

===Examples

These examples are harvested from wherever they could be found. If someone else wrote them, credit is given.

---Finding input names on an html form

Let's say you have an html page with a form, and you want to write a program to process the form. The processing for one input item of the form is going to be similar to the processing for all items. You will have to check if the item has been filled in, check its length, check its value for validity, and so on. You could, in theory, generate code to do all that if only you could get your hands on the names of the input items. Let's say the form looked like this:

    ;; [     <p>Paragraph 1-1</p>                                                  ]
    ;; [     <p>Paragraph 2-1</p>                                                  ]
    ;; [                                                                           ]
    ;; [ ...or something equivalent.                                               ]
    ;; [---------------------------------------------------------------------------]

    ;; -- This is the sample input data that we will parse.
    IN-TEXT: {
    ===Heading one

    This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.

    This is a second paragraph that should have its own set of "p" tags.

    ===A second heading

    The above heading would be emitted with the "h1" tags.

    And here is a second paragraph under the second heading just to show things are working

    }

    ;; -- This will be the parsed input data with its html tags.
    HTML-OUT: copy ""

    ; anything but newlines
    content: complement charset "^/"

    scan-doc: func [text [string!]][
        ; parse/all for rebol 2
        parse/all text [
            any [
                newline
                | "===" opt " " copy part some content (
                    append HTML-OUT rejoin ["<h1>" part "</h1>" newline]
                )
                | copy part [some content any [newline some content]] (
                    append HTML-OUT rejoin ["<p>" part "</p>" newline]
                )
            ]
        ]
    ]

    ;; -- Parse IN-TEXT, mark it up, and append it to HTML-OUT.
    scan-doc IN-TEXT

    ;; -- Display the output and halt for probing.
    probe HTML-OUT
    halt

Running the above produces this:

    {<h1>Heading one</h1>
    <p>This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.</p>
    <p>This is a second paragraph that should have its own set of "p" tags.</p>
    <h1>A second heading</h1>
    <p>The above heading would be emitted with the "h1" tags.</p>
    <p>And here is a second paragraph under the second heading just to show things are working</p>
    }
    >>

The above example went straight from the input text to some html output. But taking some inspiration from the makedoc2 program, let's try something different. Parse the input text, but instead of going to html, store the parsed data in an intermediate block. The block will be repeating pairs of two things: an identifying word and a string of text identified by the word. In our case, when we parse a heading (with the three equal signs), we will add two items to our intermediate block. The first will be the word 'heading and the second will be the heading text. For other, non-marked text, we will generate, for each string delimited by a blank line, the word 'para and the text itself.

After we have parsed the input and made the temporary block, we will go through the temporary block and generate the html from that. Theoretically, structuring the code like this could allow us to have one module for parsing into the intermediate block, and then other modules for translating the intermediate block into several forms of output. This appears to have been the plan behind the makedoc2 program. The part that generates the html could be pulled out and replaced by a module that generates a pdf file, for example.

    REBOL []

    ;; -- This is the sample input data that we will parse.
    IN-TEXT: {
    ===Heading one

    This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.

    This is a second paragraph that should have its own set of "p" tags.

    ===A second heading

    The above heading would be emitted with the "h1" tags.

    And here is a second paragraph under the second heading just to show things are working

    }

    ;; -- This will be the parsed input data with its html tags.
    HTML-OUT: copy ""

    ;; -- This is the parsed data in an intermediate form.
    TEMP-STORAGE: copy []

    ; anything but newlines
    content: complement charset "^/"

    scan-doc: func [text [string!]][
        ; parse/all for rebol 2
        collect [
            parse/all text [
                any [
                    newline
                    | "===" opt " " copy part some content (
                        keep 'heading
                        keep part
                    )
                    | copy part [some content any [newline some content]] (
                        keep 'para
                        keep part
                    )
                ]
            ]
        ]
    ]

    ;; -- Parse IN-TEXT into the intermediate block.
    TEMP-STORAGE: scan-doc IN-TEXT
    probe TEMP-STORAGE
    print "-------------------------------"

    ;; -- Mark up each item of the intermediate block and append it to HTML-OUT.
    foreach [TAG TEXTLINE] TEMP-STORAGE [
        if equal? TAG 'heading [
            append HTML-OUT rejoin ["<h1>" TEXTLINE "</h1>" newline]
        ]
        if equal? TAG 'para [
            append HTML-OUT rejoin ["<p>" TEXTLINE "</p>" newline]
        ]
    ]
    print HTML-OUT
    halt

Running the above script produces this result:

    [heading "Heading one" para {This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.} para {This is a second paragraph that should have its own set of "p" tags.} heading "A second heading" para {The above heading would be emitted with the "h1" tags.} para {And here is a second paragraph under the second heading just to show things are working}]
    -------------------------------
    <h1>Heading one</h1>
    <p>This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.</p>
    <p>This is a second paragraph that should have its own set of "p" tags.</p>
    <h1>A second heading</h1>
    <p>The above heading would be emitted with the "h1" tags.</p>
    <p>And here is a second paragraph under the second heading just to show things are working</p>

    >>

So now, as an exercise that might or might not scale up to something bigger, let's try out that idea of using the intermediate block to generate some other form of output. Since we are testing on Windows, we will make a WORD document. The trick we will use is to generate a powershell script that makes the WORD document, and then run the powershell script as a separate step. The REBOL script could instead call the powershell script, but sometimes the calling operation does not work exactly as hoped.

Here is a script that works for our small example. That is, it worked with WORD 2013 in 2018. You might get different results. If you copy out this script to run it yourself, you will have to change the file name of the powershell script, and the file name of the word document, in the powershell code.

    REBOL []

    ;; -- This is the sample input data that we will parse.
    IN-TEXT: {
    ===Heading one

    This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.

    This is a second paragraph that should have its own set of "p" tags.

    ===A second heading

    The above heading would be emitted with the "h1" tags.

    And here is a second paragraph under the second heading just to show things are working

    }

    ;; -- This is the parsed data in an intermediate form.
    TEMP-STORAGE: copy []

    ; anything but newlines
    content: complement charset "^/"

    scan-doc: func [text [string!]][
        ; parse/all for rebol 2
        collect [
            parse/all text [
                any [
                    newline
                    | "===" opt " " copy part some content (
                        keep 'heading
                        keep part
                    )
                    | copy part [some content any [newline some content]] (
                        keep 'para
                        keep part
                    )
                ]
            ]
        ]
    ]

    ;; -- Parse IN-TEXT, put it into the intermediate form,
    ;; -- then use the intermediate form to generate a WORD document.
    TEMP-STORAGE: scan-doc IN-TEXT
    probe TEMP-STORAGE
    print "-------------------------------"

    ;; -- Generate the WORD document by generating a powershell script
    ;; -- that will produce the document. Cheating a bit.
    PS-HEAD: {
    $Word = New-Object -ComObject Word.Application
    $Word.Visible = $True
    $Document = $Word.Documents.Add()
    $Selection = $Word.Selection
    }
    PS-FOOT: {
    $Report = 'I:\ADocument.doc'
    $Document.SaveAs([ref]$Report,[ref]$SaveFormat::wdFormatDocument)
    $word.Quit()
    $null = [System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$word)
    [gc]::Collect()
    [gc]::WaitForPendingFinalizers()
    Remove-Variable word
    }
    PS-H1: {
    $Selection.Style = 'Title'
    $Selection.TypeText("<%WS-H1%>")
    $Selection.TypeParagraph()
    }
    PS-P: {
    $Selection.Style = 'Heading 1'
    $Selection.TypeText("<%WS-P%>")
    $Selection.TypeParagraph()
    }

    POWERSHELL-SCRIPT: ""
    POWERSHELL-SCRIPT-ID: %APowershellScript.ps1
    WS-H1: ""
    WS-P: ""

    append POWERSHELL-SCRIPT rejoin [
        PS-HEAD
        newline
    ]
    foreach [TAG TEXTLINE] TEMP-STORAGE [
        ;; -- Flatten each text item to one line and swap double quotes
        ;; -- for single quotes so it fits inside a powershell string.
        replace/all TEXTLINE newline " "
        replace/all TEXTLINE {"} "'"
        if equal? TAG 'heading [
            WS-H1: copy TEXTLINE
            append POWERSHELL-SCRIPT rejoin [
                build-markup PS-H1
                newline
            ]
        ]
        if equal? TAG 'para [
            WS-P: copy TEXTLINE
            append POWERSHELL-SCRIPT rejoin [
                build-markup PS-P
                newline
            ]
        ]
    ]
    append POWERSHELL-SCRIPT rejoin [
        PS-FOOT
        newline
    ]
    write POWERSHELL-SCRIPT-ID POWERSHELL-SCRIPT
    probe POWERSHELL-SCRIPT
    halt

Running the above script produces a powershell script that you would have to run separately, plus the following console output.

    [heading "Heading one" para {This is a paragraph of text under heading one.
    We would want it surrounded by the "p" tags.} para {This is a second paragraph that should have its own set of "p" tags.} heading "A second heading" para {The above heading would be emitted with the "h1" tags.} para {And here is a second paragraph under the second heading just to show things are working}]
    -------------------------------
    {
    $Word = New-Object -ComObject Word.Application
    $Word.Visible = $True
    $Document = $Word.Documents.Add()
    $Selection = $Word.Selection

    $Selection.Style = 'Title'
    $Selection.TypeText("Heading one")
    $Selection.TypeParagraph()

    $Selection.Style = 'Heading 1'
    $Selection.TypeText("This is a paragraph of text under heading one. We would want it surrounded by the 'p' tags.")
    $Selection.TypeParagraph()

    $Selection.Style = 'Heading 1'
    $Selection.TypeText("This is a second paragraph that should have its own set of 'p' tags.")
    $Selection.TypeParagraph()

    $Selection.Style = 'Title'
    $Selection.TypeText("A second heading")
    $Selection.TypeParagraph()

    $Selection.Style = 'Heading 1'
    $Selection.TypeText("The above heading would be emitted with the 'h1' tags.")
    $Selection.TypeParagraph()

    $Selection.Style = 'Heading 1'
    $Selection.TypeText("And here is a second paragraph under the second heading just to show things are working")
    $Selection.TypeParagraph()

    $Report = 'I:\ADocument.doc'
    $Document.SaveAs([ref]$Report,[ref]$SaveFormat::wdFormatDocument)
    $word.Quit()
    $null = [System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$word)
    [gc]::Collect()
    [gc]::WaitForPendingFinalizers()
    Remove-Variable word

    }
    >>

---Log parsing example 2

In a very specific situation nobody else will ever encounter, we have a bunch of log files created for personal logging and time reporting. The format was invented in the days of punch cards and has multi-line log entries delimited by lines with a dollar sign in the first position. A need arose to scan several files in one operation, and so a way was needed to parse out the individual log entries.
The details are explained in the sample script below, which parses some hard-coded sample text. This example has very little general-purpose value, but it does show how one could parse multiple-line pieces of text if one can find some pattern that marks off the pieces.

    REBOL []

    ;; [---------------------------------------------------------------------------]
    ;; [ This is a sample program from a very specific situation where a person    ]
    ;; [ made up a format for personal log files and then after keeping logs for   ]
    ;; [ many years wanted to go back through them and search all entries for      ]
    ;; [ some key words.                                                           ]
    ;; [ A log file is a text file with repetitions of entries that look like      ]
    ;; [ this:                                                                     ]
    ;; [     $ TL (service-request-number) mm/dd/yyyy (hours) (activity-code)      ]
    ;; [     Multiple-line log text                                                ]
    ;; [     $ ENDTL                                                               ]
    ;; [ All we want to do is something simple; parse on "$ TL" through "$ ENDTL"  ]
    ;; [ and pick out the text in between. With strings of text in hand,           ]
    ;; [ we can scan each for key words and report those we find.                  ]
    ;; [---------------------------------------------------------------------------]

    LOG-TEXT: {
    $ CO Monday
    $ TL 9998 09/10/2018 7.5 MA
    Do a little of this and that.
    File a support case.
    $ ENDTL
    $ CO Tuesday
    $ TL 9998 09/11/2018 7.5 MA
    Do a bunch of coding.
    Attend a meeting.
    $ ENDTL
    }

    LOG-ENTRIES: copy []

    ;; -- For each entry: skip past "$ TL", copy everything up to "$ ENDTL".
    parse LOG-TEXT [
        any [thru "$ TL" copy ENTRY to "$ ENDTL" (append LOG-ENTRIES ENTRY)]
        to end
    ]

    probe LOG-ENTRIES

    halt

Running the script produces this result.

    [{ 9998 09/10/2018 7.5 MA
    Do a little of this and that.
    File a support case.
    } { 9998 09/11/2018 7.5 MA
    Do a bunch of coding.
    Attend a meeting.
    }]
    >>

---Picking off a comment block

As another variant of the previous idea of parsing off multi-line items, this example picks out a comment block from a coding language that uses comment blocks, in this case T-SQL. In the example, the front of the script has a comment block delimited by /* and */, and the comments are in a REBOL-readable format.
So, after we parse off the comment block, we can "load" it and work with the data items in the comment block. We could, for example, put a comment block on each script in a script library, and then parse them all to build an index of all scripts.

    REBOL []

    ;; [---------------------------------------------------------------------------]
    ;; [ This sample parses off a comment block from the front of an sql script.   ]
    ;; [ If the comment block is formatted in a r-e-b-o-l readable format,         ]
    ;; [ data in the comment block could be used for indexing.                     ]
    ;; [---------------------------------------------------------------------------]

    SCRIPTCODE: {
    /*
    AUTHOR: "sww"
    DATE-WRITTEN: 01-JAN-1900
    DATABASE: "accela"
    SEARCH-WORDS: [crlf]
    REMARKS: {Replace crlf in comments.
    }
    */
    SELECT REPLACE(ANY_COMMENTS, CHAR(13)+CHAR(10), ' ')
    FROM dbo.TEMPSWW
    }

    COMMENTBLOCK: copy ""

    ;; -- Skip past the opening "/*" and copy everything up to the closing "*/".
    parse/case SCRIPTCODE [thru "/*" copy COMMENTBLOCK to "*/"]

    if greater? (length? COMMENTBLOCK) 0 [
        AUTHOR: none
        DATE-WRITTEN: none
        DATABASE: none
        SEARCH-WORDS: none
        REMARKS: none
        ;; -- The comment block is valid REBOL, so evaluating it sets the words.
        do load COMMENTBLOCK
    ]

    probe AUTHOR
    probe DATE-WRITTEN
    probe DATABASE
    probe SEARCH-WORDS
    probe REMARKS

    halt

Running the above produces this:

    "sww"
    1-Jan-1900
    "accela"
    [crlf]
    "Replace crlf in comments.^/"
    >>

---Parsing on non-printable characters

In the area of simple text splitting, you can split on characters other than those you can type on a line of code. To specify a character by its hexadecimal code, use the "caret" notation as shown in the example below.

In the example below, the clipboard contains the results of a query from SQL Server. The way to load the clipboard in this manner is to run a query specifying "results to grid." Then right-click the results and "select all," then "copy with headers." This loads the clipboard. Then run the sample program below.
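Before involving the clipboard, the caret notation can be tried on a hard-coded string. This is a minimal sketch, assuming REBOL 2; SAMPLE-LINE and its contents are invented for illustration.

```rebol
REBOL []

;; -- Hypothetical sample line: three fields separated by tab characters,
;; -- written as ^(09), the hexadecimal code for a horizontal tab.
SAMPLE-LINE: "Permit number^(09)Issue date^(09)Status"

;; -- In REBOL 2, parse/all with a string of delimiters splits the input
;; -- on those characters only and returns a block of strings.
FIELDS: parse/all SAMPLE-LINE "^(09)"

probe FIELDS
```

The result should be a block of three strings, with the blanks inside "Permit number" and "Issue date" left alone because of the /all refinement.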
The sample program asks for the base part of a file name, reads the clipboard, parses the clipboard on the carriage return and linefeed characters to get lines, parses each line on the horizontal tab character to get fields, and then assembles a CSV file with the name specified.

    REBOL [
        Title: "Clipboard to CSV"
        Purpose: {Get a file name from the operator, a string
        of lines from the clipboard, and make the indicated CSV
        file from the clipped lines.}
    ]

    ;; -- Read the clipboard and split it into lines on CR (0D) and LF (0A).
    CLIPBOARD-LINES: func [
        /local CLIPSTRING LINEBLOCK
    ] [
        LINEBLOCK: copy []
        CLIPSTRING: copy ""
        CLIPSTRING: read clipboard://
        LINEBLOCK: parse/all CLIPSTRING "^(0D)^(0A)"
        return LINEBLOCK
    ]

    CSV-FILEID-X: none
    CSV-FILEID: none
    CSV-FILE: ""
    CSV-REC: ""
    FIELDCOUNT: 0
    COMMACOUNT: 0

    CREATE-FILE: does [
        ;; -- Reject a missing or blank file name before doing anything else.
        CSV-FILEID-X: get-face MAIN-FILEID
        if any [none? CSV-FILEID-X  empty? trim copy CSV-FILEID-X] [
            alert "No file ID specified"
            exit
        ]
        CSV-FILEID: to-file rejoin [
            trim CSV-FILEID-X
            ".csv"
        ]
        ;; -- Start with an empty file image each time the button is pressed.
        CSV-FILE: copy ""
        LINES: CLIPBOARD-LINES
        foreach LINE LINES [
            CSV-REC: copy ""
            FIELDS: copy []
            ;; -- Split the line into fields on the tab character (09).
            FIELDS: parse/all LINE "^(09)"
            FIELDCOUNT: length? FIELDS
            COMMACOUNT: 0
            foreach FIELD FIELDS [
                append CSV-REC trim FIELD
                COMMACOUNT: COMMACOUNT + 1
                ;; -- Put a comma after every field except the last.
                if lesser? COMMACOUNT FIELDCOUNT [
                    append CSV-REC ","
                ]
            ]
            append CSV-REC newline
            append CSV-FILE CSV-REC
        ]
        write CSV-FILEID CSV-FILE
        alert "Done."
    ]

    view center-face layout [
        across
        label "CSV filename (without the .csv)"
        return
        MAIN-FILEID: field 400
        return
        button "Create file" [CREATE-FILE]
        button "Quit" [quit]
    ]

---Finding strings that start with something

This example, which could be useful in several situations, shows a bit of what one is trying to accomplish with parse rules. In the example, we want to identify file names that start with certain characters. If the file name contains those characters elsewhere besides at the start, we don't care.
The parse rule can be read as meaning that the data to be parsed must match "CV_Permits_" at the start but then can contain anything else after that up to the end.

Notice also that parsing might be a bit of overkill for this particular application, since all we want to do is find "CV_Permits_" at the start of the name, which is done easily with the find/match function.

Thanks to Chris of rebolforum.com for the guidance.

    REBOL []

    FILENAMES: [
        %CV_Permits_2-1-2018.txt
        %CV_Permits_2-4-2018.txt
        %Log_CV_Permits_2-1-2018.txt
        %Log_CV_Permits_2-4-2018.txt
    ]

    ;; -- First approach: find/match succeeds only on a match at the start.
    foreach ID FILENAMES [
        if find/match ID "CV_Permits_" [
            print ["Process:" ID]
        ]
    ]

    print "--------------------------------"

    ;; -- Second approach: a parse rule that requires the prefix,
    ;; -- then accepts anything up to the end.
    foreach ID FILENAMES [
        if parse ID ["CV_Permits_" to end][
            print ["Process:" ID]
        ]
    ]

    halt

Here is the result of the above example.

    Process: CV_Permits_2-1-2018.txt
    Process: CV_Permits_2-4-2018.txt
    --------------------------------
    Process: CV_Permits_2-1-2018.txt
    Process: CV_Permits_2-4-2018.txt
    >>
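If you do use parse for this, one thing it offers beyond find/match is picking out pieces of the name along the way. The following is a minimal sketch under the assumption that every file name of interest ends in ".txt"; the word DATE-PART is invented for this illustration and is not part of the original example.

```rebol
REBOL []

FILENAMES: [
    %CV_Permits_2-1-2018.txt
    %Log_CV_Permits_2-1-2018.txt
]

foreach ID FILENAMES [
    ;; -- DATE-PART is a hypothetical word that receives the text
    ;; -- between the prefix and the ".txt" ending.
    DATE-PART: none
    ;; -- Convert the file name to a string so the copied piece
    ;; -- is a plain string too.
    if parse/all to-string ID ["CV_Permits_" copy DATE-PART to ".txt" to end] [
        print ["Process:" ID "dated" DATE-PART]
    ]
]

halt
```

Only the first name should be reported, along with the "2-1-2018" piece of it. The second name fails the rule at the very first step because it does not start with "CV_Permits_".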