Prolog DCG "distilled" - Part 1
Introduction
Here’s a new series of short articles that concentrate on one powerful tool within Prolog Programming, “Definite Clause Grammars”. These are covered in a number of tutorials on the web (refer to the information in the appendix).
However I found a step by step approach, "distilled" with tests, to be the most appropriate learning path, as the topic might otherwise be a bit daunting for newcomers to comprehend.
The context where I personally used the technology is for parsing application server logs and parsing programming languages such as Java.
However I found a step by step approach, "distilled" with tests, to be the most appropriate learning path, as the topic might otherwise be a bit daunting for newcomers to comprehend.
The context where I personally used the technology is for parsing application server logs and parsing programming languages such as Java.
A number of problems arise when parsing text, such as:
- the ability to reuse generic structures such as lists and options to structure grammars
- the ease of capturing the elements recognized by the parser, and maintaining state across several rules (productions)
- the handling of ambiguous contexts when looking up text tokens (“lookahead”)
- the handling of rules with left recursion
The effectiveness of the Prolog DCG comes from offering a great deal of control over such aspects.
Forthcoming articles
- Building your own vocabulary to structure grammars, covered in this article and the next.
- Capturing and assembling textual elements from rules.
- Handling conflicting rules with lookahead information.
- Handling left recursion.
- Parsing Xml Files
- Parsing Yaml Files
- Parsing Java 8 Code
Defining reusable building blocks for grammars
The code and tests for this article may be found [here]
The vocabulary defines 4 reusable DCG rules, which may be combined with other rules as follows:
hat(H) --> {H='hat'}. % terminal (atom)
hut(H) --> {H='hut'}. % terminal (atom)
chapeau(H) --> ( % grammar production
hat(H)
| hut(H)
).
chapeaux(Cs) --> list_of(chapeau, Cs). % list 0.. chapeaux
chap_min(Cs) --> list_min1(chapeau, Cs). % list 1.. chapeaux
chap_sep(Cs) --> list_sep(chapeau, ',', Cs).% list 1.. chapeaux with separator
chap_opt(C) --> maybe(chapeau, C). % optional chapeau
One of the tests may be run in SWI-Prolog with those steps:
?- consult('/path/to/dcg_vocabulary.pl').
true.
?- test_list_of_2(Cs).
Cs = [hat, hut];
false.
This technique of combining DCGs to build a common vocabulary (list_of, list_sep, etc... ) is an important step to structure more complex grammars for any substantial parsing work.
?- test_list_of_2(Cs).
Cs = [hat, hut];
false.
Defining reusable building blocks for parsing strings
Similar to the previous rules designed to recognize lists of atoms such as ‘hat’, ‘hut’ etc…, one can define the following 4 rules (conc_of, conc_min1, conc_sep, maybe_str) to recognize concatenation of strings (in SWI-Prolog strings can be converted to lists of character codes thanks to the string_codes/2 predicate).
The conc_sep rule presented below is slightly more generic than list_sep, as it accepts a reference to a rule defining which sequence of characters constitutes a separator:
bird1(B) --> "cuckoo", { string_codes("cuckoo", B) }. % codes are [99, 117, 99, 107, 111, 111]
bird2(B) --> "sparrow", { string_codes("sparrow", B) }. % codes are [115…]
birdie(B) --> ( bird1(B) | bird2(B) ). % grammar production
wspace(W) --> {W=32}. % code is 32
% separator is either the comma or a sequence of whitespaces
separ(C) --> (
list_min1(wspace, C)
| ",", {string_codes(",", C)}
).
birdies(Bs) --> conc_sep(birdie, separ, Bs). % a list (concatenation) of birds, with a separator
bird_opt(B) --> maybe_str(birdie, B). % a bird or nothing
In the next article we'll see in detail how this works, with the help of the SWI-Prolog debugger.
Additional Reading
The conc_sep rule presented below is slightly more generic than list_sep, as it accepts a reference to a rule defining which sequence of characters constitutes a separator:
bird1(B) --> "cuckoo", { string_codes("cuckoo", B) }. % codes are [99, 117, 99, 107, 111, 111]
bird2(B) --> "sparrow", { string_codes("sparrow", B) }. % codes are [115…]
birdie(B) --> ( bird1(B) | bird2(B) ). % grammar production
wspace(W) --> {W=32}. % code is 32
% separator is either the comma or a sequence of whitespaces
separ(C) --> (
list_min1(wspace, C)
| ",", {string_codes(",", C)}
).
birdies(Bs) --> conc_sep(birdie, separ, Bs). % a list (concatenation) of birds, with a separator
bird_opt(B) --> maybe_str(birdie, B). % a bird or nothing
The predicates presented here are analogous to those offered by this SWI Prolog library: dcg/high_order
This article covers EBNF grammar syntax and support with library dcg4pt:(PDF)
Comments
Post a Comment