Extensible Xml Diff with SWI-Prolog

The previous article presented 2 alternatives for calculating the diff predicates, such as

diff(on(//'NEST', @type), from(//bird, @name)).

using a common strategy for matching objects, and a different strategy for matching sub-nodes of these objects, in the tree_d2_map and tree_d2_lcs implementations respectively.

This approach led however to some code duplication, and thus to the increase of the number of predicates which one might have to write to produce a diff.

The present article now introduces a generic diff implementation that works as a “swiss-army knife”, enabling on-the-spot combinations of node set matching strategies.

Resources for this article

Prolog source code and XML examples are available on [github], and are linked to this article.

To help enhance the readability of the predicates’ output, some Xml elements below are sometimes “ellipsed” with a “…” .

A new code structure for decoupling things

This solution relies on the SWI Prolog shift/reset predicates, which support the concept of “delimited continuations”.

This style of programming enables the definition of cooperating Prolog predicates as co-routines which have the capability to ask for or yield Xml Node Set variables, and to interrupt themselves for that exchange.

The diff implementation now becomes an orchestrating predicate, which coordinates how these co-routines (on() and from()) exchange information. The dialog is based on 4 “message” terms and predicates reproduced here

ask_object_match(SE1, SE2) :-

shift(ask_object_match(SE1, SE2)).

yield_object_match(SE1, SE2) :-

shift(yield_object_match(SE1, SE2)).

yield_subnode_match(SE1, SE2, B1, B2) :-

shift(yield_subnode_match(SE1, SE2, B1, B2)).

yield_no_subnode_match(SE1, SE2, B) :-

shift(yield_no_subnode_match(SE1, SE2, B)).

whereby SE1 and SE2 refer to some Xml Object match and B1, B2 or B to some Sub Node match.

The whole “message-passing” technique is in fact based on Prolog unification, happening within the confines of the diff predicates itself, between the “shifted” ask and yield terms:

Term1 = ask_object_match(SE1, SE2) ->

reset(ObjectMatcher,Term2,ObjectMatcher1),

( Term2 == 0 ->

SE1 = eof,

SE2 = eof,

% unfinished goal from SubNodeMatcher

call(SubNodeMatcher1)

; Term2 = yield_object_match(SE1, SE2) ->

% SE1 and SE2 now unify with ask_object_match(SE1, SE2), recursion with unfinished goals
diff(ObjectMatcher1,SubNodeMatcher1)

)

The reset API enables the diff predicate to control at which “point” (ObjectMatcher1, SubNodeMatcher1) the cooperating predicates on() and from() might be resumed, after yielding (“shifting”) their terms.

This happens either through the use of the call predicate or through the recursive invocation of diff itself.

Stepping back from the code

Whew ! I must confess this technique (coroutines) took me a little while at first to getting used to.

However one realizes quickly the huge benefit and elegance for bigger Prolog code bases, as:

The code is now decoupled into various collaborating predicates (here on and from) which know absolutely nothing about each other’s implementation
The collaboration (in diff) works through the use of defined variables and “messaging” terms (API or “interface”)
A different collaboration/orchestration altogether is possible without the need for rewriting any of the collaborators, provided that the “messaging” API / interface holds
The various predicates can be combined on the spot, to define a flexible domain specific language:

% match branch objects, find diff on bird level

diff(on(//branch, @height), from(//bird, @name))

% match NEST objects using branch and NEST attributes, find diff on bird level

diff(on(//branch, @height, //’NEST’, @type), from(//bird, @name))

etc…

Overcoming some limits of the previous solution

The map-based algorithm presented in the previous article is based on an identifier for sub-nodes (e.g. @name), that enables a matcher predicate to find nodes to associate for the diff (yield_subnode_match).

However this solution falls short in the case where these identifiers are not unique at the level of the matched sub-nodes. In this case is the actual navigation path from the matched object to the matched sub-node required, to uniquely distinguish between these.

In a sense can this problem be seen as a reverse problem of the xpath predicate:

determine a list of xpath-like expressions from a parent to select a given sub-node, such as:

/bird(3) from NEST to bird

/’NEST’(1)/bird(2) from branch to bird

/bird(2, @name=’huhu’) from NEST to bird

/’NEST’(1, @type=‘owl-nest‘)/bird(2, @name=’huhu’) from branch to bird

This calculation must happen in a top-down fashion, as each Xml element holds a list of its child nodes. This expression is in this example determined via a recursive navig_xpath predicate, which performs a depth-first search for each given Object sub-tree, whilst maintaining the path tried so far as an accumulator list.

This predicate uses the difference list technique (see for example [Clause and Effect Worksheet 29]) to collect search results as a list whilst walking through the tree structure.

There are actually 3 versions incrementally defined in the effective_xml_diff module, for the sake of seeing step by step the impact of small additional requirements.

The first version does not calculate the relative indexes of visited nodes along the path.

The second version uses the indexl2 predicate to calculate these indexes. It uses a list (L) to keep track of the last index for any visited node.

The third version uses the indexl3 predicate to calculate the index of the goal node based on which attribute (such as name) defines its identity with regards to its parent (“relative” identity).

These versions are shown in the following test suite: [navig_xpath_t.pro]

Running the Diff now correctly compares the birds based on their relative identity,

the first bird named ‘huhu’ is compared to its image in the other tree, the same for the second ‘huhu’ (for which there’s no actual difference):

?- diff(on(//'NEST', @type), from(//bird, @name)).
% no matching NEST in tree2

not found: cuckoos-nest

% owli is one month younger in tree1 compared to tree2

[element(NEST,….,

element(NEST,….,

element(bird,[name=owli,px_order=2.1,type=owl,age=2],[]),

element(bird,[name=owli,px_order=2.1,type=owl,age=3],[])]

true ;

% young huhu is one month younger in tree1 compared to tree2

[element(NEST,….,

element(NEST,….,

element(bird,[name=huhu,px_order=2.1,type=owl,age=3],[]),

element(bird,[name=huhu,px_order=2.1,type=owl,age=4],[])]

true ;

% cucko is lighter in tree1 compared to tree2

[element(NEST,….,

element(NEST,….,

element(bird,[name=cucko,px_order=2.1,type=cuckoo,age=6],[,element(weight,[],[400]),]),

element(bird,[name=cucko,px_order=2.1,type=cuckoo,age=6],[,element(weight,[],[411]),])]

true ;

% jack has left his nest

[element(NEST,….,

element(NEST,….,

element(bird,[name=jack,px_order=3.1,type=sparrow,age=12],[])]

true ;

% junior is born

[element(NEST,....,

element(NEST,….,

element(bird,[name=junior,px_order=3.1,type=sparrow,age=1],[])]

true ;

% there’s a new nest with woody

[eof,

element(NEST,…,

element(bird,[name=woody,age=10],[])]

true ;

false.

Running the Diff as follows:

?- diff(on(//'NEST', @type), from(//bird)).

yields a difference with the preceding output

[element(NEST,…,

element(NEST,…,

element(bird,[name=jack,px_order=3.1,type=sparrow,age=12],[]),

element(bird,[name=junior,px_order=3.1,type=sparrow,age=1],[])]

as jack and junior are compared based on their “position” in the nest.

Conclusion

The extensibility of the last solution presented comes at a price, as it is undoubtedly more complex to comprehend. However its adaptability to tackle various Xml comparison scenarios is essential to make it a useful diff toolkit.

Search This Blog

function_missing # lost in complexity