Meaningful Xml Diffs with SWI-Prolog

Introduction

Building on the previous article I wrote on using the SWI Prolog Xpath predicate for analyzing Xml Node Set relationships, I now describe a way to produce a tree oriented diff based on similar techniques.

The goal of such a diff is to identify objects of 2 Xml trees which differ by a given sub-node.

In a sense is that diff not a fully general diff, as it focuses on 2 or 3 levels:

object
sub-or-descendant node resp. property
when required, property of the sub-or-descendant node

specified by the user, within the Xml tree structure.

Another important goal of this diff is to be able to accommodate a different ordering of these objects in the Xml structure, which is normally a constraint which most editors or developer tools can’t cope with.

In the following sections I will again use a similar fictitious tree example as in the precedent article.

Resources for this article

Prolog source code and XML examples are available on [github], and are linked to this article.

To help enhance the readability of the predicates’ output, some Xml elements below are sometimes “ellipsed” with a “…” .

Specifying a diff of nests for which birds elements are changed

Birds change events in this example will be:

a bird has a changed attribute – e.g. age or weight (this is a recursive definition which applies on child or descendant nodes)
a bird is new in the nest, given the same nest in the original tree
a bird has left the nest, compared to the same nest in the original tree

Fallen or picked up oranges are ignored.

To describe this diff I will use queries such as the following being applied to the Xml original (tree1) and modified (tree2) trees:

diff(on(//’NEST’,@type), from(//bird,@name))

diff(on(//’NEST’,@type), from(//bird))

diff(on(//’NEST’), from(//bird))

with the following semantics:

match the nests in the original and modified Xml source, using their type property resp. their root tag & attributes
match the birds for each of the matched nests, using their name property resp. their root tag & attributes
show common birds with some differences, and new or deleted birds

In this example is the following output expected:

?- diff(on(//'NEST', @type), from(//bird, @name)).

% no matching NEST in tree2

not found: cuckoos-nest

N1 = element('NEST', [type='cuckoos-nest'], []) ;

% owli is one month younger in tree1 compared to tree2

N1 = element('NEST', [type='owl-nest'], …,

N2 = element('NEST', [type='owl-nest'], …,

B = element(bird, [name=owli, type=owl, age='2'], []) ;

% cucko is lighter in tree1 compared to tree2

N1 = element('NEST', [type='owl-nest'], …,

N2 = element('NEST', [type='owl-nest'], …,

B = element(bird, [name=cucko, type=cuckoo, age='6'], [element(weight, [], ['400'])]) ;

% owli is one month older in tree2 compared to tree1

N1 = element('NEST', [type='owl-nest'], …,

N2 = element('NEST', [type='owl-nest'], …,

B = element(bird, [name=owli, type=owl, age='3'], []) ;

% cucko is heavier in tree2 compared to tree1

N1 = element('NEST', [type='owl-nest'], …,

N2 = element('NEST', [type='owl-nest'], …,

B = element(bird, [name=cucko, type=cuckoo, age='6'], [element(weight, [], ['411'])]) ;

% jack has left his nest

N1 = element('NEST', [type='sparrow-nest'], …,

N2 = element('NEST', [type='sparrow-nest'], …,

B = element(bird, [name=jack, type=sparrow, age='12'], []) ;

% junior is born

N1 = element('NEST', [type='sparrow-nest'], …,

N2 = element('NEST', [type='sparrow-nest'], …,

B = element(bird, [name=junior, type=sparrow, age='1'], []) ;

% there’s a new nest with woody

N2 = element('NEST', [type='woodpecker-nest'], …,

B = element(bird, [name=woody, age='10'], []) ;

false.

Implementing the diff solution based on SWI-Prolog Xpath predicates

The above output can be calculated thanks to the tree_d2_map predicate, as follows:

% diff(on(//’NEST’,@type), from(//bird,@name))

?- tree_d2_map(//'NEST', /self(@type), //bird, /self(@name), N1, N2, B).

% diff(on(//’NEST’,@type), from(//bird))

?- tree_d2_map(//'NEST', /self(@type), //bird, /self, N1, N2, B).

This predicate is using unification on the U1 variable, to match tree1 and tree2 NEST nodes,

as 2 synchronized Node Sets referenced by the SE1 and SE2 variables.

3 strategies came to my mind, for comparing the list of sub-nodes (birds) for the matched nests:

using the “longuest common subsequence” a.k.a. “LCS” to handle the case where some identical sub-nodes are present for a given nest … I realized that this might not be the most meaningful technique to use in the context of the tree based diff, as some intermediary nodes might play a role in the distribution resp. grouping of the sub-nodes, so LCS applied on lists of sub-nodes are not in every case adequate

building xpath-like expressions representing the navigation path from matched objects down to their sub-nodes, and unify these paths to match sub-nodes of tree 1 and 2 ; this strategy will be investigated in a next article, along with an interesting way to structure the code for several diff strategies

using a map (bird_id or bird_tag => bird node) to limit comparisons to nodes with the same “identity”

this latter solution is used in tree_d2_map and shows a significantly better performance than with the LCS solution (tree_d2_lcs) when applied on really big Xml files.

This map is based on Prolog dynamic predicates…foundbird1 and foundbird2. Their key can be either:

an atom, such as ‘owli’,
or an element e.g. element(bird, [name=woody, age='10'], [])

depending on the xpath expression passed to tree_d2_map.

The map is actually initialized by a side effect, using a backtracking expression encapsulated by an ignore predicate, equivalent to a Prolog findall.

Another important aspect with Prolog’s pattern matching is that one can very simply compare sub-trees for equality, using e.g.:

B \= B2

which allows one to use write/reuse very general solutions with “node as value” comparisons.

Implementing the diff predicate

Right now the diff predicate is simply written as a paraphrase of tree_d2_map.

In a next article I will demonstrate the use of SWI Prolog shift/reset to implement a more generic diff based on a reusable code structure that works with multiple Node Set match strategies.

Search This Blog

function_missing # lost in complexity