Meaningful Xml Diffs with SWI-Prolog

Introduction

Building on the previous article I wrote on using the SWI Prolog Xpath predicate for analyzing Xml Node Set relationships, I now describe a way to produce a tree oriented diff based on similar techniques.

The goal of such a diff is to identify objects of 2 Xml trees which differ by a given sub-node.
In a sense is that diff not a fully general diff, as it focuses on 2 or 3 levels:
  • object
  • sub-or-descendant node resp. property
  • when required, property of the sub-or-descendant node
specified by the user, within the Xml tree structure.

Another important goal of this diff is to be able to accommodate a different ordering of these objects in the Xml structure, which is normally a constraint which most editors or developer tools can’t cope with.

In the following sections I will again use a similar fictitious tree example as in the precedent article.

Resources for this article

Prolog source code and XML examples are available on [github], and are linked to this article.

To help enhance the readability of the predicates’ output, some Xml elements below are sometimes “ellipsed” with a “…” .


Specifying a diff of nests for which birds elements are changed 


Birds change events in this example will be:
  • a bird has a changed attribute – e.g. age or weight (this is a recursive definition which applies on child or descendant nodes)
  • a bird is new in the nest, given the same nest in the original tree
  • a bird has left the nest, compared to the same nest in the original tree
Fallen or picked up oranges are ignored.

To describe this diff I will use queries such as the following being applied to the Xml original (tree1) and modified (tree2) trees:

diff(on(//’NEST’,@type), from(//bird,@name))
diff(on(//’NEST’,@type), from(//bird))
diff(on(//’NEST’), from(//bird))

with the following semantics:
  • match the nests in the original and modified Xml source, using their type property resp. their root tag & attributes
  • match the birds for each of the matched nests, using their name property resp. their root tag & attributes
  • show common birds with some differences, and new or deleted birds
In this example is the following output expected:

?- diff(on(//'NEST', @type), from(//bird, @name)).
% no matching NEST in tree2
not found: cuckoos-nest
N1 = element('NEST', [type='cuckoos-nest'], []) ;
% owli is one month younger in tree1 compared to tree2
N1 = element('NEST', [type='owl-nest'], …,
N2 = element('NEST', [type='owl-nest'], …,
B = element(bird, [name=owli, type=owl, age='2'], []) ;
% cucko is lighter in tree1 compared to tree2
N1 = element('NEST', [type='owl-nest'], …,
N2 = element('NEST', [type='owl-nest'], …,
B = element(bird, [name=cucko, type=cuckoo, age='6'], [element(weight, [], ['400'])]) ;
% owli is one month older in tree2 compared to tree1
N1 = element('NEST', [type='owl-nest'], …,
N2 = element('NEST', [type='owl-nest'], …,
B = element(bird, [name=owli, type=owl, age='3'], []) ;
% cucko is heavier in tree2 compared to tree1
N1 = element('NEST', [type='owl-nest'], …,
N2 = element('NEST', [type='owl-nest'], …,
B = element(bird, [name=cucko, type=cuckoo, age='6'], [element(weight, [], ['411'])]) ;
% jack has left his nest
N1 = element('NEST', [type='sparrow-nest'], …,
N2 = element('NEST', [type='sparrow-nest'], …,
B = element(bird, [name=jack, type=sparrow, age='12'], []) ;
% junior is born
N1 = element('NEST', [type='sparrow-nest'], …,
N2 = element('NEST', [type='sparrow-nest'], …,
B = element(bird, [name=junior, type=sparrow, age='1'], []) ;
% there’s a new nest with woody
N2 = element('NEST', [type='woodpecker-nest'], …,
B = element(bird, [name=woody, age='10'], []) ;
false.

Implementing the diff solution based on SWI-Prolog Xpath predicates


The above output can be calculated thanks to the tree_d2_map predicate, as follows:

% diff(on(//’NEST’,@type), from(//bird,@name))
?- tree_d2_map(//'NEST', /self(@type), //bird, /self(@name), N1, N2, B).

% diff(on(//’NEST’,@type), from(//bird))
?- tree_d2_map(//'NEST', /self(@type), //bird, /self, N1, N2, B).

This predicate is using unification on the U1 variable, to match tree1 and tree2 NEST nodes,
as 2 synchronized Node Sets referenced by the SE1 and SE2 variables.

3 strategies came to my mind, for comparing the list of sub-nodes (birds) for the matched nests:
  • using the “longuest common subsequence” a.k.a. “LCS” to handle the case where some identical sub-nodes are present for a given nest … I realized that this might not be the most meaningful technique to use in the context of the tree based diff, as some intermediary nodes might play a role in the distribution resp. grouping of the sub-nodes, so LCS applied on lists of sub-nodes are not in every case adequate
  • building xpath-like expressions representing the navigation path from matched objects down to their sub-nodes, and unify these paths to match sub-nodes of tree 1 and 2 ; this strategy will be investigated in a next article, along with an interesting way to structure the code for several diff strategies
  • using a map (bird_id or bird_tag => bird node) to limit comparisons to nodes with the same “identity”
this latter solution is used in tree_d2_map and shows a significantly better performance than with the LCS solution (tree_d2_lcs) when applied on really big Xml files.

This map is based on Prolog dynamic predicates…foundbird1 and foundbird2. Their key can be either:
  • an atom, such as ‘owli’, 
  • or an element e.g. element(bird, [name=woody, age='10'], [])
depending on the xpath expression passed to tree_d2_map.

The map is actually initialized by a side effect, using a backtracking expression encapsulated by an ignore predicate, equivalent to a Prolog findall.

Another important aspect with Prolog’s pattern matching is that one can very simply compare sub-trees for equality, using e.g.:


which allows one to use write/reuse very general solutions with “node as value” comparisons.

Implementing the diff predicate


Right now the diff predicate is simply written as a paraphrase of tree_d2_map.

In a next article I will demonstrate the use of SWI Prolog shift/reset to implement a more generic diff based on a reusable code structure that works with multiple Node Set match strategies.

Comments

Popular posts from this blog

A possible solution for using the janus interface on MacOS with docker

Prolog DCG "distilled" - Part 3 - Capturing and assembling textual elements from rules