Skip to content

Primer: XPath Query Evaluation

ctsims edited this page Mar 29, 2017 · 10 revisions

This document is a primer on XPath evaluation basics to help understand how they are implemented and optimized in CommCare.

XPath Query Evaluation

At a fundamental and native level an XPath query follows a straightforward pattern.

The expression we're concerned with starts with a reference to data somewhere in the session, which is an expression which refers to one of the session's data models.

/data/child/node[@value = 'filtered']/output

A reference doesn't necessarily refer to a value. A reference is conceptually a way to refer to any number of instance elements named output which follow the path to get there.

For clarity in the evaluation process it is helpful to distinguish between a reference which refers to an abstract set of nodes (like the one above), and "fully qualified" reference to a single node, like

/data[1]/child[1]/node[4]/output[1]

The process by which the value of "unqualified" reference is requested and is covered to a usable value is what we're referring to here as evaluating a query.

Query evaluation takes an input tree reference like

/data/child/node[@value = 'filtered']/output

and converts it to a nodeset, which is a set of fully qualified references that match that query, like

/data[1]/child[1]/node[1]/output[1]
/data[1]/child[1]/node[4]/output[1]

Often only a single node is a member of the nodeset, in which case the result of a query is the contents of the requested node. In other cases the nodeset will have multiple elements, and are only useful as inputs to functions like count(), sum(), or join().

Query Evaluation Process Example

The native process for performing a query is to walk each "step" of the provided query depth-first against the live instance, and including a fully qualified reference to each element which matches.

To evaluate the XPath expression

/data/state[@population > 1000]/city[@capital = 'yes']/name

against a sample body

<data>
  <state population="1500">
    <city capital="no">
      <name>Framingham</name>
    </city>
    <city capital="yes">
      <name>Boston</name>
    </city>
  </state>
  <state population="500">
    <city capital="no">
      <name>Albuquerque</name>
    </city>
    <city capital="yes">
      <name>Santa Fe</name>
    </city>
  </state>
  <state population="2000">
    <city capital="yes">
      <name>Albany</name>
    </city>
    <city capital="no">
      <name>New York</name>
    </city>
  </state>
</data>

we start at the root (/), and look for matches. Since we find one (data), we continue processing.

On the next step state[@population > 1000] we need to both check for children with matching names, and also evaluate a filter condition.

First the engine generates a candidate set of references which match the name pattern

/data[1]/state[1]
/data[1]/state[2]
/data[1]/state[3]

Then the engine identifies which of the candidate references is compatible with the filter condition, striking out any references which fail to meet the filter

/data[1]/state[1]/@population > 1000 = true
/data[1]/state[2]/@population > 1000 = false
/data[1]/state[3]/@population > 1000 = true

Once all candidate references are evaluated, we move on to the next step with our current working set

/data[1]/state[1]/
/data[1]/state[3]/

We'll evaluate each remaining step in the reference (city[@capital = 'yes']/name) against this working set. If it any point it is empty, an empty nodeset is returned and evaluation stops.

To evaluate the next step city[@capital = 'yes'], the engine once again generates a candidate set of matching references, based on the current instance.

/data[1]/state[1]/city[1]
/data[1]/state[1]/city[2]
/data[1]/state[3]/city[1]
/data[1]/state[3]/city[2]

and once again processes the filter expressions against those candidate references

/data[1]/state[1]/city[1]/@capital = 'yes' > false
/data[1]/state[1]/city[2]/@capital = 'yes' > true
/data[1]/state[3]/city[1]/@capital = 'yes' > true
/data[1]/state[3]/city[2]/@capital = 'yes' > false

returning the resulting working nodeset of

/data[1]/state[1]/city[2]
/data[1]/state[3]/city[1]

Finally, the last step (name) is applied to each working reference, resulting in the final nodeset

/data[1]/state[1]/city[2]/name[1]
/data[1]/state[3]/city[1]/name[1]

which would produce an error if referenced alone, IE:

<output value="/data/state[@population > 1000]/city[@capital = 'yes']/name"/>

but can be used with an expression like

<output value="join(", ", /data/state[@population > 1000]/city[@capital = 'yes']/name)"/>

to produce

Boston, Albany