-
Notifications
You must be signed in to change notification settings - Fork 14
Primer: XPath Query Evaluation
This document is a primer on XPath evaluation basics to help understand how they are implemented and optimized in CommCare.
At a fundamental and native level an XPath query follows a straightforward pattern.
The expression we're concerned with starts with a reference to data somewhere in the session, which is an expression which refers to one of the session's data models.
/data/child/node[@value = 'filtered']/output
A reference doesn't necessarily refer to a value. A reference is conceptually a way to refer to any number of instance elements named output
which follow the path to get there.
For clarity in the evaluation process it is helpful to distinguish between a reference which refers to an abstract set of nodes (like the one above), and "fully qualified" reference to a single node, like
/data[1]/child[1]/node[4]/output[1]
The process by which the value of "unqualified" reference is requested and is covered to a usable value is what we're referring to here as evaluating a query.
Query evaluation takes an input tree reference like
/data/child/node[@value = 'filtered']/output
and converts it to a nodeset, which is a set of fully qualified references that match that query, like
/data[1]/child[1]/node[1]/output[1]
/data[1]/child[1]/node[4]/output[1]
Often only a single node is a member of the nodeset, in which case the result of a query is the contents of the requested node. In other cases the nodeset will have multiple elements, and are only useful as inputs to functions like count()
, sum()
, or join()
.
The native process for performing a query is to walk each "step" of the provided query depth-first against the live instance, and including a fully qualified reference to each element which matches.
To evaluate the XPath expression
/data/state[@population > 1000]/city[@capital = 'yes']/name
against a sample body
<data>
<state population="1500">
<city capital="no">
<name>Framingham</name>
</city>
<city capital="yes">
<name>Boston</name>
</city>
<city capital="no">
<name>Cambridge</name>
</city>
</state>
<state population="500">
<city capital="no">
<name>Albuquerque</name>
</city>
<city capital="yes">
<name>Santa Fe</name>
</city>
</state>
<state population="2000">
<city capital="yes">
<name>Albany</name>
</city>
<city capital="no">
<name>New York</name>
</city>
</state>
</data>
we start at the root (/
), and look for matches. Since we find one (data
), we continue processing with just that node as our working set
/data[1]/
On the next step state[@population > 1000]
we need to both check for children with matching names, and also evaluate a filter condition.
First the engine generates a candidate set of references which match the name pattern
/data[1]/state[1]
/data[1]/state[2]
/data[1]/state[3]
Then the engine identifies which of the candidate references is compatible with the filter condition, striking out any references which fail to meet the filter
/data[1]/state[1]/@population > 1000 = true
/data[1]/state[2]/@population > 1000 = false
/data[1]/state[3]/@population > 1000 = true
Once all candidate references are evaluated, we move on to the next step with our current working set
/data[1]/state[1]/
/data[1]/state[3]/
We'll evaluate each remaining step in the reference (city[@capital = 'yes']/name
) against this working set. If it any point it is empty, an empty nodeset is returned and evaluation stops.
To evaluate the next step city[@capital = 'yes']
, the engine once again generates a candidate set of matching references, based on the current instance.
/data[1]/state[1]/city[1]
/data[1]/state[1]/city[2]
/data[1]/state[1]/city[3]
/data[1]/state[3]/city[1]
/data[1]/state[3]/city[2]
and once again processes the filter expressions against those candidate references
/data[1]/state[1]/city[1]/@capital = 'yes' > false
/data[1]/state[1]/city[2]/@capital = 'yes' > true
/data[1]/state[1]/city[3]/@capital = 'no' > false
/data[1]/state[3]/city[1]/@capital = 'yes' > true
/data[1]/state[3]/city[2]/@capital = 'yes' > false
returning the resulting working nodeset of
/data[1]/state[1]/city[2]
/data[1]/state[3]/city[1]
Finally, the last step (name
) is applied to each working reference, resulting in the final nodeset
/data[1]/state[1]/city[2]/name[1]
/data[1]/state[3]/city[1]/name[1]
which would produce an error if referenced alone, IE:
<output value="/data/state[@population > 1000]/city[@capital = 'yes']/name"/>
but can be used with an expression like
<output value="join(", ", /data/state[@population > 1000]/city[@capital = 'yes']/name)"/>
to produce
Boston, Albany