-
Notifications
You must be signed in to change notification settings - Fork 14
Primer: XPath Query Evaluation
This document is a primer on XPath evaluation basics to help understand how they are implemented and optimized in CommCare.
At a fundamental and native level an XPath query follows a straightforward pattern.
The expression we're concerned with starts with a reference to data somewhere in the session, which is an expression which refers to one of the session's data models.
/data/child/node[@value = 'filtered']/output
A reference doesn't necessarily refer to a value. A reference is conceptually a way to refer to any number of instance elements named output
which follow the path to get there.
For clarity in the evaluation process it is helpful to distinguish between a reference which refers to an abstract set of nodes (like the one above), and "fully qualified" reference to a single node, like
/data[1]/child[1]/node[4]/output[1]
The process by which the value of "unqualified" reference is requested and is covered to a usable value is what we're referring to here as evaluating a query.
Query evaluation takes an input tree reference like
/data/child/node[@value = 'filtered']/output
and converts it to a nodeset, which is a set of fully qualified references that match that query, like
/data[1]/child[1]/node[1]/output[1]
/data[1]/child[1]/node[4]/output[1]
Often only a single node is a member of the nodeset, in which case the result of a query is the contents of the requested node. In other cases the nodeset will have multiple elements, and are only useful as inputs to functions like count()
, sum()
, or join()
.
In CommCare, XPath expressions are evaluated against an evaluation context, which provides a frame for evaluating a query. Among other things, the context provides
- The relative context of a query - IE: "What should a relative query (
../some_node
) evaluate to right now?" - The "original" context of a query - IE: If we are in a nested query, what should
current()
evaluate to? - The body of instances available for evaluation
- XPath variables (if relevant)
In order to keep state isolated, the operation to "expand" a reference into a nodeset of fully qualified references is performed within an evaluation context.
Any time a new context is needed (for instance, inside of a predicate expression where ../some_node
needs to refer to a new relative reference), a new derivative evaluation context is created and used to expand and evaluate the new expressions.
The native process for performing a query is to walk each "step" of the provided query depth-first against the live instance, and including a fully qualified reference to each element which matches.
To evaluate the XPath expression
/data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name
against a sample body
<data>
<state population="1500">
<city capital="no" type="town">
<name>Woburn</name>
</city>
<city capital="yes" type="city">
<name>Boston</name>
</city>
<city capital="no" type="city">
<name>Cambridge</name>
</city>
</state>
<state population="500">
<city capital="no" type="city">
<name>Albuquerque</name>
</city>
<city capital="yes" type="city">
<name>Santa Fe</name>
</city>
</state>
<state population="2000">
<city capital="yes" type="city">
<name>Albany</name>
</city>
<city capital="no" type="city">
<name>New York</name>
</city>
</state>
</data>
we start at the root (/
), and look for matches. Since we find one (data
), we continue processing with just that node as our working set
/data[1]/
On the next step state[@population > 1000]
we need to both check for children with matching names, and also evaluate a filter condition.
First the engine generates a candidate set of references which match the name pattern
/data[1]/state[1]
/data[1]/state[2]
/data[1]/state[3]
Then the engine identifies which of the candidate references is compatible with the filter condition, striking out any references which fail to meet the filter
/data[1]/state[1]/@population > 1000 = true
/data[1]/state[2]/@population > 1000 = false
/data[1]/state[3]/@population > 1000 = true
Once all candidate references are evaluated, we move on to the next step with our current working set
Current Working Set:
/data[1]/state[1]/
/data[1]/state[3]/
Remaining query to evaluate:
city[@type='city'][@capital = 'yes']/name
We'll evaluate each remaining step in the reference (city[@type='city'][@capital = 'yes']/name
) against this working set. If at any point it is empty, an empty nodeset is returned and evaluation stops.
To evaluate the next step city[@type='city'][@capital = 'yes']
, the engine once again generates a candidate set of matching references, based on the current instance.
/data[1]/state[1]/city[1]
/data[1]/state[1]/city[2]
/data[1]/state[1]/city[3]
/data[1]/state[3]/city[1]
/data[1]/state[3]/city[2]
and once again processes the filter expressions against those candidate references.
Predicates are evaluated depth-first and only if the previous predicate passed for that reference.
/data[1]/state[1]/city[1]/@type = 'town' > false
/data[1]/state[1]/city[2]/@type = 'city' > true
/data[1]/state[1]/city[2]/@capital = 'yes' > true
/data[1]/state[1]/city[3]/@type = 'city' > true
/data[1]/state[1]/city[3]/@capital = 'no' > false
/data[1]/state[3]/city[1]/@type = 'city' > true
/data[1]/state[3]/city[1]/@capital = 'yes' > true
/data[1]/state[3]/city[2]/@type = 'city' > true
/data[1]/state[3]/city[2]/@capital = 'yes' > false
moving forward with the resulting working nodeset below
Current Working Set:
/data[1]/state[1]/city[2]
/data[1]/state[3]/city[1]
Remaining query to evaluate:
name
Finally, the last step (name
) is applied to each working reference, resulting in the final nodeset
/data[1]/state[1]/city[2]/name[1]
/data[1]/state[3]/city[1]/name[1]
which would produce an error if referenced alone, IE:
<output value="/data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name"/>
but can be used with an expression like
<output value="join(", ", /data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name)"/>
to produce
Boston, Albany