Checking satisfiability of regular expressions
Most modern programming languages support extended regular expressions (regexes) with additional features. Some of these features make it possible to write regular expressions that can never match an input string.
There is no widely adopted standard when it comes to regexes. However, the most adopted features that might render a regex unsatisfiable are:
- Lookarounds are a construct that enables the programmer to use assertions. The regular expression inside the lookaround has to match for the assertion to hold, but it does not consume characters. It can either apply to the characters after it, in which case it is called a lookahead, or to the preceding characters, in which case it is called a lookbehind. Lookarounds can be positive or negative; if negative, their pattern must *not* match for the assertion to hold. The usual syntax is `(?=...)`/`(?<=...)` for positive lookahead/lookbehind and `(?!...)`/`(?<!...)` for negative lookahead/lookbehind.
- Possessive quantifiers provide a way to repeat a pattern. In contrast to lazy/greedy quantifiers, the regex engine will not backtrack once the pattern has been matched. This feature addresses practical concerns: it speeds up matching and prevents catastrophic backtracking. The most common syntax is to append a `+` to a regular quantifier.
- Anchors are simply shorthand for some common assertions. Usually, `^` and `$` assert that the match is at the start (resp. end) of the string. Word boundaries `\b` only match if the current position is at the beginning or end of a word.
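These features can already combine into patterns that can never match. As an illustrative sketch using Python's `re` module (which supports lookarounds and anchors, though possessive quantifiers only since Python 3.11):

```python
import re

# A lookahead asserts without consuming: after "a", (?=b) requires the next
# character to be "b", but the pattern then consumes "c". The two requirements
# contradict each other, so this regex can never match any input.
impossible = re.compile(r"a(?=b)c")
assert impossible.search("abc") is None
assert impossible.search("ac") is None

# Anchors can cause the same problem: "$" asserts the end of the string,
# so no character can be consumed after it.
impossible_anchor = re.compile(r"a$b")
assert impossible_anchor.search("ab") is None

# For contrast, a satisfiable lookahead: "a", then assert "b", then consume it.
possible = re.compile(r"a(?=b)b")
assert possible.search("ab") is not None
```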
Although other extensions might not cause impossible matching by themselves, we also want to handle them since they interact with the above features. One particularly problematic feature is backreferences.
Classical regular expressions describe the class of languages called regular. This is nice because many properties of such languages are decidable, emptiness included.
Thankfully, most extensions do not add expressive power to regular expressions; they only provide conciseness. For example, a lookaround can concisely describe a language that would require a much more involved regular expression if lookarounds were not available.
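To illustrate with a hypothetical example: a lookahead expressing "a lowercase word containing at least one digit" accepts exactly the same strings as a plain regular expression that spells out where the digit occurs:

```python
import re

# With a lookahead: assert "contains a digit", then match the whole word.
with_lookahead = re.compile(r"^(?=.*[0-9])[a-z0-9]+$")
# Without lookarounds: the digit's position has to be made explicit.
plain = re.compile(r"^[a-z0-9]*[0-9][a-z0-9]*$")

# Both describe the same (regular) language.
for word in ["abc1", "1", "a2b3", "abc", ""]:
    assert bool(with_lookahead.match(word)) == bool(plain.match(word))
```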
The notable exception is backreferences. Regular expressions extended with backreferences describe a strictly broader set of languages, and checking emptiness of languages in this set is not decidable.
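A quick way to see the extra power (illustrative sketch): with a backreference, a regex can recognize the language of "squares" ww, which is not regular.

```python
import re

# (.+)\1 captures a non-empty prefix and requires the rest of the string to
# repeat it exactly, recognizing { ww : w non-empty } -- a language that no
# classical (backreference-free) regular expression can describe.
square = re.compile(r"^(.+)\1$")
assert square.match("abab") is not None  # w = "ab"
assert square.match("aa") is not None    # w = "a"
assert square.match("aba") is None       # not of the form ww
```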
This is what is currently used in sonar-analyzer-commons. We construct the equivalent NFA for the lookaround/possessive-quantifier subexpression and for the expression that follows it. We then check whether the intersection/difference of those two languages is empty by running the two automata together symbolically, backtracking when they cannot match the same character. This is a brute-force approach: for negative lookarounds and possessive quantifiers, it requires testing all possible paths of one of the automata.
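The emptiness check at the heart of this approach can be sketched as a product construction. The following is a hypothetical minimal version (the NFA encoding and names are mine, not the sonar-analyzer-commons implementation): a breadth-first search over pairs of states, looking for a word both automata accept.

```python
from collections import deque

def intersection_is_empty(nfa1, nfa2, alphabet):
    """Run two NFAs in parallel over the product of their state spaces.

    Each NFA is a dict with keys "start", "accept" (a set of states), and
    "delta", mapping (state, char) to a set of successor states.
    Returns True iff no string is accepted by both automata.
    """
    start = (nfa1["start"], nfa2["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        q1, q2 = queue.popleft()
        if q1 in nfa1["accept"] and q2 in nfa2["accept"]:
            return False  # both NFAs accept the same word: intersection non-empty
        for ch in alphabet:
            for n1 in nfa1["delta"].get((q1, ch), ()):
                for n2 in nfa2["delta"].get((q2, ch), ()):
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        queue.append((n1, n2))
    return True  # product exhausted without reaching a jointly accepting state

# NFAs for the single-character languages {"a"} and {"b"}: their intersection
# is empty, while the intersection of {"a"} with itself is not.
nfa_a = {"start": 0, "accept": {1}, "delta": {(0, "a"): {1}}}
nfa_b = {"start": 0, "accept": {1}, "delta": {(0, "b"): {1}}}
assert intersection_is_empty(nfa_a, nfa_b, "ab")
assert not intersection_is_empty(nfa_a, nfa_a, "ab")
```

For the language *difference* needed by negative lookarounds, the second automaton would first have to be complemented, which is where the brute-force path enumeration comes in.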
Here, regexes are modeled as constraints over an arbitrary string, and an SMT solver is asked whether such a string exists. The theory used in this case is the theory of Unicode Strings in the SMT-LIB format; most solvers understand this format. For example, we model the regex `a(?=.b)[^c]*` as:
```
(declare-fun s1 () String)
(declare-fun s2 () String)
(declare-fun s3 () String)
(declare-fun s4 () String)
(declare-fun s5 () String)
(declare-fun s6 () String)
(assert
  (and (str.in_re s5 re.all)
       (str.in_re s3 (str.to_re "a"))
       (str.in_re s1 (re.++ re.allchar (str.to_re "b")))
       (str.prefixof s1 s2)
       (str.in_re s4 (re.* (re.inter re.allchar (re.comp (str.to_re "c")))))
       (str.in_re s6 re.all)
       (= (str.++ s4 s6) s2)))
(check-sat)
```
In practice, we can use java-smt, a library that provides a unified interface to many SMT solvers. However, not all solvers support the theory of strings: only Z3 and CVC4/CVC5 do, and they do not run on the JVM. This means that if this solution were adopted, it would be necessary to package the correct binaries for one of these solvers.
With this approach, we first convert the regex into an alternating finite automaton (AFA). AFAs have two types of transitions: universal and existential (NFAs, in comparison, only have existential transitions). Universal transitions enable efficient modeling of lookaheads, and possibly lookbehinds. Possessive quantifiers and anchors can also be modeled.
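As a hypothetical sketch of the universal/existential distinction (the toy encoding below is mine, not a full regex-to-AFA translation): an AFA run succeeds on an existential transition if *some* successor accepts, and on a universal transition only if *all* successors accept. A lookahead such as `a(?=b)b` can then be modeled with a single universal transition that forks into the assertion branch and the main branch.

```python
def afa_accepts(afa, word, state=None, pos=0):
    """Evaluate an AFA on a word. Transitions map (state, char) to
    ("exists", successors) or ("forall", successors); a missing transition
    behaves as ("exists", []) and therefore rejects."""
    state = afa["start"] if state is None else state
    if pos == len(word):
        return state in afa["accept"]
    kind, successors = afa["delta"].get((state, word[pos]), ("exists", []))
    combine = any if kind == "exists" else all
    return combine(afa_accepts(afa, word, s, pos + 1) for s in successors)

# Modeling a(?=b)b on two-character inputs: on "a", fork universally into the
# lookahead branch (which asserts "b") and the main branch (which consumes "b").
afa = {
    "start": "s",
    "accept": {"f"},
    "delta": {
        ("s", "a"): ("forall", ["look", "main"]),
        ("look", "b"): ("exists", ["f"]),
        ("main", "b"): ("exists", ["f"]),
    },
}
assert afa_accepts(afa, "ab")
assert not afa_accepts(afa, "ac")
```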
To implement and test...
The evaluation was done using a set of ~7700 regexes extracted from GitHub repositories. I believe this dataset is representative of the kind of inputs sonar analyzers have to work with.
Some notable facts about this data: ~96% of the regexes do not contain any construct that can cause emptiness, and can therefore be trivially verified as always satisfiable. Of the remaining ~300 regexes, only one has actually been found to be impossible to match.
| | Running NFAs in parallel | Solving string constraints |
|---|---|---|
| Positive lookahead | Yes | Yes |
| Negative lookahead | Yes | Yes |
| Positive lookbehind | No | Yes |
| Negative lookbehind | No | Yes |
| Nested lookarounds | No | Possible |
| Anchors | Yes | Yes |
| Possessive quantifiers | Yes | Yes |
| Combinations of the above | No | Yes |
| Bounded loops | Approx. | Yes |
| | Running NFAs in parallel | Solving string constraints |
|---|---|---|
| Average analysis time | 22.6 µs | 226 ms |
| Median analysis time | 5 µs | 23.5 ms |
| Worst analysis time | 345 µs | 3.63 s |
| FPs (estimated) | 1 | 0 |
| FNs | 3 | 3 |
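A quick sanity check on the averages in the table: the gap is roughly four orders of magnitude.

```python
# Average analysis times from the table above.
nfa_avg_seconds = 22.6e-6   # 22.6 µs
smt_avg_seconds = 226e-3    # 226 ms

speedup = smt_avg_seconds / nfa_avg_seconds
assert round(speedup) == 10000  # the NFA approach is ~10^4 times faster on average
```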
(#1) is the most complete solution. It can easily model all types of lookarounds, all anchors, and possessive quantifiers. It also seems possible, and somewhat easy, to support backreferences, or at least an approximation of them. The most problematic situation is assertions defined inside a repetition that apply to characters outside of the repeated subpattern: handling them requires unrolling the loop, which is not a fully accurate solution.
(#2) is, in contrast, the least complete approach. The main issue is that it can only take one construct into account at a time. It also lacks support for lookbehinds.
(#3) can support all features. I suspect it would be difficult to work with backreferences.
In practice, (#2) approximates a lot. However, since the overwhelming majority of the dataset is satisfiable, these approximations rarely affect the verdict.
Although the SMT solver is faster in theory, most regexes in the benchmark are not very complex, at least in terms of the constructs they use (some of them can be quite challenging to read). Therefore, the more naive approach has the best performance, by several orders of magnitude.
Even with more complex cases, the SMT solver would be unlikely to win in terms of performance, since its base overhead already dominates the analysis time.
Conclusion: since most inputs are simple, the brute-force approach performs far better.
It seems that a majority of regexes in the wild do not use advanced features and are usually pretty simple. Therefore, I think simple approaches will tend to have better performance.
Using string constraints to model the problem feels very natural, and the java-smt API is nice to work with and well documented.
Tiny changes in the input can result in huge differences in run time. Take for example the input `a(?!b*c)`. In partial matching mode, it corresponds to the following constraints:
```
(declare-fun s2 () String)
(declare-fun s5 () String)
(declare-fun s0 () String)
(declare-fun s3 () String)
(declare-fun s4 () String)
(assert (let ((a!1 (forall ((s1 String))
                     (! (let ((a!1 (str.in_re s1
                                     (re.++ (re.* (str.to_re "b"))
                                            (str.to_re "c")))))
                          (=> a!1 (not (str.prefixof s1 s2))))
                        :weight 0))))
  (and (str.in_re s4 re.all)
       (str.in_re s3 (str.to_re "a"))
       (str.in_re s0 (str.to_re ""))
       a!1
       (str.in_re s5 re.all)
       (= s5 s2))))
```
It is found to be satisfiable by Z3 in under 60 ms. However, the slight variation `a+(?!b*c)`, modeled as
```
(declare-fun s2 () String)
(declare-fun s5 () String)
(declare-fun s0 () String)
(declare-fun s3 () String)
(declare-fun s4 () String)
(assert (let ((a!1 (str.in_re s3 (re.++ (str.to_re "a") (re.* (str.to_re "a")))))
              (a!2 (forall ((s1 String))
                     (! (let ((a!3 (str.in_re s1
                                     (re.++ (re.* (str.to_re "b"))
                                            (str.to_re "c")))))
                          (=> a!3 (not (str.prefixof s1 s2))))
                        :weight 0))))
  (and (str.in_re s4 re.all)
       a!1
       (str.in_re s0 (str.to_re ""))
       a!2
       (str.in_re s5 re.all)
       (= s5 s2))))
```
has to be timed out, since no solution is found within 20 s. I have no good explanation for this, since the change only impacts the single constraint on variable `s3`. I am guessing it has something to do with the universal quantification (necessary to model negative lookarounds). Maybe the theory of strings has some problems, or I am just missing something. It would be interesting to see whether this problem also appears with CVC5 as a backend (once support lands in java-smt). The fact that the solver has to be treated as a black box, and that its performance is so unpredictable, makes it difficult to justify its usage for other purposes.
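For what it is worth, a backtracking engine confirms instantly that the slow variant is satisfiable:

```python
import re

# a+(?!b*c) matches "a" immediately: after consuming "a", the negative
# lookahead succeeds because no "b*c" follows at that position.
assert re.search(r"a+(?!b*c)", "a") is not None
# The fast variant a(?!b*c) behaves the same way.
assert re.search(r"a(?!b*c)", "ad") is not None
# And the negative lookahead does reject when "b*c" follows the "a".
assert re.search(r"a(?!b*c)", "abc") is None
```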