You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have found a deeper reason for the non-deterministic feature of the original cflow: it is the non-deterministic sequence of generating new taints.
As I showed last time, the non-deterministic output paths are caused by the "unsorted" successor(the open list that contains taints, which is used in dfs) during the paths reconstruction stage. To be specific, "unsorted" is caused by
(1) the original sequence in successor is stochastic and
(2) the stable sort based on string representation of taint can't make successor sorted rigidly.
Therefore, the exploration sequence of taints is non-deterministic, and it can cause different outputs when running cflow for several times.
I said reason (1) is due to the non-deterministic feature of Soot last time. Actually, after digging into Soot, I find that reason (1) is caused by non-deterministic sequence of generating new taints (So Soot should not be blamed). In that case, the successor of a taint may have random sequence before the sort. The non-deterministic sequence of generating new taints is caused by the following two reasons:
1. Non-deterministic sequence of analysis
Firstly, the sequence of analysis is non-deterministic. In original cflow, each analysis depends on both method and entryTaint(Can be viewed as context). In the iteration of analysis(the code is shown below),
without any guarantee of sorting. Therefore, when cflow is run several times, the sequence of analysis for the same method in the same iteration may be different.
This non-determinism can affect the sequence of taint generation. For example, for method foo() with two entry taints: t1 on r1.f1 and r1.f2(f1 and f2 are field reference on r1).
if the analysis on foo() with entry taint t1 is first scheduled, then new taint on r2.f1 will be generated first. Conversely, if the analysis on foo() with entry taint t2 is first scheduled, then new taint on r2.f2 will be generated first. Therefore, the sequence of taint generation is affected.
2. Non-deterministic sequence of propagating taint
When cflow is propagating taints from set in to set out, it needs to iterate over taints in set in by
for (Taintt : in)
Since in is a HashSet, this iteration does not guarantee the sorted sequence.
In this case, the sequence of generating new taints is also non-deterministic.
Based on two reasons above, I can claim that successor of each taint is non-deterministic. (Note that this problem has nothing to do with Soot, since given a method body and its in-set, on condition that the sequence of iteration is deterministic, Soot will run a work-list algorithm and finally reach a deterministic fixed-point)
By the way, besides the UniqueStmt as I mentioned last time, I have tried different solutions to solve this problem.
I tried to set the class of successor as TreeSet to make it sorted, but I find that it still needs more flow information. For example, in the control flow graph below, invoke statement at label 18 has the same string representation with invoke statement at label 34, and their hash code is non-deterministic(That is, hash code of the first statement may be bigger in the first run, but it may be smaller in the second run). So it cannot compare taints on those statements deterministically.
I also tried to set the class of all in set and out set as TreeSet, and use a TreeSet to store and iterate over the entry taints of each method. In this way, although the sequence of generating new taints is sorted and deterministic, it has expensive cost in adding and removing operation of TreeSet(Takes 8 min to finish analyzing Hadoop_common with -spark option).
Therefore, I still use UniqueStmt to solve this problem of non-determinism, which is easy to implement and has an acceptable performance(Analysis time from 1 min 10 s in original cflow to 1 min 40 s after the revision).
The text was updated successfully, but these errors were encountered:
I have found a deeper reason for the non-deterministic feature of the original
cflow
: it is the non-deterministic sequence of generating new taints.As I showed last time, the non-deterministic output paths are caused by the "unsorted"
successor
(the open list that contains taints, which is used indfs
) during the paths reconstruction stage. To be specific, "unsorted" is caused by(1) the original sequence in
successor
is stochastic and(2) the stable sort based on string representation of taint can't make
successor
sorted rigidly.Therefore, the exploration sequence of taints is non-deterministic, and it can cause different outputs when running
cflow
for several times.I said reason (1) is due to the non-deterministic feature of
Soot
last time. Actually, after digging intoSoot
, I find that reason (1) is caused by non-deterministic sequence of generating new taints (SoSoot
should not be blamed). In that case, the successor of a taint may have random sequence before the sort. The non-deterministic sequence of generating new taints is caused by the following two reasons:1. Non-deterministic sequence of analysis
Firstly, the sequence of analysis is non-deterministic. In original
cflow
, each analysis depends on bothmethod
andentryTaint
(Can be viewed as context). In the iteration of analysis(the code is shown below),entryTaints
is aHashSet
and is iterated over inwithout any guarantee of sorting. Therefore, when
cflow
is run several times, the sequence of analysis for the same method in the same iteration may be different.This non-determinism can affect the sequence of taint generation. For example, for method
foo()
with two entry taints:t1
onr1.f1
andr1.f2
(f1
andf2
are field reference onr1
).if the analysis on
foo()
with entry taintt1
is first scheduled, then new taint onr2.f1
will be generated first. Conversely, if the analysis onfoo()
with entry taintt2
is first scheduled, then new taint onr2.f2
will be generated first. Therefore, the sequence of taint generation is affected.2. Non-deterministic sequence of propagating taint
When
cflow
is propagating taints from setin
to setout
, it needs to iterate over taints in setin
bySince
in
is aHashSet
, this iteration does not guarantee the sorted sequence.In this case, the sequence of generating new taints is also non-deterministic.
Based on two reasons above, I can claim that
successor
of each taint is non-deterministic. (Note that this problem has nothing to do withSoot
, since given a method body and its in-set, on condition that the sequence of iteration is deterministic,Soot
will run a work-list algorithm and finally reach a deterministic fixed-point)By the way, besides the
UniqueStmt
as I mentioned last time, I have tried different solutions to solve this problem.successor
asTreeSet
to make it sorted, but I find that it still needs more flow information. For example, in the control flow graph below, invoke statement at label 18 has the same string representation with invoke statement at label 34, and their hash code is non-deterministic(That is, hash code of the first statement may be bigger in the first run, but it may be smaller in the second run). So it cannot compare taints on those statements deterministically.in
set andout
set asTreeSet
, and use aTreeSet
to store and iterate over the entry taints of each method. In this way, although the sequence of generating new taints is sorted and deterministic, it has expensive cost in adding and removing operation ofTreeSet
(Takes 8 min to finish analyzingHadoop_common
with-spark
option).Therefore, I still use
UniqueStmt
to solve this problem of non-determinism, which is easy to implement and has an acceptable performance(Analysis time from 1 min 10 s in originalcflow
to 1 min 40 s after the revision).The text was updated successfully, but these errors were encountered: