About more findings on non-determinism of cflow #26

fanweneddie · 2021-10-23T14:10:03Z

I have found a deeper reason for the non-deterministic feature of the original cflow: it is the non-deterministic sequence of generating new taints.

As I showed last time, the non-deterministic output paths are caused by the "unsorted" successor(the open list that contains taints, which is used in dfs) during the paths reconstruction stage. To be specific, "unsorted" is caused by

(1) the original sequence in successor is stochastic and
(2) the stable sort based on string representation of taint can't make successor sorted rigidly.

Therefore, the exploration sequence of taints is non-deterministic, and it can cause different outputs when running cflow for several times.

I said reason (1) is due to the non-deterministic feature of Soot last time. Actually, after digging into Soot, I find that reason (1) is caused by non-deterministic sequence of generating new taints (So Soot should not be blamed). In that case, the successor of a taint may have random sequence before the sort. The non-deterministic sequence of generating new taints is caused by the following two reasons:

1. Non-deterministic sequence of analysis

Firstly, the sequence of analysis is non-deterministic. In original cflow, each analysis depends on both method and entryTaint(Can be viewed as context). In the iteration of analysis(the code is shown below),

while (changed) {
            changed = false;
            logger.info("iter {}", iter);

            for (SootMethod sm : methodList) {
                Body b = sm.retrieveActiveBody();
                Set<Taint> entryTaints = new HashSet<>();
                entryTaints.addAll(methodSummary.get(sm).keySet());
                for (Taint entryTaint : entryTaints) {
                    TaintFlowAnalysis analysis = new TaintFlowAnalysis(b, sourceSinkManager, entryTaint,
                            methodSummary, methodTaintCache, taintWrapper);
                    analysis.doAnalysis();
                    sinks.addAll(analysis.getSinks());
                    changed |= analysis.isChanged();
                }
            }

            iter++;
        }

entryTaints is a HashSet and is iterated over in

for (Taint entryTaint : entryTaints)

without any guarantee of sorting. Therefore, when cflow is run several times, the sequence of analysis for the same method in the same iteration may be different.

This non-determinism can affect the sequence of taint generation. For example, for method foo() with two entry taints: t1 on r1.f1 and r1.f2(f1 and f2 are field reference on r1).

public void foo(Object) {
	Object r1,r2;
	r1 := @parameter0: Object;
	r2 = r1;
	return r2;
}

if the analysis on foo() with entry taint t1 is first scheduled, then new taint on r2.f1 will be generated first. Conversely, if the analysis on foo() with entry taint t2 is first scheduled, then new taint on r2.f2 will be generated first. Therefore, the sequence of taint generation is affected.

2. Non-deterministic sequence of propagating taint

When cflow is propagating taints from set in to set out, it needs to iterate over taints in set in by

for (Taint t : in)

Since in is a HashSet, this iteration does not guarantee the sorted sequence.

In this case, the sequence of generating new taints is also non-deterministic.

Based on two reasons above, I can claim that successor of each taint is non-deterministic. (Note that this problem has nothing to do with Soot, since given a method body and its in-set, on condition that the sequence of iteration is deterministic, Soot will run a work-list algorithm and finally reach a deterministic fixed-point)

By the way, besides the UniqueStmt as I mentioned last time, I have tried different solutions to solve this problem.

I tried to set the class of successor as TreeSet to make it sorted, but I find that it still needs more flow information. For example, in the control flow graph below, invoke statement at label 18 has the same string representation with invoke statement at label 34, and their hash code is non-deterministic(That is, hash code of the first statement may be bigger in the first run, but it may be smaller in the second run). So it cannot compare taints on those statements deterministically.

I also tried to set the class of all in set and out set as TreeSet, and use a TreeSet to store and iterate over the entry taints of each method. In this way, although the sequence of generating new taints is sorted and deterministic, it has expensive cost in adding and removing operation of TreeSet(Takes 8 min to finish analyzing Hadoop_common with -spark option).

Therefore, I still use UniqueStmt to solve this problem of non-determinism, which is easy to implement and has an acceptable performance(Analysis time from 1 min 10 s in original cflow to 1 min 40 s after the revision).

The text was updated successfully, but these errors were encountered:

fanweneddie added the Discussion Something to discuss label Oct 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About more findings on non-determinism of cflow #26

About more findings on non-determinism of cflow #26

fanweneddie commented Oct 23, 2021 •

edited

Loading

About more findings on non-determinism of cflow #26

About more findings on non-determinism of cflow #26

Comments

fanweneddie commented Oct 23, 2021 • edited Loading

1. Non-deterministic sequence of analysis

2. Non-deterministic sequence of propagating taint

fanweneddie commented Oct 23, 2021 •

edited

Loading