Research Notes (DeepLearning4APIs)

This is a whiteboard for research notes, including potential challenges, potential solutions, and general observations.

Data representation/embedding:

We may need to resolve variables into the values they point to (i.e., data: x => data: { … }). This can be done using CommitMiner's static flow analysis. UPDATE: As a proof of concept, CommitMiner now does this for the JSON.stringify example. However, I'm not sure this occurs very often in practice, so I'm going to stop development on this feature for now.
We probably want to figure out how to encode uncommon variable and field names. For example, getProjectID is a variable name with three components. In neural machine translation (NMT) models, it is common to break rare words up into their constituent parts. For example, byte pair encoding (BPE) segments rare words into subword units for NMT, and fastText builds word embeddings from character n-grams; word2vec, by contrast, embeds whole words. In our case, we need an embedding for source code rather than natural language. In our example, we would want to break the variable name up into get@@, @@project@@ and @@id. We may also want to encode some information about the variable's type or the context in which it is being used.
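
The identifier splitting described above can be sketched as follows. This is a minimal illustration, not a tokenizer we have committed to: the function names (split_identifier, mark_subtokens) are made up for this note, the split is a fixed camelCase/snake_case rule rather than a learned segmentation, and lowercasing the parts is an assumption that throws away the original casing.

```python
import re

# Matches, in order: an acronym run followed by a capitalized word
# (e.g. "HTTP" in "HTTPServer"), a capitalized or lowercase word,
# a remaining acronym run, or a run of digits.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    return [part.lower() for part in _SUBTOKEN.findall(name)]


def mark_subtokens(parts, marker="@@"):
    """Add continuation markers on the sides where a subtoken was cut."""
    marked = []
    for i, part in enumerate(parts):
        prefix = marker if i > 0 else ""
        suffix = marker if i < len(parts) - 1 else ""
        marked.append(prefix + part + suffix)
    return marked


if __name__ == "__main__":
    print(mark_subtokens(split_identifier("getProjectID")))
    # prints ['get@@', '@@project@@', '@@id']
```

A name with a single component (e.g. data) comes back unmarked, so common short names keep their own vocabulary entry.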

The NMT LSTM network provided by Google processes its input as a linear chain, which doesn't allow us to take advantage of the highly structured nature of a programming language's AST. It looks like the Tree-Structured LSTM (Tree-LSTM) network solves this problem by using a tree-structured network in which the input of an LSTM node includes the outputs of that node's child LSTM nodes. Intuitively, this could be very useful for processing larger source code slices (i.e., ones that include multiple statements).
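
To make the "children feed the parent" idea concrete, here is a rough sketch of the Child-Sum variant of the Tree-LSTM. Several things are assumptions for illustration only: the choice of the Child-Sum variant (these notes don't settle on one), the scalar hidden state and scalar input feature (a real model would use vectors and learned weight matrices), and the hand-picked values in PARAMS.

```python
import math
from dataclasses import dataclass, field


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Node:
    """An AST node with a scalar input feature (stand-in for an embedding)."""
    x: float
    children: list = field(default_factory=list)


# Hypothetical scalar parameters per gate: (input weight, child-state
# weight, bias) for the input, forget, output and update gates.
PARAMS = {
    "i": (0.5, 0.4, 0.0),
    "f": (0.3, 0.6, 1.0),
    "o": (0.7, 0.2, 0.0),
    "u": (0.9, 0.5, 0.0),
}


def tree_lstm(node, params=PARAMS):
    """Return (h, c) for node, computed bottom-up from its children."""
    child_states = [tree_lstm(child, params) for child in node.children]
    # The children's outputs are summed and fed into the parent's gates.
    h_sum = sum(h for h, _ in child_states)

    def gate(name, h, squash=sigmoid):
        w_x, w_h, b = params[name]
        return squash(w_x * node.x + w_h * h + b)

    i = gate("i", h_sum)
    o = gate("o", h_sum)
    g = gate("u", h_sum, squash=math.tanh)
    # One forget gate per child, so each subtree's memory is kept or
    # discarded independently of its siblings.
    c = i * g + sum(gate("f", h_k) * c_k for h_k, c_k in child_states)
    h = o * math.tanh(c)
    return h, c


if __name__ == "__main__":
    # A statement node with two child expression nodes.
    root = Node(1.0, [Node(0.5), Node(-0.5)])
    print(tree_lstm(root))
```

Because the child states are summed, this variant ignores child order. For ASTs, where argument order matters, an order-sensitive variant may be a better fit.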

We need to show that our technique can work for more than one bug pattern, and that it is more effective than other approaches by some metric. At a high level, this method will (1) improve the accuracy of pure static analysis when context is needed to figure out what arguments should be included in an API call, and (2) be able to detect patterns based on examples, without an analysis specification. The challenge is finding enough examples of incorrect code. We can mutate correct code to create training examples. However, because we have limited compute resources, we still need to figure out which APIs are problematic in practice and which mutations to apply to generate training examples.
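
One possible mutation, sketched below, drops a single argument from a correct API call to produce (buggy, correct) training pairs. This is only an illustration of the idea: the Call class, the drop_argument_mutants function and the api.request call are all hypothetical, and which mutations actually reflect real bugs is still the open question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """A simplified API call: callee name plus the source text of each argument."""
    callee: str
    args: tuple

    def render(self):
        return f"{self.callee}({', '.join(self.args)})"


def drop_argument_mutants(call):
    """Yield (buggy, correct) source pairs; each buggy call omits one argument."""
    for i in range(len(call.args)):
        mutated = Call(call.callee, call.args[:i] + call.args[i + 1:])
        yield mutated.render(), call.render()


if __name__ == "__main__":
    # Hypothetical API call used purely as an example.
    correct = Call("api.request", ("url", "options", "callback"))
    for buggy, fixed in drop_argument_mutants(correct):
        print(buggy, "=>", fixed)
```

A call with n arguments yields n mutants, so the number of training pairs grows linearly with the code we mine. That keeps the compute cost predictable, but it only covers the "missing argument" pattern.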