Research Notes (DeepLearning4APIs)

This is a whiteboard for research notes: potential challenges, potential solutions and general observations.

Data representation/embedding:

We may need to resolve variables into the values they point to (i.e., data: x => data: { … }). This can be done using CommitMiner's static flow analysis.
We probably want to figure out how to encode uncommon variable and field names. For example, getProjectID is a variable name with three components (get, Project, ID). Neural machine translation models commonly break rare words into subword units, e.g. with byte-pair encoding (BPE). Note that word2vec itself embeds whole words; fastText extends it with character n-grams. In our case, we need an embedding for source code rather than natural language. In our example, we would want to break the variable name up into get@@, @@project@@ and @@id. We may also want to encode some information about the variable's type or the context in which it is being used.
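
A minimal sketch of the identifier splitting described above, assuming camelCase and underscore naming conventions. The function names split_identifier and mark_subwords are hypothetical, and the @@ markers follow the notation used in these notes rather than any particular tool's output format.

```python
import re


def split_identifier(name):
    """Split an identifier into lowercase components.

    Handles underscores, camelCase and runs of capitals (acronyms).
    This is a heuristic, not a learned subword vocabulary such as BPE.
    """
    parts = []
    for chunk in re.split(r"[_$]+", name):
        # Acronym run | optionally capitalised word | digit run
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def mark_subwords(parts):
    """Attach @@ on every side where a component joins a neighbour."""
    last = len(parts) - 1
    marked = []
    for i, part in enumerate(parts):
        prefix = "@@" if i > 0 else ""
        suffix = "@@" if i < last else ""
        marked.append(prefix + part + suffix)
    return marked


print(mark_subwords(split_identifier("getProjectID")))
# ['get@@', '@@project@@', '@@id']
```

A rule-based split like this keeps the vocabulary small without training data. A learned segmentation could replace split_identifier later without changing the marking step.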

The NMT LSTM network provided by Google processes its input as a linear sequence, which doesn't allow us to take advantage of the highly structured nature of a programming language's AST. The Tree-Structured LSTM (Tree-LSTM) network appears to address this: the network mirrors the tree, and the input of an LSTM node includes the hidden states of that node's child LSTM nodes. Intuitively, this could be very useful for processing larger source code slices (i.e., ones that include multiple statements).
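
To make the tree-structured idea concrete, here is a rough sketch of a Child-Sum Tree-LSTM cell applied bottom-up over a small tree. It is an illustration only: states are scalars rather than vectors, the weights in PARAMS are arbitrary hypothetical constants rather than trained values, and the Node class is a stand-in for a real AST node with an embedded input.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Node:
    x: float  # scalar stand-in for the node's input embedding
    children: List["Node"] = field(default_factory=list)


# Hypothetical fixed weights per gate: (W on input, U on child state, bias).
PARAMS = {
    "i": (0.5, 0.4, 0.0),  # input gate
    "f": (0.3, 0.6, 1.0),  # forget gate (one evaluation per child)
    "o": (0.7, 0.2, 0.0),  # output gate
    "u": (0.9, 0.5, 0.0),  # candidate update
}


def gate(name: str, x: float, h: float, squash) -> float:
    w, u, b = PARAMS[name]
    return squash(w * x + u * h + b)


def tree_lstm(node: Node) -> Tuple[float, float]:
    """Return (h, c) for a node, computed from its children's states."""
    child_states = [tree_lstm(child) for child in node.children]
    h_sum = sum(h for h, _ in child_states)  # 0 for a leaf

    i = gate("i", node.x, h_sum, sigmoid)
    o = gate("o", node.x, h_sum, sigmoid)
    u = gate("u", node.x, h_sum, math.tanh)

    # Each child gets its own forget gate, driven by that child's hidden state.
    c = i * u + sum(
        gate("f", node.x, h_k, sigmoid) * c_k for h_k, c_k in child_states
    )
    h = o * math.tanh(c)
    return h, c


# A three-node tree: one parent with two leaf children.
root = Node(0.5, [Node(1.0), Node(-1.0)])
h_root, c_root = tree_lstm(root)
```

The recursion visits children before parents, so the state at the root summarizes the whole subtree. This is what a linear LSTM over the token sequence cannot express directly.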