The goal of this feature is to train a neural network that learns how to split identifiers. Indeed, the rule-based `TokenParser` is not able to process some of them, such as `foobar` or `methodbase`. The main code will live in the `algorithms` directory, in a class named `NeuralTokenSplitter`. The implementation plan is as follows.
- Use `utils.engine.create_engine` to initialize a Spark session and the engine, with `args.repositories` as input in siva format. We plan to process the dataset of the 150k most-starred repositories across all languages.
- Code a new class named `CodeExtractor` in `transformers.basic` that returns both the source code as strings and the corresponding languages.
- Get the token stream of identifiers using `pygments.highlight` with the `RawTokenFormatter` and, as the lexer, the language reported by enry. Filter the token types with a callback. Design the code so that switching between Babelfish and pygments is easy.
- Collect the training dataset:
  - Select all identifiers splittable on special characters (`[^a-zA-Z]+`) or case changes, e.g. `foo_bar` and `methodBase`.
  - Use `TokenParser` from `algorithms.token_parser` to split them, lowercase them, and join them again, e.g. `foobar` and `methodbase`.
  - We then have X and Y pairs: `foobar` -> `foo bar`, `methodbase` -> `method base`.
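As a sketch of the token-stream step above, assuming pygments is installed: this uses `Lexer.get_tokens` directly rather than `highlight` plus a formatter, to keep the example short, and the `keep` argument plays the role of the token-type filter callback. The function name and defaults are illustrative, not the final API.

```python
from pygments.lexers import get_lexer_by_name
from pygments.token import Token


def extract_identifiers(code, language, keep=lambda t: t in Token.Name):
    """Yield identifier tokens from source code.

    `language` would come from enry in the real pipeline; here it is any
    lexer name pygments understands. `keep` filters the token types.
    """
    lexer = get_lexer_by_name(language)
    for ttype, value in lexer.get_tokens(code):
        if keep(ttype):
            yield value


# Pull identifiers out of a small Python snippet; keywords and
# punctuation are filtered away by the token-type callback.
idents = list(extract_identifiers(
    "def foo_bar(x):\n    return methodBase(x)\n", "python"))
```

Switching to Babelfish would then only mean swapping the body of `extract_identifiers` while keeping the same filtering interface.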
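The dataset-collection steps above can be sketched as follows; the regex is a stand-in for the real `TokenParser` rules, and the helper names are hypothetical:

```python
import re

# Split on runs of special characters ([^a-zA-Z]+) or on a
# lowercase-to-uppercase case change (zero-width match).
SPLIT_RE = re.compile(r"[^a-zA-Z]+|(?<=[a-z])(?=[A-Z])")


def is_splittable(identifier):
    """True if the identifier splits on special chars or case changes."""
    return SPLIT_RE.search(identifier) is not None


def make_example(identifier):
    """Return (X, Y): the joined lowercase form and its split form."""
    parts = [p.lower() for p in SPLIT_RE.split(identifier) if p]
    return "".join(parts), " ".join(parts)
```

Only identifiers for which `is_splittable` is true become training pairs, e.g. `make_example("foo_bar")` gives `("foobar", "foo bar")`; the model then has to recover Y from X alone.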
- Code a simple neural language model with Keras that relies on character-level inputs, as described in Character-Aware Neural Language Models, to split identifiers.
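Before the Keras model sees anything, the identifiers have to be encoded at the character level. A dependency-free sketch of that preprocessing — the alphabet, `maxlen`, and the per-character split mask are assumptions about the design, not decisions already made:

```python
# Map each character to an integer id; 0 is reserved for padding and
# out-of-alphabet characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR2ID = {c: i + 1 for i, c in enumerate(ALPHABET)}


def encode(identifier, maxlen=16):
    """Encode a lowercase identifier as a fixed-length id sequence."""
    ids = [CHAR2ID.get(c, 0) for c in identifier[:maxlen]]
    return ids + [0] * (maxlen - len(ids))


def split_mask(identifier, split, maxlen=16):
    """Binary target: 1 where a split occurs after this character."""
    mask, pos = [0] * maxlen, -1
    for word in split.split():
        pos += len(word)
        if 0 <= pos < maxlen:
            mask[pos] = 1
    mask[min(pos, maxlen - 1)] = 0  # no split after the last character
    return mask
```

With this framing, `encode("foobar")` is the model input and `split_mask("foobar", "foo bar")` the target, so the Keras model reduces to per-character binary classification over the sequence.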
- Produce evaluation plots and improve accuracy by playing with the metaparameters.

Other ideas:
- Use the context of identifiers
- Use Babelfish and focus on Python/Java languages
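For the evaluation step, a first metric could be exact-match accuracy over held-out (X, Y) pairs; the helper below is a hypothetical sketch, not part of the codebase:

```python
def split_accuracy(predictions, ground_truth):
    """Fraction of identifiers whose predicted split matches exactly."""
    if not ground_truth:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, ground_truth))
    return hits / len(ground_truth)


# Two of three predicted splits match the reference splits here.
acc = split_accuracy(["foo bar", "method base", "tokenparser"],
                     ["foo bar", "method base", "token parser"])
```

Plotting this accuracy while varying the metaparameters (sequence length, embedding size, number of layers) would give the evaluation curves mentioned above.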