Gnome-based transpiling, a way towards dialect compatibility #23

gabrielesilinic · 2023-05-02T13:57:15Z

gabrielesilinic
May 2, 2023
Collaborator

Gnome-based transpiling, or simply Gnomes/Otterkit Gnomes, will be an Otterkit feature meant to help the Otterkit compiler deal with non-standard or legacy COBOL codebases. Gnomes are meant to transpile sections of code on callback. This new kind of architecture sprang from a comment Gabriel (@KTSnowy) asked us to make about supporting non-standard compiler directives, which came out of a discussion between him and Simon (@GitMensch) from GnuCOBOL. So, I invented gnomes and refined the idea with Gabriel.

What can a gnome do?

Imagine gnomes as little knowledgeable workers that help the Otterkit COBOL compiler deal with what it is currently unable to handle.

Many gnomes can live in a single extension, and each gnome must cover a full statement or almost any kind of quirky extension added to the dialect. The gnomes' job is to transpile old or non-standard COBOL to modern, standard COBOL (COBOL 2023 is a full-fledged general-purpose programming language, so it should be capable of dealing with those kinds of things, unlike its predecessors).

This way the gnomes could also be used to ease the migration from legacy codebases to modern COBOL.

Thanks to the framework we will introduce, each gnome will have some level of standard behavior and tooling. Each gnome can declare its Otterkit COBOL library dependencies to add to the current Otterkit COBOL project, which will probably make heavy use of our C# interop feature. For this reason, gnomes will be implemented as one of the last features in the Otterkit ecosystem, but I deemed it important to discuss them now because they will play a fairly important role in solving a problem that affects COBOL: an insane number of compilers that love to take trips away from standard COBOL and sometimes even lack standard COBOL features.

A gnome will be able to hook in and be called back at various stages (preprocessing or plaintext, lexing, parsing). The gnome will have to determine the section of code it intends to replace and will define all Otterkit COBOL code using an API that declares the various statements and function calls as objects, thus avoiding the generation of illegal statements. This way, not only will we reduce the possibility of a gnome misbehaving, but we will also know that if a specific section cannot run or doesn't get parsed, it is a specific gnome's fault. We could also run the gnome on a different thread and terminate it if it doesn't return a result in a reasonable time, so that we don't have to deal with badly written gnomes.
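To make the two ideas in this paragraph concrete, here is a minimal Python sketch (Python only for brevity; the real API would presumably be C#). Everything in it is hypothetical: the `Display` statement object, the `run_gnome` helper, and the made-up legacy verb `SHOW` are illustrations, not part of Otterkit. It shows a statement being built as an object that refuses an obviously illegal form, and a gnome being run on a worker thread with a time limit.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Display:
    """Hypothetical statement object for DISPLAY (not an Otterkit type)."""
    operands: tuple

    def __post_init__(self):
        # Refuse to build a statement that could never be legal.
        if not self.operands:
            raise ValueError("DISPLAY needs at least one operand")

    def render(self) -> str:
        return "DISPLAY " + " ".join(self.operands)


def run_gnome(gnome, source: str, timeout_s: float):
    """Run a gnome on a worker thread; return None if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(gnome, source)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # treat the gnome as failed and move on
    finally:
        # Do not wait for a slow gnome; drop it if it has not started yet.
        pool.shutdown(wait=False, cancel_futures=True)


def show_gnome(source: str) -> Display:
    # Made-up dialect verb: "SHOW X" becomes "DISPLAY X".
    words = source.split()
    return Display(tuple(words[1:]))


def slow_gnome(source: str) -> Display:
    time.sleep(0.3)  # simulates a badly written gnome
    return Display(("TOO-LATE",))
```

One caveat with this sketch: Python cannot forcibly stop a running thread, so the slow gnome is only abandoned, not terminated. Real termination would need something like a separate process or cooperative cancellation.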

Gnome callback?

As we can all imagine, a single dialect will be made of many, many gnomes, from 100 to possibly 500. We don't know for sure, but what we do know is that whenever possible, it's best to let them rest until they can actually do the specific job they are very good at. For this reason, we determined that it's best to define a hook for each gnome. I thought about regex myself at first since it's the universal way to analyze text, though I knew it was not the only way. Gabriel then suggested token-based gnome callbacks for when a gnome wishes to be called back at the lexer level, while I suggested an OnError kind of gnome that combines the parser erroring out with a selector. In this last case, the job of the gnome is to fix the code so the parser can keep going.

Note: The callback mechanism is just an optimization. It will use simple enough selectors; the gnome itself must perform the final check. If it finds that a particular callback was a false positive, it must let Otterkit know by returning a specific value (to be defined), so Otterkit can give the task to the next candidate, if there is one.
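A rough Python sketch of how such a dispatch could work, under stated assumptions: the `NOT_MINE` sentinel stands in for the "specific value (to be defined)", the selector is a plain set of trigger tokens, and the `SHOW` verb and both gnomes are invented for illustration.

```python
NOT_MINE = object()  # hypothetical sentinel: "false positive, ask the next gnome"


class Gnome:
    def __init__(self, name, trigger_tokens, transform):
        self.name = name
        self.trigger_tokens = frozenset(trigger_tokens)  # cheap selector
        self.transform = transform  # performs the final check and the rewrite


def dispatch(gnomes, tokens):
    """Wake only the gnomes whose selector matches; return the first real result."""
    token_set = set(tokens)
    for gnome in gnomes:
        if gnome.trigger_tokens.isdisjoint(token_set):
            continue  # selector did not match, the gnome keeps resting
        result = gnome.transform(tokens)
        if result is not NOT_MINE:
            return gnome.name, result
    return None  # no gnome claimed this statement


def show_upon(tokens):
    # Handles only the made-up form "SHOW x UPON y".
    if "UPON" not in tokens:
        return NOT_MINE
    i = tokens.index("UPON")
    return ["DISPLAY"] + tokens[1:i] + ["UPON"] + tokens[i + 1:]


def show_plain(tokens):
    # Handles the made-up form "SHOW x".
    if tokens[0] != "SHOW":
        return NOT_MINE
    return ["DISPLAY"] + tokens[1:]


GNOMES = [
    Gnome("show-upon", {"SHOW"}, show_upon),
    Gnome("show-plain", {"SHOW"}, show_plain),
]
```

Both gnomes share the same selector, so for a plain `SHOW` the first one is woken, reports a false positive, and the statement is handed to the second.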

The power of a gnome

I personally think our gnomes should be very powerful. A parser gnome should be able to read a whole file and let the parser discard the last bits of statements that it tried to parse but somehow failed on. Gabriel, however, suggested we should actually try to determine the next statement and restrict the scope of a gnome. I argued that doing so could introduce many issues, since we can never be sure what even the current statement looks like. Imagine an alternative syntax for multi-line comments, which could cause trouble much like SQL injection does. A DISPLAY statement commented out in such an alternative way could introduce problems, since we don't know what the statement looks like or whether it contains things resembling normal COBOL statements.

Gnomes already introduce security issues by being able to add their own dependencies, so it probably doesn't matter if we let them be a bit freer. It's unfortunately almost impossible to correctly sandbox gnomes due to the role they have to fulfill, though we could introduce lower-level gnomes to at least make it easy to add common things dialects often like to arbitrarily add, such as built-in functions. Those gnomes would get much simpler selectors and would automatically point, for example, to a user-defined function.

The tools of a gnome

I determined that we cannot leave gnomes alone, as each gnome implementor making their own thing would result in a huge mess, much like what happened with C strings. So, we know we have to add some tools. Unfortunately, right now, we are not sure what a gnome may need except for one thing: name resolution. So, I will explain what we came up with so far.

A gnome's name resolution is a name resolution database dedicated to the gnomes. We cannot compromise the stability of our parser, but we recognize that gnomes and their developers may find it handy to have their own name resolution database ready to go. A gnome's name resolution database will be extremely similar to the internal name resolution database, which gnomes have read-only access to. Gnomes may define the scope a name is defined in, and in some scopes, they will be able to let the Otterkit parser read from the gnome name resolution database and consider any name in there as perfectly valid. The scope may also be gnome-only, gnomes of the same extension, or all gnomes, depending on the use case. The callback will take care to properly hide names from the gnome name resolution database.
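As a sketch of the scoping rules just described, a gnome name database could look roughly like this in Python. All names here (`Scope`, `GnomeNameDatabase`, the method names) are hypothetical, and refusing to redefine an internal name is an assumption of this sketch rather than a decided rule.

```python
from enum import Enum


class Scope(Enum):
    GNOME_ONLY = "gnome-only"
    SAME_EXTENSION = "same-extension"
    ALL_GNOMES = "all-gnomes"
    PARSER_VISIBLE = "parser-visible"  # the parser may accept the name as valid


class GnomeNameDatabase:
    def __init__(self, internal_names):
        # Read-only snapshot of the parser's own names.
        self._internal = frozenset(internal_names)
        self._entries = {}  # name -> (owner gnome, owner extension, scope)

    def define(self, name, gnome, extension, scope):
        # Assumption: gnomes may not shadow names the parser already knows.
        if name in self._internal:
            raise ValueError(f"{name} already exists in the internal database")
        self._entries[name] = (gnome, extension, scope)

    def visible_to_gnome(self, name, gnome, extension):
        if name in self._internal:
            return True  # every gnome can read internal names
        entry = self._entries.get(name)
        if entry is None:
            return False
        owner, owner_ext, scope = entry
        if scope is Scope.GNOME_ONLY:
            return (owner, owner_ext) == (gnome, extension)
        if scope is Scope.SAME_EXTENSION:
            return owner_ext == extension
        return True  # ALL_GNOMES and PARSER_VISIBLE

    def valid_for_parser(self, name):
        if name in self._internal:
            return True
        entry = self._entries.get(name)
        return entry is not None and entry[2] is Scope.PARSER_VISIBLE
```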

GitMensch · 2023-05-02T14:39:20Z

GitMensch
May 2, 2023

This sounds like a good approach (just recognize that common things will likely be resolved not only by UDF but also by CALL, but that's up to the gnome implementor and the task at hand).

Historically, many COBOL "shops" use their own macro processors. Do you see it as possible that a gnome "down in PROCEDURE DIVISION" can also define that a copybook must be inserted into WORKING-STORAGE? And that it must only be included if it isn't already in there (manually or by another gnome)?
If the answer is "no" then that's still OK, I'm just wondering what kind of macro processors may be replaced by gnomes.

Concerning the "parser error registration" - at least when we get to macros a "dual" approach that includes a regex option would be necessary.
I'm not sure but possibly there's also a dual option necessary for things like "D in column 7"?

7 replies

GitMensch May 2, 2023

As mentioned above: it may be important to automatically add entries in WORKING-STORAGE; doing so by telling Otterkit to "now parse this COBOL source file and add it to the program" would be useful.

For multi-enabling gnomes one could explicitly specify that the regex approach should only be used if the token one isn't enough.
But actually, if a gnome defines a regex then Otterkit can compile that once and just re-use it. I've found the performance of that to be commonly "good enough", while non-pre-compiled regex is too slow (experience in C and Java).
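A small Python illustration of the compile-once idea, using the standard `re` module. The `RegexSelector` class is a hypothetical name, and the pattern assumes the fixed-form reference format, where columns 1 to 6 are the sequence area and a `D` in column 7 marks a debugging line.

```python
import re


class RegexSelector:
    """Hypothetical selector: the pattern is compiled once, at registration time."""

    def __init__(self, pattern: str):
        self._compiled = re.compile(pattern)

    def matches(self, line: str) -> bool:
        # Re-uses the compiled pattern on every call.
        return self._compiled.match(line) is not None


# Six characters of sequence area, then "D" or "d" in column 7.
DEBUG_LINE = RegexSelector(r".{6}[Dd]")
```

A gnome registering `DEBUG_LINE` would then only be woken for lines such as `000100D    DISPLAY 'trace'`, and never for ordinary lines.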

KTSnowy May 2, 2023
Maintainer

it may be important to automatically add entries in WORKING-STORAGE; doing so by telling Otterkit to "now parse this COBOL source file and add it to the program" would be useful.

We could probably do this through a preprocessor hook for gnomes, to avoid backtracking into the preprocessor and lexer during parsing.

We can enable gnomes to insert the contents of a copybook file (or a hardcoded multiline string in the gnome code) into specific places like the sections in the data division, or inside a certain section or paragraph name in the procedure division (name specified by the gnome implementor).

This can be mostly done at the preprocessor stage (for the data division), and would enable gnomes to use the same code Otterkit uses to handle copybooks but without having to write a COPY statement.

Maybe a "before . . . section" hook that can be used to insert new code before Otterkit continues parsing it?
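To illustrate the preprocessor-level insertion, including the "only if it isn't already in" condition Simon asked about, here is a text-based Python sketch. The function name, the substring-based marker check, and the sample program are all assumptions made for the example; a real hook would work on Otterkit's own preprocessor data rather than raw lines.

```python
def insert_into_working_storage(lines, copybook_lines, marker):
    """Insert copybook_lines right after the WORKING-STORAGE SECTION header.

    Does nothing if `marker` (a name the copybook defines) already appears,
    whether it was added manually or by another gnome.
    """
    if any(marker in line for line in lines):
        return list(lines)  # already included, nothing to do
    for i, line in enumerate(lines):
        if line.strip().upper().startswith("WORKING-STORAGE SECTION"):
            return lines[: i + 1] + copybook_lines + lines[i + 1:]
    raise ValueError("no WORKING-STORAGE SECTION found")


PROGRAM = [
    "DATA DIVISION.",
    "WORKING-STORAGE SECTION.",
    "01 WS-A PIC X.",
    "PROCEDURE DIVISION.",
    "    SHOW WS-A.",
]
COPYBOOK = ["01 GNOME-BUFFER PIC X(80)."]

ONCE = insert_into_working_storage(PROGRAM, COPYBOOK, "GNOME-BUFFER")
TWICE = insert_into_working_storage(ONCE, COPYBOOK, "GNOME-BUFFER")
```

Because the check makes the operation idempotent, `TWICE` is identical to `ONCE`, so running the hook again on already-processed source would not insert the copybook a second time.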

GitMensch May 2, 2023

The gnome would only be activated "down in PROCEDURE DIVISION", so that won't work as-is; but the gnome could, if the API allows this, check for the copybook and, if it isn't in, tell Otterkit to insert it and re-run the preprocess state (on the second call it will then find everything set up and won't trigger the "get back").

gabrielesilinic May 2, 2023
Collaborator Author

@KTSnowy the only "issue" Simon was probably talking about is that, in order to implement its features, a gnome will need to add its stuff to the working storage and possibly some files as well. We don't have to do anything fancy: just add a dedicated API that defines the data it has to add, and then you can probably inject it into the syntax tree without thinking about parsing. The only thing is that I'd like gnomes to support a mode where they just transpile instead of compiling, but it's probably not that terribly difficult to rebuild code from the syntax tree anyway.

KTSnowy May 2, 2023
Maintainer

but the gnome could, if the API allows this, check for the copybook and, if it isn't in, tell Otterkit to insert it and re-run the preprocess state

The gnome will have to know where the section to insert is defined in the syntax tree. So it can jump to it and avoid traversing all of the AST again to find the section (and then back to the procedure).

We should add some internal properties to the parser that keep track of the position of these sections (like the data division ones) in the AST for the current callable source unit. Then expose these in a safe way through the gnomes API.

I'd like gnomes to support a mode where they just transpile instead of compiling, but it's probably not that terribly difficult to rebuild code from the syntax tree anyway

Yeah this feature won't be too difficult I think, we already do it with copybooks. It technically transpiles a COPY statement into source code from a copybook (removing the COPY statement).

So we could write similar code to make a gnome replace a non-standard snippet of COBOL (defined by the gnome implementor) with a standard equivalent during parsing.

Only issue could be figuring out an API for people to use this safely from a gnome without accidentally corrupting or breaking the AST.
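One possible shape for such a safe API, sketched in Python with made-up names (`Statement`, `replace_statement`, and a tiny `STANDARD_VERBS` subset): the gnome never mutates the tree directly. It hands over replacement statements, which are validated first, and a new tree is returned only if every one of them passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """Hypothetical immutable AST node for a single statement."""
    verb: str
    operands: tuple


# Tiny illustrative subset, not the real list of standard verbs.
STANDARD_VERBS = frozenset({"DISPLAY", "MOVE", "ADD"})


def replace_statement(tree, index, replacement):
    """Return a new tree with tree[index] swapped for the replacement statements.

    The original tree is left untouched, so a failed validation cannot
    leave the AST half-modified.
    """
    if not 0 <= index < len(tree):
        raise IndexError("statement index out of range")
    for stmt in replacement:
        if stmt.verb not in STANDARD_VERBS:
            raise ValueError(f"{stmt.verb} is not a standard statement")
    return tree[:index] + tuple(replacement) + tree[index + 1:]


TREE = (
    Statement("SHOW", ("WS-A",)),  # made-up non-standard statement
    Statement("MOVE", ("WS-A", "TO", "WS-B")),
)
FIXED = replace_statement(TREE, 0, [Statement("DISPLAY", ("WS-A",))])
```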

GitMensch · 2023-05-02T16:48:11Z

GitMensch
May 2, 2023

Just mentioning: you may want to come up with a VS Code-compatible language server based on Otterkit, which would also handle the gnomes...
Note that these language servers can also be used with other editors like Vim, Emacs, ...

4 replies

KTSnowy May 2, 2023
Maintainer

We will, we have a VSCode extension but right now it doesn't have a language server (or COBOL syntax highlighting). Currently it's only used for handling the package.otterproj project files.

These are basically JSON files that keep track of a COBOL project's settings and dependencies.

Otterkit will read the file to fetch build settings if they haven't been passed as command line arguments. So you can run otterkit build without having to specify the entry point and source format every time.

Oh also, we might need your feedback and suggestions on how the user-facing part of the gnomes should look.

For example, should we expose C# APIs to write the gnomes? Or should we expose a sort of COBOL-like DSL that Otterkit can precompile for the gnomes?

We're not sure which way would be more "ergonomic" for the COBOL ecosystem in general or which one people would prefer.

GitMensch May 2, 2023

I guess a .NET API with examples and empty skeletons will likely best match Otterkit. But I'm not a common COBOL user :-)

Best would be to do some POC (actually pseudo code would be enough) with some variants then ask users - you may also use the GnuCOBOL Lounge for that.

gabrielesilinic May 2, 2023
Collaborator Author

@KTSnowy since we wanted to make COBOL modern and standard, do we REALLY need to make and maintain a dialect of it ONLY for gnomes? It would make writing gnomes decently uncomfortable, since companies and people probably won't bother to learn a domain-specific language that does not even have a very nice syntax. It would also make our job more difficult implementation-wise, and in the end it will still be C#, because it cannot become JSON due to the complexity of a gnome.

Like, it's fine and reasonable if we offer COBOL bindings for making a gnome, but let's really not make gnomes another COBOL dialect unless we have to or we find a significant advantage in doing so.

KTSnowy May 2, 2023
Maintainer

Yeah that's true. We can make them in C# and also make COBOL bindings available (With the interop library later). Only issue would be finding a way to load them, hmmmm.

That message passing system is looking like a possible solution to "dynamically loading gnomes". Just make them a COBOL message server with a strictly defined API.


Replies: 2 comments 11 replies
