Support composable codemods #63

drdavella · 2023-10-04T19:44:41Z

Overview

Correctly support multiple codemod composition

Description

We need to correctly support the application of multiple codemods.
We now invoke semgrep once per codemod and apply the results of each codemod to the source tree.
This means the updated source tree state is now used as input to each subsequent semgrep/codemod invocation.
This comes with a performance penalty, but it's the only way to handle multiple codemods correctly.
It is now important to respect the order in which codemods are given using --codemod-include.
This solution does not account for the possibility of executing multiple codemods with externally generated results (e.g. from CodeQL or other rule providers). However we don't currently have any such codemods so the impact is nil.
This PR also involves some refactoring. I still view the CodemodExecutorWrapper as an intermediate workaround until we revisit the codemod class hierarchy
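
As a rough illustration of the composition described above (not the actual codemodder implementation), the sketch below applies codemods in the order given and writes each result to disk, so the next codemod reads the updated state. The `Codemod` alias, `apply_in_order`, `demo`, and the two toy codemods are hypothetical names introduced only for this example.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

# Hypothetical model: a codemod is a function from source text to source text.
Codemod = Callable[[str], str]


def apply_in_order(path: Path, codemods: List[Codemod]) -> None:
    """Apply each codemod in the given order, writing to disk after each one."""
    for codemod in codemods:
        source = path.read_text()
        updated = codemod(source)
        if updated != source:
            # The next codemod (or an external tool) now sees the updated file.
            path.write_text(updated)


def use_https(source: str) -> str:
    """Toy codemod: upgrade http URLs to https."""
    return source.replace("http://", "https://")


def add_timeout(source: str) -> str:
    """Toy codemod: only matches calls that already use https."""
    return source.replace(
        'get("https://example.com")',
        'get("https://example.com", timeout=60)',
    )


def demo(codemods: List[Codemod]) -> str:
    """Run the given codemods against a throwaway file and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "app.py"
        target.write_text('requests.get("http://example.com")\n')
        apply_in_order(target, codemods)
        return target.read_text()
```

With `[use_https, add_timeout]` both changes land, while `[add_timeout, use_https]` only upgrades the URL, because the second toy codemod never sees the pattern it looks for. That is the sense in which the given order matters.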

Future Work

Clean up the class hierarchy to support a cleaner codemod execution API
The --dry-run option should operate on a copy of the entire source tree since right now it doesn't work correctly with multiple codemods (not currently a problem for the platform)
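
One possible shape for the dry-run idea above, sketched under assumptions: copy the source tree to a temporary directory, let the codemods write freely there, and report what differs. The `dry_run` function and its `run_codemods` callback are hypothetical and are not part of the codemodder API.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def dry_run(source_root: Path, run_codemods: Callable[[Path], None]) -> Dict[str, str]:
    """Run codemods against a throwaway copy and return the files that would change.

    Keys are paths relative to the source root; values are the would-be contents.
    """
    changed: Dict[str, str] = {}
    with tempfile.TemporaryDirectory() as tmp:
        copy_root = Path(tmp) / "tree"
        shutil.copytree(source_root, copy_root)
        # Codemods may write to disk between steps; the originals stay untouched.
        run_codemods(copy_root)
        for copied in copy_root.rglob("*.py"):
            relative = copied.relative_to(copy_root)
            original = source_root / relative
            new_text = copied.read_text()
            # Assumes codemods only modify existing files and never create new ones.
            if original.read_text() != new_text:
                changed[relative.as_posix()] = new_text
    return changed
```

Copying the whole tree costs time and disk space on large repositories, so this is a simplicity-first sketch rather than a tuned design.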

clavedeluna · 2023-10-05T10:54:38Z

integration_tests/test_multiple_codemods.py

+        completed_process = subprocess.run(
+            command,
+            check=False,
+        )


It would be nice if you could reuse some of the functionality from the base integration test, since it's almost the same. Also, I think it's worth calling self.check_code_before() as a sanity check.

I did look at doing that but the base integration test encodes a fairly strong assumption about one codemod per test. If I inherit from the base integration test it means this test inherits the test_* method which won't work for this particular case.

It could also mean moving some of the methods into a mixin but it's fine.

andrecsilva

The whole flow of execution for multiple codemods feels wrong.

There are three straightforward strategies to run multiple codemods:

The output tree of a codemod is the input of the next;
The output code of a codemod is parsed into a new tree which is the input of the next codemod;
We rewrite the file with the output code which is then read and parsed into a tree for the next codemod.

These are ordered by speed. We adopt the third one (which is the reason --dry-run won't work). I see no reason why we just don't go with the second. The first one is the best, of course, but it relies on each codemod outputting "valid" trees (there is at least one codemod that I know does not).
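
To make the three strategies concrete, here is a minimal sketch that uses the standard-library `ast` module as a stand-in for libcst (unlike libcst, `ast` does not preserve comments or formatting, so this only illustrates the control flow). The `Rename` transformer and the three `strategy_*` functions are hypothetical names for this example.

```python
import ast
import tempfile
from pathlib import Path


class Rename(ast.NodeTransformer):
    """Toy codemod: rename one identifier to another."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node


def strategy_tree_to_tree(source: str, codemods) -> str:
    """Strategy 1: the output tree of one codemod is the input of the next."""
    tree = ast.parse(source)
    for codemod in codemods:
        tree = codemod.visit(tree)
    return ast.unparse(tree)


def strategy_reparse_in_memory(source: str, codemods) -> str:
    """Strategy 2: the output code is parsed into a fresh tree for the next codemod."""
    for codemod in codemods:
        source = ast.unparse(codemod.visit(ast.parse(source)))
    return source


def strategy_rewrite_file(path: Path, codemods) -> str:
    """Strategy 3: write the output to the file, then read and parse it again."""
    for codemod in codemods:
        tree = codemod.visit(ast.parse(path.read_text()))
        path.write_text(ast.unparse(tree))
    return path.read_text()


def compare(source: str):
    """Run all three strategies on the same input and return their outputs."""
    codemods = [Rename("foo", "bar"), Rename("bar", "baz")]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "module.py"
        path.write_text(source)
        return (
            strategy_tree_to_tree(source, codemods),
            strategy_reparse_in_memory(source, codemods),
            strategy_rewrite_file(path, codemods),
        )
```

For well-behaved codemods like these, all three strategies produce the same code. They differ in how much parsing and file I/O they do, and in whether a codemod that emits an invalid tree can affect the next one.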

clavedeluna · 2023-10-05T11:57:40Z

The output tree of a codemod is the input of the next;

this is what I assumed we wanted to happen which is why I asked for clarification

drdavella · 2023-10-05T13:19:08Z

@andrecsilva @clavedeluna what you are describing is a potential optimization that is only possible if certain conditions hold. Remember, semgrep has no access to the parsed tree: it can only read files on disk. So no matter what, the result of any one codemod needs to be written to disk in order to be visible to the next semgrep invocation.

Now, we could still possibly keep the updated tree in memory and feed that to the next codemod, but only if the tree represents the updated metadata (i.e. line number, column) exactly the way that it is represented on disk. This is because we need to map any new semgrep results from the following codemod onto the updated tree. This is probably the case in libcst but I am not entirely sure (maybe @andrecsilva can confirm).

Even with that, there's another problem: storing parsed tree representations of the entire project can be very memory intensive, especially for very large projects. The Java codemodder has already encountered OOM issues with a similar strategy on very large repositories, so I would hesitate to implement any such optimization without a more detailed investigation of the tradeoffs.

While it may not be the most efficient strategy to parse each file multiple times, I would prefer to revisit any possible optimizations in a future PR.

andrecsilva · 2023-10-05T13:58:53Z

For the semgrep issue, you can still run semgrep on a temp file on a by-need basis. That is, if we are running a semgrep codemod, we dump the tree into a temp file and parse results with semgrep. Otherwise, just pass the tree along.

As long as you're running the codemods one file at a time (that is, pick a file, run all codemods for that file, never touch that file again), I don't see why memory would be an issue.

drdavella · 2023-10-05T14:07:30Z

As long as you're running the codemods one file at a time (that is, pick a file, run all codemods for that file, never touch that file again), I don't see why memory would be an issue.

This would mean we would need to run semgrep once per file rather than once per codemod.

andrecsilva · 2023-10-05T14:31:54Z

No, we are running semgrep for each codemod.

The codemod itself will run semgrep, but not on the original file. As I said, for a given tree T passed as input to a semgrep codemod c, transform T into code (using the .code member in libcst), dump it into a temp file, run semgrep on that temp file, then pass c(T) to the next codemod.

The point I wanted to make with the quoted paragraph is that you don't really need to keep a tree for every file in memory at once, just the tree for a single file.

codecov-commenter · 2023-10-05T14:38:56Z

Codecov Report

Merging #63 (674961e) into main (0790cc2) will decrease coverage by 0.51%.
The diff coverage is 93.93%.

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   96.32%   95.81%   -0.51%     
==========================================
  Files          44       45       +1     
  Lines        1660     1674      +14     
==========================================
+ Hits         1599     1604       +5     
- Misses         61       70       +9

Files	Coverage Δ
src/codemodder/cli.py	`100.00% <100.00%> (ø)`
src/codemodder/codemodder.py	`96.34% <100.00%> (ø)`
src/codemodder/codemods/base_codemod.py	`100.00% <100.00%> (ø)`
src/codemodder/context.py	`100.00% <100.00%> (ø)`
src/codemodder/file_context.py	`100.00% <100.00%> (ø)`
src/codemodder/registry.py	`93.65% <100.00%> (-2.10%)`	⬇️
src/codemodder/semgrep.py	`95.45% <80.00%> (-0.85%)`	⬇️
src/codemodder/executor.py	`89.36% <89.36%> (ø)`

... and 1 file with indirect coverage changes

drdavella · 2023-10-05T14:45:26Z

@andrecsilva it's possible that I'm misunderstanding something but you said this:

pick a file, run all codemods for that file, never touch that file again

In order to run all codemods for that file, I would need to run semgrep for each applicable codemod on each file. Our current loop looks something like this:

for codemod in requested_codemods:
    codemod.maybe_apply_semgrep()
    for file in requested_files:
        codemod.make_changes()

But it seems like you are suggesting something like this:

for file in requested_files:
    # This means I never need to look at this file again and can potentially cache the tree
    for codemod in requested_codemods:
        codemod.maybe_apply_semgrep()
        codemod.make_changes()

There are probably some clever ways to get around this constraint. And you are right that we could potentially use tempfiles only for modified files, but only if we run semgrep once per file.

There is of course an implicit tradeoff here, but it seems obvious (but also possibly wrong) to me that running semgrep for every file is going to be more expensive than the file I/O we perform currently. But the balance may tip in the other direction if we eventually implement more codemods purely in terms of libcst.

andrecsilva · 2023-10-05T16:48:34Z

Below is a more detailed take of what I'm suggesting.

import libcst as cst

for file in requested_files:
    # This means I never need to look at this file again
    with open(file) as f:
        tree = cst.parse_module(f.read())
    for codemod in requested_codemods:
        results = run_semgrep(codemod.YAML_FILES, tree) if codemod.is_semgrep else {}
        tree = codemod.make_changes(tree, results)
    # the final tree will be written to file (opened in write mode)
    with open(file, "w") as f:
        f.write(tree.code)

where run_semgrep would look like:

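from tempfile import NamedTemporaryFile

def run_semgrep(yaml_files, tree):
    # could be a common dir for the whole codemodder run
    temp_dir = ...
    # text mode, since tree.code is a str
    temp_file = NamedTemporaryFile(mode="w", dir=temp_dir)
    temp_file.write(tree.code)
    # flush so that semgrep sees the contents on disk
    temp_file.flush()
    return run_on_directory(yaml_files, temp_dir)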

This way we only run semgrep / write to disk as many times as there are semgrep codemods. The way it's done right now means we write to disk after every codemod.

There are probably some clever ways to get around this constraint. And you are right that we could potentially use tempfiles only for modified files, but only if we run semgrep once per file.

Why? I don't understand why you think we can do that only if we run semgrep once per file.

drdavella · 2023-10-05T17:15:42Z

@andrecsilva yes but that's exactly my point: the semgrep invocation in your example is now within the inner loop which means it is getting invoked num_files * num_codemods times in the limit rather than just num_codemods times as we do currently.
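
The difference in invocation counts between the two loop orders can be checked with a small counting sketch. `count_invocations` is a hypothetical helper that models only how often semgrep is launched, not how long each run takes.

```python
from typing import Tuple


def count_invocations(num_files: int, num_semgrep_codemods: int) -> Tuple[int, int]:
    """Return semgrep launches for (codemod-outer loop, file-outer loop)."""
    # Current approach: one semgrep run per codemod covers every file.
    codemod_outer = 0
    for _codemod in range(num_semgrep_codemods):
        codemod_outer += 1
        for _file in range(num_files):
            pass  # apply that codemod's results to each file

    # Proposed approach: semgrep runs inside the per-file loop.
    file_outer = 0
    for _file in range(num_files):
        for _codemod in range(num_semgrep_codemods):
            file_outer += 1

    return codemod_outer, file_outer
```

For example, 1000 files and 5 semgrep codemods give 5 launches in the first order and 5000 in the second. Whether the extra process startups outweigh the saved file writes is the open question in this thread.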

andrecsilva · 2023-10-05T18:06:26Z

In the following snippet:

for codemod in requested_codemods:
    codemod.maybe_apply_semgrep()
    for file in requested_files:
        codemod.make_changes()

Semgrep is called once per codemod, true. But you still have to run each codemod rule for each file in the directory (otherwise, how are you getting the results for that file?).

Surely in my proposed solution there are more calls to semgrep, but the number of times a particular rule is run is the same (once per codemod x file). Yes, there is extra overhead from calling semgrep more times, but we have fewer file writes. Moreover, this overhead and the writes are now proportional to the number of semgrep codemods.

I still think it's a net positive, especially since we expect to decrease the number of codemods that rely on semgrep.

drdavella · 2023-10-05T18:46:14Z

I still think it's a net positive

You may be right but I'm less certain. In either case, this is something that could be answered empirically. I'd like to suggest that this kind of optimization is probably best undertaken as a separate ticket/effort so that we don't block the availability of this feature in the meantime.

clavedeluna · 2023-10-06T11:17:38Z

are you able to see the files in codecov? I just see No Files covered by tests were changed which seems off. I'm logged in but idk if it's an account thing

drdavella · 2023-10-06T13:27:12Z

@clavedeluna I'm seeing the same thing. I'm hoping we can just ignore the (small) coverage drop for now and address it later.

@andrecsilva you requested changes on this PR but I'm hoping that after our discussion we can defer any potential optimization to a future PR.

andrecsilva

As long as you understand the highlighted caveats, I'm ok approving this.

drdavella added 2 commits October 4, 2023 13:21

Rename __main__.py -> codemodder.py. Remove callable module

5bf76de

Better error message when directory doesn't exist

d6538f0

drdavella marked this pull request as ready for review October 4, 2023 20:22

drdavella requested review from clavedeluna and andrecsilva as code owners October 4, 2023 20:22

clavedeluna requested changes Oct 5, 2023

andrecsilva requested changes Oct 5, 2023

drdavella requested a review from andrecsilva October 5, 2023 13:23

drdavella force-pushed the support-composable-codemods branch from f3dc7d9 to 674961e Compare October 5, 2023 14:29

drdavella requested a review from clavedeluna October 5, 2023 14:38

drdavella added 2 commits October 5, 2023 11:51

Run semgrep once per codemod; enable codemod composition

9b4a0ec

Respect given order of included codemods

f27f98f

drdavella force-pushed the support-composable-codemods branch from 674961e to f27f98f Compare October 5, 2023 15:58

clavedeluna approved these changes Oct 6, 2023

andrecsilva approved these changes Oct 6, 2023

drdavella merged commit 4883916 into main Oct 6, 2023
6 checks passed

drdavella deleted the support-composable-codemods branch October 6, 2023 13:44
