Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support composable codemods #63

Merged
merged 4 commits into from
Oct 6, 2023
Merged

Support composable codemods #63

merged 4 commits into from
Oct 6, 2023

Conversation

drdavella
Copy link
Member

@drdavella drdavella commented Oct 4, 2023

Overview

Correctly support multiple codemod composition

Description

  • We need to correctly support the application of multiple codemods
  • We now invoke semgrep once per codemod and apply the results of each codemod to the source tree
  • This means the updated source tree state is now used as input to each subsequent semgrep/codemod invocation
  • This comes with a performance penalty but it's the only way to handle multiple codemods correctly
  • It is now important to respect the order in which codemods are given using --codemod-include
  • This solution does not account for the possibility of executing multiple codemods with externally generated results (e.g. from CodeQL or other rule providers). However we don't currently have any such codemods so the impact is nil.
  • This PR also involves some refactoring. I still view the CodemodExecutorWrapper as an intermediate workaround until we revisit the codemod class hierarchy

Future Work

  • Clean up the class hierarchy to support a cleaner codemod execution API
  • The --dry-run option should operate on a copy of the entire source tree since right now it doesn't work correctly with multiple codemods (not currently a problem for the platform)

src/codemodder/cli.py Outdated Show resolved Hide resolved
@drdavella drdavella marked this pull request as ready for review October 4, 2023 20:22
integration_tests/test_url_sandbox.py Show resolved Hide resolved
completed_process = subprocess.run(
command,
check=False,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if you could re-use some of the functionality from base integration test since it's almost the same. Also I think it's worth calling self.check_code_before() for sanity checking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did look at doing that but the base integration test encodes a fairly strong assumption about one codemod per test. If I inherit from the base integration test it means this test inherits the test_* method which won't work for this particular case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also mean moving some of the methods into a mixin but it's fine.

src/codemodder/cli.py Outdated Show resolved Hide resolved
src/codemodder/codemods/base_codemod.py Show resolved Hide resolved
src/codemodder/executor.py Show resolved Hide resolved
src/codemodder/semgrep.py Outdated Show resolved Hide resolved
src/codemodder/codemodder.py Outdated Show resolved Hide resolved
src/codemodder/codemodder.py Show resolved Hide resolved
Copy link
Contributor

@andrecsilva andrecsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole flow of execution for multiple codemods feels wrong.

There are three straightforward strategies to run multiple codemods:

  1. The output tree of a codemod is the input of the next;
  2. The output code of a codemod is parsed into a new tree which is the input of the next codemod;
  3. We rewrite the file with the output code which is then read and parsed into a tree for the next codemod.

These are ordered by speed. We adopt the third one (which is the reason --dry-run won't work). I see no reason why we just don't go with the second. The first one is the best, of course, but it relies on each codemod outputting "valid" trees (there is at least one codemod that I know does not).

@clavedeluna
Copy link
Contributor

  1. The output tree of a codemod is the input of the next;

this is what I assumed we wanted to happen which is why I asked for clarification

@drdavella
Copy link
Member Author

drdavella commented Oct 5, 2023

@andrecsilva @clavedeluna what you are describing is a potential optimization that is only possible if certain conditions hold. Remember, semgrep has no access to the parsed tree: it can only read files on disk. So no matter what, the result of any one codemod needs to be written to disk in order to be visible to the next semgrep invocation.

Now, we could still possibly keep the updated tree in memory and feed that to the next codemod, but only if the tree represents the updated metadata (i.e. line number, column) exactly the way that it is represented on disk. This is because we need to map any new semgrep results from the following codemod onto the updated tree. This is probably the case in libcst but I am not entirely sure (maybe @andrecsilva can confirm).

Even with that, there's another problem: storing parsed tree representations of the entire project can be very memory intensive, especially for very large projects. The Java codemodder has already encountered OOM issues with a similar strategy on very large repositories, so I would hesitate to implement any such optimization without a more detailed investigation of the tradeoffs.

While it may not be the most efficient strategy to parse each file multiple times, I would prefer to revisit any possible optimizations in a future PR.

@drdavella drdavella requested a review from andrecsilva October 5, 2023 13:23
@andrecsilva
Copy link
Contributor

@andrecsilva @clavedeluna what you are describing is a potential optimization that is only possible if certain conditions hold. Remember, semgrep has no access to the parsed tree: it can only read files on disk. So no matter what, the result of any one codemod needs to be written to disk in order to be visible to the next semgrep invocation.

Now, we could still possibly keep the updated tree in memory and feed that to the next codemod, but only if the tree represents the updated metadata (i.e. line number, column) exactly the way that it is represented on disk. This is because we need to map any new semgrep results from the following codemod onto the updated tree. This is probably the case in libcst but I am not entirely sure (maybe @andrecsilva can confirm).

Even with that, there's another problem: storing parsed tree representations of the entire project can be very memory intensive, especially for very large projects. The Java codemodder has already encountered OOM issues with a similar strategy on very large repositories, so I would hesitate to implement any such optimization without a more detailed investigation of the tradeoffs.

While it may not be the most efficient strategy to parse each file multiple times, I would prefer to revisit any possible optimizations in a future PR.

For the semgrep issue, you can still run semgrep on a temp file on a by-need basis. That is, if we are running a semgrep codemod, we dump the tree into a temp file and parse results with semgrep. Otherwise, just pass the tree along.

As long as you're running the codemods one file at a time (that is, pick a file, run all codemods for that file, never touch that file again), I don't see why memory would be an issue.

@drdavella
Copy link
Member Author

As long as you're running the codemods one file at a time (that is, pick a file, run all codemods for that file, never touch that file again), I don't see why memory would be an issue.

This would mean we would need to run semgrep once per file rather than once per codemod.

@drdavella drdavella force-pushed the support-composable-codemods branch from f3dc7d9 to 674961e Compare October 5, 2023 14:29
@andrecsilva
Copy link
Contributor

As long as you're running the codemods one file at a time (that is, pick a file, run all codemods for that file, never touch that file again), I don't see why memory would be an issue.

This would mean we would need to run semgrep once per file rather than once per codemod.

No, we are running semgrep for each codemod.

The codemod itself will run semgrep, but not for the original file. As I said, for a given tree T passed as an input for a semgrep codemod c, transform the T into code (use .code member in libcst), dump into a temp file, run semgrep on that temp file, then pass c(T) to the next codemod.

The point I wanted to make with the quoted paragraph is that you don't really need to keep a tree for every file in memory at once, just the a tree for a single file.

@drdavella drdavella requested a review from clavedeluna October 5, 2023 14:38
@codecov-commenter
Copy link

codecov-commenter commented Oct 5, 2023

Codecov Report

Merging #63 (674961e) into main (0790cc2) will decrease coverage by 0.51%.
The diff coverage is 93.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   96.32%   95.81%   -0.51%     
==========================================
  Files          44       45       +1     
  Lines        1660     1674      +14     
==========================================
+ Hits         1599     1604       +5     
- Misses         61       70       +9     
Files Coverage Δ
src/codemodder/cli.py 100.00% <100.00%> (ø)
src/codemodder/codemodder.py 96.34% <100.00%> (ø)
src/codemodder/codemods/base_codemod.py 100.00% <100.00%> (ø)
src/codemodder/context.py 100.00% <100.00%> (ø)
src/codemodder/file_context.py 100.00% <100.00%> (ø)
src/codemodder/registry.py 93.65% <100.00%> (-2.10%) ⬇️
src/codemodder/semgrep.py 95.45% <80.00%> (-0.85%) ⬇️
src/codemodder/executor.py 89.36% <89.36%> (ø)

... and 1 file with indirect coverage changes

@drdavella
Copy link
Member Author

drdavella commented Oct 5, 2023

@andrecsilva it's possible that I'm misunderstanding something but you said this:

pick a file, run all codemods for that file, never touch that file again

In order to run all codemods for that file, I would need to run semgrep for each applicable codemod on each file. Our current loop looks something like this:

for codemod in requested_codemods:
    codemod.maybe_apply_semgrep()
    for file in requested_files:
        codemod.make_changes()

But it seems like you are suggesting something like this:

for file in requested_files:
    # This means I never need to look at this file again and can potentially cache the tree
    for codemod in requested_codemods:
        codemod.maybe_apply_semgrep()
        codemod.make_changes()

There are probably some clever ways to get around this contraint. And you are right that we could potentially use tempfiles only for modified files but only if we run semgrep once per file.

There is of course an implicit tradeoff here, but it seems obvious (but also possibly wrong) to me that running semgrep for every file is going to be more expensive than the file I/O we perform currently. But the balance may tip in the other direction if we eventually implement more codemods purely in terms of libcst.

@drdavella drdavella force-pushed the support-composable-codemods branch from 674961e to f27f98f Compare October 5, 2023 15:58
@andrecsilva
Copy link
Contributor

@andrecsilva it's possible that I'm misunderstanding something but you said this:

pick a file, run all codemods for that file, never touch that file again

In order to run all codemods for that file, I would need to run semgrep for each applicable codemod on each file. Our current loop looks something like this:

for codemod in requested_codemods:
    codemod.maybe_apply_semgrep()
    for file in requested_files:
        codemod.make_changes()

But it seems like you are suggesting something like this:

for file in requested_files:
    # This means I never need to look at this file again and can potentially cache the tree
    for codemod in requested_codemods:
        codemod.maybe_apply_semgrep()
        codemod.make_changes()

Below is a more detailed take of what I'm suggesting.

for file in requested_files:
    # This means I never need to look at this file again
    tree = cst.parse_module(open(file).read())
    for codemod in requested_codemods:
        results =  run_semgrep(codemod.YAML_FILES,tree) if codemod.is_semgrep else {}
        tree = codemod.make_changes(tree, results)
    # the final tree will be written to file
    open(file).write(tree.code)

where run_semgrep would look like:

def run_semgrep(yaml_files, tree):
     # could be a common dir for the whole codemodder run
     temp_dir = ...
     temp_file = NamedTemporaryFile(..., dir = temp_dir)
     temp_file.write(tree.code)
     return run_on_directory(yaml_files, temp_dir)

This way we only run semgrep / write to disk as many times as there are semgrep codemods. The way it's done right now means we write to disk after every codemod.

There are probably some clever ways to get around this contraint. And you are right that we could potentially use tempfiles only for modified files but only if we run semgrep once per file.

Why? I don't understand why you think can only do that only if we run semgrep once per file.

@drdavella
Copy link
Member Author

drdavella commented Oct 5, 2023

@andrecsilva yes but that's exactly my point: the semgrep invocation in your example is now within the inner loop which means it is getting invoked num_files * num_codemods times in the limit rather than just num_codemods times as we do currently.

@andrecsilva
Copy link
Contributor

@andrecsilva yes but that's exactly my point: the semgrep invocation in your example is now within the inner loop which means it is getting invoked num_files * num_codemods times in the limit rather than just num_codemods times as we do currently.

In the following snippet:

for codemod in requested_codemods:
    codemod.maybe_apply_semgrep()
    for file in requested_files:
        codemod.make_changes()

Semgrep is called once per codemod, true. But you still have to run each codemod rule for each file in the directory (otherwise, how are you getting the results for that file?).

Surely in my proposed solution there are more calls to semgrep, but the number of times a particular rule is ran is the same (once per codemod x file). Yes, there is an extra overhead because in calling semgrep more times, but we have less file writes. Moreover this overhead and the writes are now proportional to the number of semgrep codemods.

I still think it's a net positive, specially since we expect to decrease the number of codemods that rely on semgrep.

@drdavella
Copy link
Member Author

drdavella commented Oct 5, 2023

I still think it's a net positive

You may be right but I'm less certain. In either case, this is something that could be answered empirically. I'd like to suggest that this kind of optimization is probably best undertaken as a separate ticket/effort so that we don't block the availability of this feature in the meantime.

@clavedeluna
Copy link
Contributor

Codecov Report

Merging #63 (674961e) into main (0790cc2) will decrease coverage by 0.51%.
The diff coverage is 93.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #63      +/-   ##
==========================================
- Coverage   96.32%   95.81%   -0.51%     
==========================================
  Files          44       45       +1     
  Lines        1660     1674      +14     
==========================================
+ Hits         1599     1604       +5     
- Misses         61       70       +9     

Files Coverage Δ
src/codemodder/cli.py 100.00% <100.00%> (ø)
src/codemodder/codemodder.py 96.34% <100.00%> (ø)
src/codemodder/codemods/base_codemod.py 100.00% <100.00%> (ø)
src/codemodder/context.py 100.00% <100.00%> (ø)
src/codemodder/file_context.py 100.00% <100.00%> (ø)
src/codemodder/registry.py 93.65% <100.00%> (-2.10%) ⬇️
src/codemodder/semgrep.py 95.45% <80.00%> (-0.85%) ⬇️
src/codemodder/executor.py 89.36% <89.36%> (ø)
... and 1 file with indirect coverage changes

are you able to see the files in codecov? I just see No Files covered by tests were changed which seems off. I'm logged in but idk if it's an account thing

@drdavella
Copy link
Member Author

@clavedeluna I'm seeing the same thing. I'm hoping we can just ignore the (small) coverage drop for now and address it later.

@andrecsilva you requested changes on this PR but I'm hoping that after our discussion we can defer any potential optimization to a future PR.

Copy link
Contributor

@andrecsilva andrecsilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you understand the highlighted caveats, I'm ok approving this.

@drdavella drdavella merged commit 4883916 into main Oct 6, 2023
6 checks passed
@drdavella drdavella deleted the support-composable-codemods branch October 6, 2023 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants