-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support composable codemods #63
Conversation
completed_process = subprocess.run( | ||
command, | ||
check=False, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice if you could re-use some of the functionality from base integration test since it's almost the same. Also I think it's worth calling self.check_code_before()
for sanity checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did look at doing that but the base integration test encodes a fairly strong assumption about one codemod per test. If I inherit from the base integration test it means this test inherits the test_*
method which won't work for this particular case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could also mean moving some of the methods into a mixin but it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole flow of execution for multiple codemods feels wrong.
There are three straightforward strategies to run multiple codemods:
- The output tree of a codemod is the input of the next;
- The output code of a codemod is parsed into a new tree which is the input of the next codemod;
- We rewrite the file with the output code which is then read and parsed into a tree for the next codemod.
These are ordered by speed. We adopt the third one (which is the reason --dry-run
won't work). I see no reason why we just don't go with the second. The first one is the best, of course, but it relies on each codemod outputting "valid" trees (there is at least one codemod that I know does not).
this is what I assumed we wanted to happen which is why I asked for clarification |
@andrecsilva @clavedeluna what you are describing is a potential optimization that is only possible if certain conditions hold. Remember, semgrep has no access to the parsed tree: it can only read files on disk. So no matter what, the result of any one codemod needs to be written to disk in order to be visible to the next semgrep invocation. Now, we could still possibly keep the updated tree in memory and feed that to the next codemod, but only if the tree represents the updated metadata (i.e. line number, column) exactly the way that it is represented on disk. This is because we need to map any new semgrep results from the following codemod onto the updated tree. This is probably the case in libcst but I am not entirely sure (maybe @andrecsilva can confirm). Even with that, there's another problem: storing parsed tree representations of the entire project can be very memory intensive, especially for very large projects. The Java codemodder has already encountered OOM issues with a similar strategy on very large repositories, so I would hesitate to implement any such optimization without a more detailed investigation of the tradeoffs. While it may not be the most efficient strategy to parse each file multiple times, I would prefer to revisit any possible optimizations in a future PR. |
For the semgrep issue, you can still run semgrep on a temp file on a by-need basis. That is, if we are running a semgrep codemod, we dump the tree into a temp file and parse results with semgrep. Otherwise, just pass the tree along. As long as you're running the codemods one file at a time (that is, pick a file, run all codemods for that file, never touch that file again), I don't see why memory would be an issue. |
This would mean we would need to run semgrep once per file rather than once per codemod. |
f3dc7d9
to
674961e
Compare
No, we are running semgrep for each codemod. The codemod itself will run semgrep, but not for the original file. As I said, for a given tree The point I wanted to make with the quoted paragraph is that you don't really need to keep a tree for every file in memory at once, just the a tree for a single file. |
Codecov Report
@@ Coverage Diff @@
## main #63 +/- ##
==========================================
- Coverage 96.32% 95.81% -0.51%
==========================================
Files 44 45 +1
Lines 1660 1674 +14
==========================================
+ Hits 1599 1604 +5
- Misses 61 70 +9
|
@andrecsilva it's possible that I'm misunderstanding something but you said this:
In order to run all codemods for that file, I would need to run semgrep for each applicable codemod on each file. Our current loop looks something like this: for codemod in requested_codemods:
codemod.maybe_apply_semgrep()
for file in requested_files:
codemod.make_changes() But it seems like you are suggesting something like this: for file in requested_files:
# This means I never need to look at this file again and can potentially cache the tree
for codemod in requested_codemods:
codemod.maybe_apply_semgrep()
codemod.make_changes() There are probably some clever ways to get around this contraint. And you are right that we could potentially use tempfiles only for modified files but only if we run semgrep once per file. There is of course an implicit tradeoff here, but it seems obvious (but also possibly wrong) to me that running semgrep for every file is going to be more expensive than the file I/O we perform currently. But the balance may tip in the other direction if we eventually implement more codemods purely in terms of libcst. |
674961e
to
f27f98f
Compare
Below is a more detailed take of what I'm suggesting. for file in requested_files:
# This means I never need to look at this file again
tree = cst.parse_module(open(file).read())
for codemod in requested_codemods:
results = run_semgrep(codemod.YAML_FILES,tree) if codemod.is_semgrep else {}
tree = codemod.make_changes(tree, results)
# the final tree will be written to file
open(file).write(tree.code) where def run_semgrep(yaml_files, tree):
# could be a common dir for the whole codemodder run
temp_dir = ...
temp_file = NamedTemporaryFile(..., dir = temp_dir)
temp_file.write(tree.code)
return run_on_directory(yaml_files, temp_dir) This way we only run semgrep / write to disk as many times as there are semgrep codemods. The way it's done right now means we write to disk after every codemod.
Why? I don't understand why you think can only do that only if we run semgrep once per file. |
@andrecsilva yes but that's exactly my point: the semgrep invocation in your example is now within the inner loop which means it is getting invoked |
In the following snippet: for codemod in requested_codemods:
codemod.maybe_apply_semgrep()
for file in requested_files:
codemod.make_changes() Semgrep is called once per codemod, true. But you still have to run each codemod rule for each file in the directory (otherwise, how are you getting the results for that file?). Surely in my proposed solution there are more calls to semgrep, but the number of times a particular rule is ran is the same (once per codemod x file). Yes, there is an extra overhead because in calling semgrep more times, but we have less file writes. Moreover this overhead and the writes are now proportional to the number of semgrep codemods. I still think it's a net positive, specially since we expect to decrease the number of codemods that rely on semgrep. |
You may be right but I'm less certain. In either case, this is something that could be answered empirically. I'd like to suggest that this kind of optimization is probably best undertaken as a separate ticket/effort so that we don't block the availability of this feature in the meantime. |
are you able to see the files in codecov? I just see |
@clavedeluna I'm seeing the same thing. I'm hoping we can just ignore the (small) coverage drop for now and address it later. @andrecsilva you requested changes on this PR but I'm hoping that after our discussion we can defer any potential optimization to a future PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as you understand the highlighted caveats, I'm ok approving this.
Overview
Correctly support multiple codemod composition
Description
--codemod-include
CodemodExecutorWrapper
as an intermediate workaround until we revisit the codemod class hierarchyFuture Work
--dry-run
option should operate on a copy of the entire source tree since right now it doesn't work correctly with multiple codemods (not currently a problem for the platform)