-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add name resolution to threading codemod #71
Conversation
Codecov Report
@@ Coverage Diff @@
## main #71 +/- ##
==========================================
+ Coverage 96.13% 96.22% +0.08%
==========================================
Files 46 46
Lines 1785 1826 +41
==========================================
+ Hits 1716 1757 +41
Misses 69 69
|
31cb417
to
9d69aff
Compare
9c05109
to
b22a4d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clavedeluna I think this is a really tricky problem so thanks for being patient with the feedback and iterations. I have a few comments that boil down to the following:
- Can we use more accurate names for semaphores and conditions?
- Is there a more general iterative solution to avoid name collisions?
- Are we properly handling multiple changes to the same module?
- Are we properly handling nested cases?
@@ -86,3 +86,56 @@ def test_no_effect_multiple_with_clauses(self, tmpdir, klass): | |||
... | |||
""" | |||
self.run_and_assert(tmpdir, input_code, input_code) | |||
|
|||
|
|||
class TestThreadingNameResolution(BaseSemgrepCodemodTest): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need another test here:
from threading import Lock
with Lock():
foo()
with Lock():
bar()
This case probably isn't such a concern because variable reuse is probably acceptable. But we also need another test case:
with Lock():
with Lock():
foo()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrecsilva wondering if these are the cases you're referring to by suggesting to implement get current names at init? because right now the codemod will change
with Lock():
with Lock():
foo()
to
import threading
lock = threading.Lock()
with lock:
lock = threading.Lock()
with lock:
foo()
which is not great since it itself is adding a name clash. This is happening because both calls to leave_With
happen one after the other, with the second one not "learning about" the new addition of the first lock = ...
. I don't believe this would be solved by moving current_names = self.find_used_names_in_module()
to codemod init because that too will happen only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this was what I was afraid it would happen.
To reiterate, my suggestion is moving the calculation of current_names
to WithThreadingLock
's __init__
and update current_names
with the newly added name within leave_With
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh I see what you mean now.
def find_used_names_in_module(self): | ||
""" | ||
Find all the used names in the scope of a libcst Module. | ||
""" | ||
names = [] | ||
scope = self.find_global_scope() | ||
if scope is None: | ||
return [] # pragma: no cover | ||
|
||
nodes = [x.node for x in scope.assignments] | ||
for other_nodes in nodes: | ||
visitor = GatherNamesVisitor() | ||
other_nodes.visit(visitor) | ||
names.extend(visitor.names) | ||
return names | ||
|
||
def find_global_scope(self): | ||
"""Find the global scope for a libcst Module node.""" | ||
scopes = self.context.wrapper.resolve(ScopeProvider).values() | ||
for scope in scopes: | ||
if isinstance(scope, GlobalScope): | ||
return scope | ||
return None # pragma: no cover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it currently is, It's grabbing the names based on the global scope. This means no matter where the Lock()
call is detected current_names
will always return the same names.
You should calculate the current_names
at __init__
and update it accordingly when you add a new name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means no matter where the Lock() call is detected current_names will always return the same names.
this is correct and I believe exactly what we want. When we hit this threading codemod at leave_With
, we are hitting a module's with clause. With the current approach, we grab the entire module where the with clause is located and find all the variable / import names. This means we will attempt to never pick a variable name for the lock at any scope, even if we could pick lock
bc it's used in some other function. This is the safest thing.
Calculating current_names at init will result in the exact same thing but the downside is that we have to update the codemod architecture. With this approach, all I had to do is to update leave_With
and call find_used_names_in_module
. This architecture is more generalizable than adding an init call.
If you think I'm mistaken here give me a unit test that would fail the current approach so I can iterate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more of a question but I'm unclear what "global" scope means in this context. Is it the module-level scope? Or is it some kind of scope containing all variables declared anywhere in the module? I would be rather surprised to find out that it's the latter. But here's a potential test case:
# no conflicts declared at module-level scope
def my_func():
# Will we correctly detect this conflict?
lock = "whatever"
with threading.Lock():
foo()
I'm definitely curious about this, especially since I think all of the existing test cases are implemented in terms of module-level scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is it some kind of scope containing all variables declared anywhere in the module?
It is exactly this. I said it somewhere along the lines but it will in fact find all variable / import names. This is good bc it's extra safe, but of course there are scopes in which we could reuse but we don't.
I"ve added the test case that you've pointed out and it passes as expected - we add a new lock_1
variable right under lock = "whatever"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's module-level scope. I believe the current implementation will be able to detect the conflict in your particular example.
4142817
to
f5054e1
Compare
Alright so here is the current state of this PR and addressing some of Dan's questions
I have updated the PR so that variable names would be
The current general solution is you'd be able to call
I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why So what should we do? I'm sure libcst is smart enough to be able to handle this, but not with the current architecture. The codemod will have to be a bit more complex to handle this relatively rare case. I'd like to propose we accept the current implementation with this edge case as is and come back to it. This is an interesting yet LOW importance codemod, so I'd like to move on to more important ones. The added unit tests currently demonstrate this drawback. |
489832f
to
330bb6b
Compare
In libcst, the tree won't actually change while visiting the nodes. That is, when you return a node within a An naive solution to that is to update a single call per codemod run. This way the metadata will be accurate after every change. However this will be fairly expensive and a simple solution of keeping track of the newly added names should do. I can probably scrounge a more complex (and general) solution that is scope aware and can deal with the edge cases within a single codemod call. I don't mind revisiting this later after I'm done with my current tasks. |
330bb6b
to
8784bce
Compare
Alright! I believe I've now addressed all issues in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.
Why can't we just keep a list of variable names that we have created and make sure that we don't collide with any of those either? For example, if the codemod adds a variable named lock, can't we just record that somehow and make sure we don't reuse that name in a subsequent change?
I was able to do just that finally. Sorry for the back and forth, I needed some further explanation on the review from Andre. I now am able to do exactly what we want. |
with lock_1: | ||
lock = threading.Lock() | ||
with lock: | ||
print() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notice this test. libcst processes the internal with statement first, so our naming will be like this. Not a huuuge deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the continued back and forth but I'm asking for another test case which should help clarify @andrecsilva's question.
def find_used_names_in_module(self): | ||
""" | ||
Find all the used names in the scope of a libcst Module. | ||
""" | ||
names = [] | ||
scope = self.find_global_scope() | ||
if scope is None: | ||
return [] # pragma: no cover | ||
|
||
nodes = [x.node for x in scope.assignments] | ||
for other_nodes in nodes: | ||
visitor = GatherNamesVisitor() | ||
other_nodes.visit(visitor) | ||
names.extend(visitor.names) | ||
return names | ||
|
||
def find_global_scope(self): | ||
"""Find the global scope for a libcst Module node.""" | ||
scopes = self.context.wrapper.resolve(ScopeProvider).values() | ||
for scope in scopes: | ||
if isinstance(scope, GlobalScope): | ||
return scope | ||
return None # pragma: no cover |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more of a question but I'm unclear what "global" scope means in this context. Is it the module-level scope? Or is it some kind of scope containing all variables declared anywhere in the module? I would be rather surprised to find out that it's the latter. But here's a potential test case:
# no conflicts declared at module-level scope
def my_func():
# Will we correctly detect this conflict?
lock = "whatever"
with threading.Lock():
foo()
I'm definitely curious about this, especially since I think all of the existing test cases are implemented in terms of module-level scope.
8784bce
to
fb55680
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small changes to a test.
fb55680
to
800fb77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for all of the back-and-forth 😅
Overview
Threading codemod that adds a new variable should name the variable to one that's not already in the module scope
Description
lock
orlock_cm
lock
orlock_cm
for the new lock variable, in hopes that either is never used. There may still be a tiny potential for name collusion.