add name resolution to threading codemod #71

clavedeluna · 2023-10-11T12:05:02Z

Overview

Threading codemod that adds a new variable should name the variable to one that's not already in the module scope

Description

get all the named vars / attrs / imports in the input code and from there determine if we should name the new lock either lock or lock_cm
we find the code module's Global scope to find all the used names. We then pick either lock or lock_cm for the new lock variable, in hopes that either is never used. There may still be a tiny potential for name collusion.
Finding the global scope for the node's module is sufficient to get all the used names in the module. I tested this against getting used names for all nodes (function scope, class scope, global scope, etc) and turns out global scope contains everything.

codecov · 2023-10-11T12:13:18Z

Codecov Report

Merging #71 (800fb77) into main (8128ee6) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
+ Coverage   96.13%   96.22%   +0.08%     
==========================================
  Files          46       46              
  Lines        1785     1826      +41     
==========================================
+ Hits         1716     1757      +41     
  Misses         69       69

Files	Coverage Δ
src/codemodder/codemods/utils_mixin.py	`90.47% <100.00%> (+3.17%)`	⬆️
src/core_codemods/with_threading_lock.py	`100.00% <100.00%> (ø)`

src/core_codemods/with_threading_lock.py

src/codemodder/codemods/utils_mixin.py

src/core_codemods/with_threading_lock.py

drdavella

@clavedeluna I think this is a really tricky problem so thanks for being patient with the feedback and iterations. I have a few comments that boil down to the following:

Can we use more accurate names for semaphores and conditions?
Is there a more general iterative solution to avoid name collisions?
Are we properly handling multiple changes to the same module?
Are we properly handling nested cases?

src/core_codemods/with_threading_lock.py

drdavella · 2023-10-12T13:42:39Z

tests/codemods/test_with_threading_lock.py

@@ -86,3 +86,56 @@ def test_no_effect_multiple_with_clauses(self, tmpdir, klass):
    ...
 """
        self.run_and_assert(tmpdir, input_code, input_code)
+
+
+class TestThreadingNameResolution(BaseSemgrepCodemodTest):


I think we need another test here:

from threading import Lock with Lock(): foo() with Lock(): bar()

This case probably isn't such a concern because variable reuse is probably acceptable. But we also need another test case:

with Lock(): with Lock(): foo()

@andrecsilva wondering if these are the cases you're referring to by suggesting to implement get current names at init? because right now the codemod will change

with Lock(): with Lock(): foo()

to

import threading lock = threading.Lock() with lock: lock = threading.Lock() with lock: foo()

which is not great since it itself is adding a name clash. This is happening because both calls to leave_With happen one after the other, with the second one not "learning about" the new addition of the first lock = .... I don't believe this would be solved by moving current_names = self.find_used_names_in_module() to codemod init because that too will happen only once.

Yes, this was what I was afraid it would happen.
To reiterate, my suggestion is moving the calculation of current_names to WithThreadingLock's __init__ and update current_names with the newly added name within leave_With.

ahh I see what you mean now.

andrecsilva · 2023-10-16T11:02:45Z

src/codemodder/codemods/utils_mixin.py

+    def find_used_names_in_module(self):
+        """
+        Find all the used names in the scope of a libcst Module.
+        """
+        names = []
+        scope = self.find_global_scope()
+        if scope is None:
+            return []  # pragma: no cover
+
+        nodes = [x.node for x in scope.assignments]
+        for other_nodes in nodes:
+            visitor = GatherNamesVisitor()
+            other_nodes.visit(visitor)
+            names.extend(visitor.names)
+        return names
+
+    def find_global_scope(self):
+        """Find the global scope for a libcst Module node."""
+        scopes = self.context.wrapper.resolve(ScopeProvider).values()
+        for scope in scopes:
+            if isinstance(scope, GlobalScope):
+                return scope
+        return None  # pragma: no cover


As it currently is, It's grabbing the names based on the global scope. This means no matter where the Lock() call is detected current_names will always return the same names.

You should calculate the current_names at __init__ and update it accordingly when you add a new name.

This means no matter where the Lock() call is detected current_names will always return the same names.

this is correct and I believe exactly what we want. When we hit this threading codemod at leave_With, we are hitting a module's with clause. With the current approach, we grab the entire module where the with clause is located and find all the variable / import names. This means we will attempt to never pick a variable name for the lock at any scope, even if we could pick lock bc it's used in some other function. This is the safest thing.

Calculating current_names at init will result in the exact same thing but the downside is that we have to update the codemod architecture. With this approach, all I had to do is to update leave_With and call find_used_names_in_module. This architecture is more generalizable than adding an init call.

If you think I'm mistaken here give me a unit test that would fail the current approach so I can iterate.

This is more of a question but I'm unclear what "global" scope means in this context. Is it the module-level scope? Or is it some kind of scope containing all variables declared anywhere in the module? I would be rather surprised to find out that it's the latter. But here's a potential test case:

# no conflicts declared at module-level scope def my_func(): # Will we correctly detect this conflict? lock = "whatever" with threading.Lock(): foo()

I'm definitely curious about this, especially since I think all of the existing test cases are implemented in terms of module-level scope.

Or is it some kind of scope containing all variables declared anywhere in the module?

It is exactly this. I said it somewhere along the lines but it will in fact find all variable / import names. This is good bc it's extra safe, but of course there are scopes in which we could reuse but we don't.

I"ve added the test case that you've pointed out and it passes as expected - we add a new lock_1 variable right under lock = "whatever"

It's module-level scope. I believe the current implementation will be able to detect the conflict in your particular example.

clavedeluna · 2023-10-16T12:52:16Z

Alright so here is the current state of this PR and addressing some of Dan's questions

Can we use more accurate names for semaphores and conditions?

I have updated the PR so that variable names would be lock or lock_cm or condition or condition_cm, etc. I still retained the _cm suffix because I see it equivalent to adding a _1 suffix. I can change if you disagree.

Is there a more general iterative solution to avoid name collisions?

The current general solution is you'd be able to call current_names = self.find_used_names_in_module() in any codemod. See caveats below.

Are we properly handling multiple changes to the same module?
Are we properly handling nested cases?

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

So what should we do? I'm sure libcst is smart enough to be able to handle this, but not with the current architecture. The codemod will have to be a bit more complex to handle this relatively rare case. I'd like to propose we accept the current implementation with this edge case as is and come back to it. This is an interesting yet LOW importance codemod, so I'd like to move on to more important ones. The added unit tests currently demonstrate this drawback.

andrecsilva · 2023-10-16T13:29:27Z

Are we properly handling multiple changes to the same module?
Are we properly handling nested cases?

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

So what should we do? I'm sure libcst is smart enough to be able to handle this, but not with the current architecture. The codemod will have to be a bit more complex to handle this relatively rare case. I'd like to propose we accept the current implementation with this edge case as is and come back to it. This is an interesting yet LOW importance codemod, so I'd like to move on to more important ones. The added unit tests currently demonstrate this drawback.

In libcst, the tree won't actually change while visiting the nodes. That is, when you return a node within a leave_* method, the changes won't really be "committed" to the tree right away. The tree will only actually transform after the visiting is done (this is the reason things like RemovalSentinel exists). Thus, the metadata is calculated against the original tree when visiting. This is by design and unlikely to change.

An naive solution to that is to update a single call per codemod run. This way the metadata will be accurate after every change. However this will be fairly expensive and a simple solution of keeping track of the newly added names should do.

I can probably scrounge a more complex (and general) solution that is scope aware and can deal with the edge cases within a single codemod call. I don't mind revisiting this later after I'm done with my current tasks.

clavedeluna · 2023-10-16T13:59:57Z

Alright! I believe I've now addressed all issues in this PR.

drdavella

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

Why can't we just keep a list of variable names that we have created and make sure that we don't collide with any of those either? For example, if the codemod adds a variable named lock, can't we just record that somehow and make sure we don't reuse that name in a subsequent change?

clavedeluna · 2023-10-16T18:40:08Z

I have added the unit test for these and what we find is that while we can properly handle name collision for existing code, we can't properly handle our own introduction of name collision. That is, in the rare case of having to add two locks/semaphores, etc to a file, we will re-use the same variable name. This is not as terrible as it sounds since it will just re-create the thread. However, it isn't great. I tried to look into why self.find_used_names_in_module() doesn't reflect the added lock variable when the second with condition is called. It seems to be because libcst is calling on these two with statements in sequence, one after the other, and the global state isn't getting updated.

Why can't we just keep a list of variable names that we have created and make sure that we don't collide with any of those either? For example, if the codemod adds a variable named lock, can't we just record that somehow and make sure we don't reuse that name in a subsequent change?

I was able to do just that finally. Sorry for the back and forth, I needed some further explanation on the review from Andre. I now am able to do exactly what we want.

clavedeluna · 2023-10-16T18:40:56Z

tests/codemods/test_with_threading_lock.py

+with lock_1:
+    lock = threading.Lock()
+    with lock:
+        print()


notice this test. libcst processes the internal with statement first, so our naming will be like this. Not a huuuge deal

drdavella

Sorry for the continued back and forth but I'm asking for another test case which should help clarify @andrecsilva's question.

drdavella · 2023-10-16T20:02:16Z

src/codemodder/codemods/utils_mixin.py

+    def find_used_names_in_module(self):
+        """
+        Find all the used names in the scope of a libcst Module.
+        """
+        names = []
+        scope = self.find_global_scope()
+        if scope is None:
+            return []  # pragma: no cover
+
+        nodes = [x.node for x in scope.assignments]
+        for other_nodes in nodes:
+            visitor = GatherNamesVisitor()
+            other_nodes.visit(visitor)
+            names.extend(visitor.names)
+        return names
+
+    def find_global_scope(self):
+        """Find the global scope for a libcst Module node."""
+        scopes = self.context.wrapper.resolve(ScopeProvider).values()
+        for scope in scopes:
+            if isinstance(scope, GlobalScope):
+                return scope
+        return None  # pragma: no cover


This is more of a question but I'm unclear what "global" scope means in this context. Is it the module-level scope? Or is it some kind of scope containing all variables declared anywhere in the module? I would be rather surprised to find out that it's the latter. But here's a potential test case:

# no conflicts declared at module-level scope def my_func(): # Will we correctly detect this conflict? lock = "whatever" with threading.Lock(): foo()

I'm definitely curious about this, especially since I think all of the existing test cases are implemented in terms of module-level scope.

andrecsilva

Small changes to a test.

tests/codemods/test_with_threading_lock.py

drdavella

Looks great, thanks for all of the back-and-forth 😅

clavedeluna force-pushed the name-resolution branch from 31cb417 to 9d69aff Compare October 11, 2023 12:15

clavedeluna marked this pull request as ready for review October 11, 2023 12:24

clavedeluna requested review from drdavella and andrecsilva as code owners October 11, 2023 12:24

clavedeluna commented Oct 11, 2023

View reviewed changes

src/core_codemods/with_threading_lock.py Outdated Show resolved Hide resolved

clavedeluna mentioned this pull request Oct 11, 2023

add threading codemod docs pixee/docs#88

Merged

andrecsilva requested changes Oct 11, 2023

View reviewed changes

src/codemodder/codemods/utils_mixin.py Show resolved Hide resolved

src/core_codemods/with_threading_lock.py Outdated Show resolved Hide resolved

clavedeluna requested a review from andrecsilva October 11, 2023 16:46

clavedeluna force-pushed the name-resolution branch 2 times, most recently from 9c05109 to b22a4d4 Compare October 12, 2023 11:06

drdavella requested changes Oct 12, 2023

View reviewed changes

andrecsilva reviewed Oct 16, 2023

View reviewed changes

clavedeluna force-pushed the name-resolution branch from 4142817 to f5054e1 Compare October 16, 2023 12:46

clavedeluna requested review from andrecsilva and drdavella October 16, 2023 12:52

clavedeluna force-pushed the name-resolution branch 2 times, most recently from 489832f to 330bb6b Compare October 16, 2023 13:07

clavedeluna force-pushed the name-resolution branch from 330bb6b to 8784bce Compare October 16, 2023 13:59

drdavella reviewed Oct 16, 2023

View reviewed changes

clavedeluna requested a review from drdavella October 16, 2023 18:40

clavedeluna commented Oct 16, 2023

View reviewed changes

drdavella requested changes Oct 16, 2023

View reviewed changes

clavedeluna added 3 commits October 17, 2023 07:55

add name resolution to threading codemod

2e584f4

gather all names in global scope

084851a

add no cover to just-in-case lines

08138e6

clavedeluna added 3 commits October 17, 2023 07:55

add test cases for multiple locks in a module

9cc3f8a

use type-specific names

a27a063

keep track of additional locks added

969d6bb

clavedeluna force-pushed the name-resolution branch from 8784bce to fb55680 Compare October 17, 2023 10:55

andrecsilva requested changes Oct 17, 2023

View reviewed changes

tests/codemods/test_with_threading_lock.py Outdated Show resolved Hide resolved

add unit test

800fb77

clavedeluna force-pushed the name-resolution branch from fb55680 to 800fb77 Compare October 17, 2023 11:25

clavedeluna requested review from drdavella and andrecsilva October 17, 2023 11:25

andrecsilva approved these changes Oct 17, 2023

View reviewed changes

drdavella approved these changes Oct 17, 2023

View reviewed changes

clavedeluna merged commit 8b93bae into main Oct 17, 2023
8 checks passed

clavedeluna deleted the name-resolution branch October 17, 2023 14:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add name resolution to threading codemod #71

add name resolution to threading codemod #71

clavedeluna commented Oct 11, 2023 •

edited

Loading

codecov bot commented Oct 11, 2023 •

edited

Loading

drdavella left a comment

drdavella Oct 12, 2023

clavedeluna Oct 16, 2023

andrecsilva Oct 16, 2023 •

edited

Loading

clavedeluna Oct 16, 2023

andrecsilva Oct 16, 2023

clavedeluna Oct 16, 2023

drdavella Oct 16, 2023

clavedeluna Oct 17, 2023

andrecsilva Oct 17, 2023

clavedeluna commented Oct 16, 2023

andrecsilva commented Oct 16, 2023

clavedeluna commented Oct 16, 2023

drdavella left a comment

clavedeluna commented Oct 16, 2023

clavedeluna Oct 16, 2023

drdavella left a comment

drdavella Oct 16, 2023

andrecsilva left a comment

drdavella left a comment

add name resolution to threading codemod #71

add name resolution to threading codemod #71

Conversation

clavedeluna commented Oct 11, 2023 • edited Loading

Overview

Description

codecov bot commented Oct 11, 2023 • edited Loading

Codecov Report

drdavella left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrecsilva Oct 16, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clavedeluna commented Oct 16, 2023

andrecsilva commented Oct 16, 2023

clavedeluna commented Oct 16, 2023

drdavella left a comment

Choose a reason for hiding this comment

clavedeluna commented Oct 16, 2023

Choose a reason for hiding this comment

drdavella left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrecsilva left a comment

Choose a reason for hiding this comment

drdavella left a comment

Choose a reason for hiding this comment

clavedeluna commented Oct 11, 2023 •

edited

Loading

codecov bot commented Oct 11, 2023 •

edited

Loading

andrecsilva Oct 16, 2023 •

edited

Loading