Refactoring '''get_extractor''' in capa/main.py #1842

aaronatp · 2023-11-08T05:02:06Z

Checklist

No CHANGELOG update needed

No new tests needed

No documentation update needed

These edits address issue #1813.

The edits are meant to ensure that '''get_extractor''' does not perform more than one significant function. In addition, I hope these edits make the logic of the '''get_extractor''' function more clear. With the edits, it is apparent that the function: 1) performs some preliminary error checking; 2) checks whether the '''format_''' or '''backend''' match any existing extractor handling routines; and, 3) if not, raises an error.

Unfortunately, I pushed these changes too early and am now unable to make further edits. On line 597, "format" should be changed to "format_". In addition, there are three blank lines between some functions; this is inconsistent with the rest of capa/main.py, and under PEP 8 they should be replaced with two blank lines. I believe these edits are otherwise compliant with the CONTRIBUTING.md guide. Please let me know if I have overlooked something.

I do not believe I have a PR address.

If there are any other questions or concerns about these changes, I would be happy to enhance these edits further.

aaronatp

This code fails the "code_style" test for these reasons:

"capa/main.py:546:16: F401 binaryninja imported but unused; consider using importlib.util.find_spec to test for availability
capa/main.py:547:33: F401 binaryninja.BinaryView imported but unused; consider using importlib.util.find_spec to test for availability
capa/main.py:561:13: F821 Undefined name BinaryView
capa/main.py:561:26: F821 Undefined name binaryninja
Found 4 errors."

These errors seem to refer to the "import_binja" function, which imports "binaryninja" and "BinaryView"; and the "handle_binja_backend" function, which uses "binaryninja" and "BinaryView."

The test seems to be intended to prevent the use of a name that is not guaranteed to be imported. But the function that uses "binaryninja" and "BinaryView" is only called by the function that imports "binaryninja" and "BinaryView"; and the function that imports "binaryninja" and "BinaryView" is always called by the function that uses those modules.

Therefore, even though this code fails the "code_style" test, it doesn't seem like this is necessarily an issue.
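A small sketch of the two points in the ruff report. It assumes the imports sit inside the body of "import_binja" (the PR's code is not shown here, so this is a guess); "colorsys" stands in for "binaryninja", and all function names below are hypothetical. In Python, an import inside a function binds the name only in that function's local scope, which may be why F821 is reported for the function that uses the names. The second part shows the "importlib.util.find_spec" availability check that the F401 message suggests.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def import_helper() -> None:
    # This import binds `colorsys` only inside import_helper's local scope.
    import colorsys  # noqa: F401


def use_after_helper() -> str:
    import_helper()
    # `colorsys` was never bound in this function or at module level,
    # so this lookup raises NameError at runtime.
    return colorsys.__name__  # noqa: F821
```

Under these assumptions, calling the helper first does not make the name visible to the caller, so the lint error would point at a real runtime failure rather than a style-only issue.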

mr-tz · 2023-11-08T13:57:12Z

Some changes here look good. The issue was initially meant to simplify the invocation and handling of samples around

- identifying the format and
- getting an extractor

Do you see potential for those and would you also consider #1821 while taking a look into this?

mr-tz · 2023-11-08T13:59:23Z

capa/main.py

-
-        if os_ == OS_AUTO and not is_supported_os(path):
-            raise UnsupportedOSError()
+        check_supported_format(path, os_)


reconsidering, I don't feel like these additional functions necessarily make the code easier to read

especially here where it's not obvious that this may raise an exception...

That is not something I thought about before, but this is helpful to consider. Thank you!
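The readability point in this thread can be sketched as follows. "OS_AUTO", "is_supported_os", "UnsupportedOSError", and "check_supported_format" are names taken from the diff, but their bodies here are stand-ins, and the two "get_extractor_*" functions are hypothetical. Both variants behave the same; the difference is only whether the possible exception is visible at the call site.

```python
OS_AUTO = "auto"  # stand-in value for illustration


class UnsupportedOSError(ValueError):
    """Stand-in for the exception named in the diff."""


def is_supported_os(path: str) -> bool:
    # Stand-in check: pretend only .exe samples have a supported OS.
    return path.endswith(".exe")


def get_extractor_inline(path: str, os_: str) -> str:
    # The raise is visible right where the reader is looking.
    if os_ == OS_AUTO and not is_supported_os(path):
        raise UnsupportedOSError()
    return f"extractor({path})"


def check_supported_format(path: str, os_: str) -> None:
    # Same logic, but hidden behind a name that does not mention raising.
    if os_ == OS_AUTO and not is_supported_os(path):
        raise UnsupportedOSError()


def get_extractor_wrapped(path: str, os_: str) -> str:
    check_supported_format(path, os_)  # may raise; not obvious from the name
    return f"extractor({path})"
```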

mr-tz · 2023-11-08T14:00:15Z

capa/main.py

-        import capa.features.extractors.dnfile.extractor
-
-        return capa.features.extractors.dnfile.extractor.DnfileFeatureExtractor(path)
+        return handle_dotnet_format(format_)


... or here where only a few lines are wrapped

So, I'd vote to leave this function as is.

aaronatp · 2023-11-08T21:32:18Z

Hi @mr-tz ,

Thanks for these comments, and for explaining what issue #1813 was meant to achieve.

I think that refactoring some of the '''get_extractor''' format-identifying components could potentially reduce the function's complexity. Looking at the original code that specifically handles the binja and viv extractors, the code looks reasonably simple, although I could be wrong. Are there particular parts of the code for identifying the format and getting the extractor that you think could be simplified? Are you referring to the invocation of '''get_extractor''' in the '''main''' function?

I agree with what @williballenthin indicated in his proposals in #1821, which is that '''get_extractor''' and many of the helper functions that it relies on should probably be moved to a different file. Since '''get_extractor''' and the helper functions it relies on have a relatively specific goal (to return FeatureExtractor objects), it may even be worth moving '''get_extractor''' and its helper functions to a feature_extractor.py file. Are there other things about #1821 you think may be worth considering?

In addition, it may be a good idea to move the try-except statement in the viv-extractor code to a new function. While the try-except clause is not especially complex, I found it a bit visually unwieldy, especially considering that the clause is nested within both a "with" and an "if-else" statement.
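A minimal sketch of what pulling the try-except into its own function might look like. "attempt_save_workspace" is the name used elsewhere in this thread; "FakeWorkspace" is a hypothetical stand-in for the real workspace object, and the boolean return value is an addition for illustration, not something the PR is known to do.

```python
import logging

logger = logging.getLogger("example")


class FakeWorkspace:
    """Hypothetical stand-in for a workspace; only saveWorkspace() is modeled."""

    def __init__(self, writable: bool) -> None:
        self.writable = writable
        self.saved = False

    def saveWorkspace(self) -> None:
        if not self.writable:
            raise IOError("directory is not writable")
        self.saved = True


def attempt_save_workspace(vw: FakeWorkspace) -> bool:
    """Try to save the workspace; log and carry on if saving fails."""
    try:
        vw.saveWorkspace()
    except IOError:
        logger.info("source directory is not writable, won't save intermediate workspace")
        return False
    return True
```

The caller then needs only a single line inside the "with" and "if" blocks, at the cost of moving the error handling out of sight.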

Also, thank you for pointing out that the names of error handling functions should make clear that they may raise exceptions. This is not something that I had considered before, but I certainly see the value. I have changed the name of the '''is_supported_format''' function to '''check_unsupported_raise_exception''' to reflect this. I have also changed the name of '''import_binja''' to '''attempt_binja_import''', and created '''attempt_save_workspace''', to reflect the error handling abilities of these functions.

Please let me know if you have any more questions or comments about these edits, or if you think I should make further enhancements.

Best,

Aaron

mr-tz · 2023-11-10T08:15:55Z

capa/main.py

-
-        if os_ == OS_AUTO and not is_supported_os(path):
-            raise UnsupportedOSError()
+        check_unsupported_raise_exception(path, os_)


I'd propose to just leave the code verbatim here instead of in a new function. Or do you see much benefit added by the function?

mr-tz · 2023-11-10T08:17:06Z

capa/main.py

+def handle_binja_backend(path: Path, disable_progress: bool) -> FeatureExtractor:
+    import capa.features.extractors.binja.extractor
+
+    attempt_binja_import()


I think this can also stay in here and likely resolves the ruff error?

mr-tz · 2023-11-10T08:18:06Z

capa/main.py

+
+        if should_save_workspace:
+            logger.debug("saving workspace")
+            attempt_save_workspace(vw)


this is also simple enough to just leave here

mr-tz · 2023-11-10T08:19:41Z

Thanks for the details and code suggestions!!

> Are you referring to the invocation of '''get_extractor''' in the '''main''' function?

Yes, that's what I meant originally. Can we wrap the calling and error handling to get an extractor?

> Are there other things about #1821 you think may be worth considering?

I'd propose to start with what's listed in proposal 3 and we can further split the functions up if needed.

Overall, I still think we don't need to split the function in so many sub parts but would be very happy to see the refactor into a separate module here.

aaronatp · 2023-11-11T02:08:21Z

Hi @mr-tz, thank you for your comments and thoughts! Do you think that using a decorator would wrap the calling and error handling as you imagine? I have drafted a decorator to perform this error handling and submitted it for review. In addition to cleaning up the error handling for '''get_extractor''', a decorator may help to avoid some of the boilerplate code used for logging errors in capa/helpers.py.

However, one benefit of the original code is that it is clear that the function returns if an exception is raised, which lets a reader easily follow the code flow. With a decorator, the error handling is largely hidden, which may make it more difficult for someone reading through main.py to get a good sense of what the program is doing: in addition to looking at main.py, they will have to look at where the decorator is defined. This may be a drawback of using a decorator to perform error handling for '''get_extractor'''.

Please let me know if this is closer to what you hoped to achieve with #1813. If not, I would be very happy to change it and make improvements!
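For readers following the thread, here is a rough sketch of the kind of error-handling decorator being discussed. It is not the code submitted in the PR: the decorator name "handle_unsupported_format", the error-code value, the log message, and the toy "get_extractor" body are all assumptions made for illustration.

```python
import functools
import logging

logger = logging.getLogger("example")

E_INVALID_FILE_TYPE = 1  # placeholder value, not taken from the capa source


class UnsupportedFormatError(ValueError):
    """Stand-in for the exception of the same name."""


def handle_unsupported_format(func):
    """Catch UnsupportedFormatError, log it, and return an error code instead."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except UnsupportedFormatError:
            logger.error("input file does not appear to be a supported file type")
            return E_INVALID_FILE_TYPE

    return wrapper


@handle_unsupported_format
def get_extractor(path: str) -> str:
    # Toy body: pretend only .exe files are supported.
    if not path.endswith(".exe"):
        raise UnsupportedFormatError()
    return f"extractor({path})"
```

The sketch also makes the trade-off concrete: the decorated function now returns either an extractor or an integer error code, and nothing at the call site hints at that unless the reader looks up the decorator.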

Would you also be able to say a little more about how #1821 connects to refactoring the error handling for '''get_extractor'''? In particular, is there a certain way you think #1821 could influence the refactoring of the error handling? Unfortunately, because I am fairly new to contributing to large codebases, I don't have a strong understanding yet of some of these matters. I really appreciate your patience and help thus far!

mr-tz · 2023-11-12T11:13:37Z

Nice idea, I think a decorator could work well here. I also like the potential simplification around the log_unsupported_* functions.

Let's focus this PR on these changes (roll back the first 5 commits, for now) and then work on changes related to #1821 separately as they're not directly related. Sorry for the confusion :)

mr-tz · 2023-11-12T11:14:28Z

capa/helpers.py

+    e_list, return_values = [(UnsupportedFormatError E_INVALID_FILE_TYPE),
+                             (UnsupportedArchError, E_INVALID_FILE_ARCH), 
+                             (UnsupportedOSError, E_INVALID_FILE_OS)]
+
+    messsage_list = [ # UnsupportedFormatError


can these go into the same list of tuples?
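One possible shape for the single list of tuples suggested here, as a sketch only. The exception and constant names come from the diff above; the numeric values, the message strings, the table name "UNSUPPORTED_ERRORS", and the helper "lookup_error" are hypothetical.

```python
from typing import Optional, Tuple

# Placeholder values, not taken from the capa source.
E_INVALID_FILE_TYPE = 1
E_INVALID_FILE_ARCH = 2
E_INVALID_FILE_OS = 3


class UnsupportedFormatError(ValueError):
    pass


class UnsupportedArchError(ValueError):
    pass


class UnsupportedOSError(ValueError):
    pass


# One row per error: (exception type, return code, log message).
UNSUPPORTED_ERRORS = [
    (UnsupportedFormatError, E_INVALID_FILE_TYPE, "input file does not appear to be a supported file type"),
    (UnsupportedArchError, E_INVALID_FILE_ARCH, "input file does not appear to target a supported architecture"),
    (UnsupportedOSError, E_INVALID_FILE_OS, "input file does not appear to target a supported OS"),
]


def lookup_error(exc: Exception) -> Optional[Tuple[int, str]]:
    """Return (return code, message) for a known exception, else None."""
    for exc_type, code, message in UNSUPPORTED_ERRORS:
        if isinstance(exc, exc_type):
            return code, message
    return None
```

Keeping each exception next to its code and message means a row cannot drift out of step with a parallel list.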

mr-tz · 2023-11-12T11:16:55Z

capa/helpers.py

-    )
-    logger.error(" If you don't know the input file type, you can try using the `file` utility to guess it.")
-    logger.error("-" * 80)
+def exceptUnsupportedError(func):


let's use snake case names and maybe rename this to catch_log_return_errors or similar?

then let's use it via @<decorator_name> (see, e.g., https://rinaarts.com/declutter-python-code-with-error-handling-decorators/)

aaronatp · 2023-11-13T02:09:10Z

Hi @mr-tz thanks for this feedback! I have made the changes you suggested. Please let me know if you think the list unpacking looks good. I am unsure if the list unpacking is very readable with the message tuples included. I have put the decorator above the function definition. I don't know if it's possible to place the decorator above the function invocation and also assign the function output to a variable - it seems like this would be good because the potential error handling would be visible in the '''main''' function.
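On the question of placing the decorator at the invocation: the "@" syntax is only valid directly before a function or class definition, but a decorator is an ordinary callable, so it can be applied by hand at the call site and the result assigned to a variable. The sketch below uses the name "catch_log_return_errors" suggested in the review; the error-code value and the toy "get_extractor" and "main" bodies are assumptions for illustration.

```python
import functools

E_INVALID_FILE_TYPE = 1  # placeholder value, not taken from the capa source


class UnsupportedFormatError(ValueError):
    """Stand-in for the exception of the same name."""


def catch_log_return_errors(func):
    """Return a wrapper that converts UnsupportedFormatError into an error code."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except UnsupportedFormatError:
            return E_INVALID_FILE_TYPE

    return wrapper


def get_extractor(path: str) -> str:
    # Left undecorated so that it still raises when called directly.
    if not path.endswith(".exe"):
        raise UnsupportedFormatError()
    return f"extractor({path})"


def main(path: str) -> int:
    # The error handling is applied, and visible, at the point of invocation.
    result = catch_log_return_errors(get_extractor)(path)
    if isinstance(result, int):
        return result  # an error code came back instead of an extractor
    return 0
```

This keeps the handling visible in "main" while leaving "get_extractor" free to raise for other callers.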

Also, no worries regarding #1821! I will roll back the first 5 commits. Thank you for your feedback so far! Please let me know if you have any more comments or suggestions!

This reverts commit d649897.

aaronatp · 2023-11-13T07:02:55Z

Hi @mr-tz I reverted the first 5 or so commits - is this how I should roll back those commits? I am not sure if my reverting worked. If not, I was considering deleting this branch and creating a fresh one with just the most recent commits. Do you think that would be a sensible option?

This reverts commit d649897.

ghost · 2023-11-13T07:29:06Z

Yes, I think creating a new branch may be the easier option at this point.

aaronatp · 2023-11-13T07:37:13Z

@mr-tz great will do, thank you.

mr-tz · 2023-11-13T10:56:33Z

Thanks, #1845 supersedes this so feel free to close it here.
At first glance the new PR looks good; I will review it in more detail later.

aaronatp · 2023-11-13T18:23:19Z

@mr-tz sounds good, will do! Thank you!

aaronatp added 3 commits November 7, 2023 17:49

Update main.py

d061e0c

Incremental PR improvements

d46fa26

Update main.py

222cd6c

aaronatp commented Nov 8, 2023

Update main.py

d649897

mr-tz reviewed Nov 8, 2023

Update main.py

0aab720

mr-tz reviewed Nov 10, 2023

aaronatp added 3 commits November 10, 2023 17:34

Updated handling for logging Unsupported Format, Arch, and OS

50b4b06

Update helpers.py

a9ead12

Update main.py

bc616d0

mr-tz requested changes Nov 12, 2023

aaronatp added 2 commits November 12, 2023 17:22

Update helpers.py

329ac2d

Update main.py

4162c90

aaronatp added 7 commits November 12, 2023 20:57

Revert "Update main.py"

3533550

This reverts commit d649897.

Revert "Update main.py"

c8a2003

This reverts commit d649897.

Revert "Update main.py"

3d78316

This reverts commit d649897.

Revert "Update main.py"

932b36e

This reverts commit d649897.

Revert "Update main.py"

b755227

This reverts commit d649897.

Revert "Update main.py"

98d15fd

This reverts commit d649897.

Revert "Update main.py"

d3aead9

This reverts commit d649897.

aaronatp added 4 commits November 12, 2023 20:59

Revert "Update main.py"

4247a94

This reverts commit d649897.

Revert "Update main.py"

faa4c0c

This reverts commit d649897.

Revert "Update main.py"

f96a5ff

This reverts commit d649897.

Revert "Update main.py"

10d8a20

This reverts commit d649897.

aaronatp added 3 commits November 12, 2023 23:12

Revert "Update main.py"

4acd8cd

This reverts commit d649897.

Revert "Update main.py"

88d725f

This reverts commit d649897.

Revert "Update main.py"

f537838

This reverts commit d649897.

aaronatp closed this Nov 13, 2023
