Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring '''get_extractor''' in capa/main.py #1842

Closed
wants to merge 24 commits into from

Conversation

aaronatp
Copy link
Contributor

@aaronatp aaronatp commented Nov 8, 2023

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

These edits address issue #1813.

The edits are meant to ensure that '''get_extractor''' does not perform more than one significant function. In addition, I hope these edits make the logic of the '''get_extractor''' function more clear. With the edits, it is apparent that the function: 1) performs some preliminary error checking; 2) checks whether the '''format_''' or '''backend''' match any existing extractor handling routines; and, 3) if not, raises an error.

Unfortunately, I pushed these changes too early and am now unable to make further edits. In lines 597, "format" should be changed to "format_." In addition, there are three blank lines between some functions - this is inconsistent with the other code in the capa/main.py file, and because of PEP8, these three blank lines should be replaced with two blank lines. I believe these edits are otherwise compliant with the CONTRIBUTING.md guide. Please let me know if I have overlooked something.

I do not believe I have a PR address.

If there are any other questions or concerns about these changes, I would be happy to enhance these edits further.

Copy link
Contributor Author

@aaronatp aaronatp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code fails the "code_style" test for these reasons:

"capa/main.py:546:16: F401 binaryninja imported but unused; consider using importlib.util.find_spec to test for availability
capa/main.py:547:33: F401 binaryninja.BinaryView imported but unused; consider using importlib.util.find_spec to test for availability
capa/main.py:561:13: F821 Undefined name BinaryView
capa/main.py:561:26: F821 Undefined name binaryninja
Found 4 errors."

These errors seem to refer to the "import_binja" function, which imports "binaryninja" and "BinaryView"; and the "handle_binja_backend" function, which uses "binaryninja" and "BinaryView."

The test seems to be intended to prevent the use of a function that is not for-sure imported. But the function that uses "binaryninja" and "BinaryView" is only called by the function that imports "binaryninja" and "BinaryView"; and the function that imports "binaryview" and "BinaryView" is always called by the function that uses those modules.

Therefore, even though this code fails the "code_style" test, it doesn't seem like this is necessarily an issue.

@mr-tz
Copy link
Collaborator

mr-tz commented Nov 8, 2023

Some changes here look good. The issue was initially meant to simplify the invocation and handling of samples around

  • identifying the format and
  • getting an extractor

Do you see potential for those and would you also consider #1821 while taking a look into this?

capa/main.py Outdated

if os_ == OS_AUTO and not is_supported_os(path):
raise UnsupportedOSError()
check_supported_format(path, os_)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconsidering, I don't feel like these additional functions necessarily make the code easier to read

especially here where it's not obvious that this may raise an exception...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not something I thought about before, but this is helpful to consider. Thank you!

capa/main.py Outdated
import capa.features.extractors.dnfile.extractor

return capa.features.extractors.dnfile.extractor.DnfileFeatureExtractor(path)
return handle_dotnet_format(format_)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or here where only a few lines are wrapped

So, I'd vote to leave this function as is.

@aaronatp
Copy link
Contributor Author

aaronatp commented Nov 8, 2023

Hi @mr-tz ,

Thanks for these comments, and for explaining what issue #1813 was meant to achieve.

I think that refactoring some of the '''get_extractor''' format-identifying components could potentially reduce the function's complexity. Looking at the original code that specifically handles the binja and viv extractors, the code looks reasonably simple, although I could be wrong. Are there particular parts of the code for identifying the format and getting the extractor that you think could be simplified? Are you referring to the invocation of '''get_extractor''' in the '''main''' function? 

I agree with what @williballenthin indicated in his proposals in #1821, which is that '''get_extractor''' and many of the helper functions that it relies on should probably be moved to a different file. Since '''get_extractor''' and the helper functions it relies on have a relatively specific goal (to return FeatureExtractor objects), it may even be worth moving '''get_extractor''' and its helper functions to a feature_extractor.py file. Are there other things about #1821 you think may be worth considering?

In addition, it may be a good idea to move the try-except statement in viv-extractor code to a new function. While the try-except clause is not especially complex, I found it a bit visually unwieldy, especially considering that the clause is within both a "with" and "if-else" statement.

Also, thank you for pointing out that the names of error handling functions should make clear that they may raise exceptions. This is not something that I had considered before, but I certainly see the value. I have changed the name of the '''is_supported_format''' function to '''check_unsupported_raise_exception''' to reflect this. I have also changed the name of '''import_binja''' to '''attempt_binja_import''', and created '''attempt_save_workspace''', to reflect the error handling abilities of these functions.

Please let me know if you have any more questions or comments about these edits, or if you think I should make further enhancements.

Best,

Aaron

capa/main.py Outdated

if os_ == OS_AUTO and not is_supported_os(path):
raise UnsupportedOSError()
check_unsupported_raise_exception(path, os_)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd propose to just leave the code verbatim here instead of in a new function. Or do you see much benefit added by the function?

capa/main.py Outdated
def handle_binja_backend(path: Path, disable_progress: bool) -> FeatureExtractor:
import capa.features.extractors.binja.extractor

attempt_binja_import()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can also stay in here and likely resolves the ruff error?

capa/main.py Outdated

if should_save_workspace:
logger.debug("saving workspace")
attempt_save_workspace(vw)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also simple enough to just leave here

@mr-tz
Copy link
Collaborator

mr-tz commented Nov 10, 2023

Thanks for the details and code suggestions!!

Are you referring to the invocation of '''get_extractor''' in the '''main''' function?

Yes, that's what I meant originally. Can we wrap the calling and error handling to get an extractor?

Are there other things about #1821 you think may be worth considering?

I'd propose to start with what's listed in proposal 3 and we can further split the functions up if needed.

Overall, I still think we don't need to split the function in so many sub parts but would be very happy to see the refactor into a separate module here.

@aaronatp
Copy link
Contributor Author

Hi @mr-tz, Thank you for your comments and thoughts! Do you think that using a decorator would wrap the calling and error handling as you imagine? I have drafted a decorator to perform this error handling and submitted it for review. In addition to cleaning up the error handling for '''get_extractor''', a decorator may help to avoid some of the boilerplate code used for logging errors in capa/helpers.py. However, one benefit of the original code is that it is clear that the function returns if an exception is raised. This can allow someone to easily follow the code flow. With a decorator, the error handling is largely hidden, but this may make it more difficult for someone reading through main.py to get a good sense of what the program is doing - in addition to looking at main.py, they will have to look at where the decorator is defined. As a result, this may be a drawback of using a decorator to perform error handling for '''get_extractor'''. Please let me know if this is more similar to what you hoped to achieve with #1813. If not, I would be very happy to change it and make improvements!

Would you also be able to say a little more about how #1821 connects to refactoring the error handling for '''get_extractor'''? In particular, is there a certain way you think #1821 could influence the refactoring of the error handling? Unfortunately, because I am fairly new to contributing to large codebases, I don't have a strong understanding yet of some of these matters. I really appreciate your patience and help thus far!

@mr-tz
Copy link
Collaborator

mr-tz commented Nov 12, 2023

Nice idea, I think a decorator could work well here. I also like the potential simplification around the log_unsupported_* functions.

Let's focus this PR on these changes (rollback the first 5 commits - for now) and then work on changes related to #1821 separately as they're not directly related. Sorry for the confusion :)

capa/helpers.py Outdated
Comment on lines 130 to 134
e_list, return_values = [(UnsupportedFormatError E_INVALID_FILE_TYPE),
(UnsupportedArchError, E_INVALID_FILE_ARCH),
(UnsupportedOSError, E_INVALID_FILE_OS)]

messsage_list = [ # UnsupportedFormatError
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these go into the same list of tuples?

)
logger.error(" If you don't know the input file type, you can try using the `file` utility to guess it.")
logger.error("-" * 80)
def exceptUnsupportedError(func):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use snake case names and maybe rename this to catch_log_return_errors or similar?

then let's use via @<decorator_name> (see, e.g., https://rinaarts.com/declutter-python-code-with-error-handling-decorators/)

@aaronatp
Copy link
Contributor Author

Hi @mr-tz thanks for this feedback! I have made the changes you suggested. Please let me know if you think the list unpacking looks good. I am unsure if the list unpacking is very readable with the message tuples included. I have put the decorator above the function definition. I don't know if it's possible to place the decorator above the function invocation and also assign the function output to a variable - it seems like this would be good because the potential error handling would be visible in the '''main''' function.

Also, no worries regarding #1821! I will rollback the first 5 commits. Thank you for your feedback so far! Please let me know if you have any more comments or suggestions!

This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
@aaronatp
Copy link
Contributor Author

Hi @mr-tz I reverted the first 5 or so commits - is this how I should roll back those commits? I am not sure if my reverting worked. If not, I was considering deleting this branch and creating a fresh one with just the most recent commits. Do you think that would be a sensible option?

This reverts commit d649897.
This reverts commit d649897.
This reverts commit d649897.
@ghost
Copy link

ghost commented Nov 13, 2023

Yes, I think creating a new branch may be the easier option at this point.

@aaronatp
Copy link
Contributor Author

@mr-tz great will do, thank you.

@mr-tz
Copy link
Collaborator

mr-tz commented Nov 13, 2023

Thanks, #1845 supersedes this so feel free to close it here.
On first glance the new PR looks good, will review in more detail later.

@aaronatp
Copy link
Contributor Author

@mr-tz sounds good, will do! Thank you!

@aaronatp aaronatp closed this Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants