ghidra: improve instruction string and bytes feature extraction #1885

Merged
merged 4 commits into from
Dec 25, 2023

Conversation

mike-hunhoff
Collaborator

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

Collaborator

@mr-tz mr-tz left a comment

two questions, but overall looks good

else:
break

yield to_addr
Collaborator

is the yield only per reference intended here?

where we expect multiple references from an instruction?

Collaborator Author

@mike-hunhoff mike-hunhoff Nov 30, 2023

is the yield only per reference intended here?

Yes - the intention is to follow nested pointers until either data (as defined by Ghidra) or max_depth is reached, which results in a single yield per memory reference from an instruction.

where we expect multiple references from an instruction?

Yes - an instruction may have multiple references from it. Here we restrict our search to memory references that reference data (as defined by Ghidra).
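
The behavior described in this reply can be sketched without Ghidra. This is a minimal illustration under stated assumptions, not the code from this PR: `Ref`, `follow_data_references`, and the `pointers` mapping are hypothetical stand-ins for Ghidra's reference and data objects, and `max_depth` is assumed here to cap the number of pointer hops.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for one reference from an instruction."""

    to_addr: int
    is_data: bool


def follow_data_references(
    refs: Iterable[Ref],
    pointers: Dict[int, Optional[int]],
    max_depth: int = 10,
) -> Iterator[int]:
    """Yield one final address per data reference.

    `pointers` maps an address that holds a pointer to the address it
    points at; None models a pointer whose value could not be read.
    """
    for ref in refs:
        if not ref.is_data:
            # only data references are of interest
            continue

        to_addr = ref.to_addr
        for _ in range(max_depth):
            if to_addr not in pointers:
                # reached non-pointer data: stop following
                break
            target = pointers[to_addr]
            if target is None:
                # unreadable pointer value: keep the last known address
                break
            to_addr = target

        # single yield per reference, after the chain has been followed
        yield to_addr
```

With `pointers = {0x1000: 0x2000, 0x2000: 0x3000}`, a data reference to `0x1000` yields `0x3000` exactly once, and a non-data reference yields nothing. The depth cap also guarantees termination if pointers form a cycle.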

Collaborator

Ok, I think the IDA equivalent works differently, or am I confusing the functions?

What's an example of an instruction with multiple data references?

Collaborator

perhaps symbolic analysis might resolve a set of candidate references for an indirect read/write like mov [eax], 1?

also, perhaps ghidra supports an architecture with a complex memory-to-memory copy? (just guessing).

Collaborator Author

For example,

MOV dword ptr [EBP + -0x8],0x469be6
	op[0]: RefType.WRITE
	op[1]: RefType.DATA

Ghidra tracks many different reference types, so to be complete we must check them all for each instruction. This level of reference tracking can be really useful.

@williballenthin
Collaborator

i love the algorithms and methods that are being created here for capa + ghidra! i hope they serve as good documentation for the future

Collaborator

@mr-tz mr-tz left a comment

👍

@mike-hunhoff
Collaborator Author

@colton-gabertan when you have a moment can you review and see if I missed anything obvious with these changes?

@colton-gabertan colton-gabertan merged commit 22f4251 into master Dec 25, 2023
27 checks passed
@colton-gabertan colton-gabertan deleted the fix/ghidra/RuntimeError branch December 25, 2023 02:24
4 participants