relation extraction

testak edited this page Aug 31, 2017 · 3 revisions

Relation Extraction

A library that creates vertices and edges from a document annotated with cyber-entity labels. It uses a set of SVMs and feature models to predict relationships between these cyber entities.

Entity Types

  • Software
    • Vendor
    • Product
    • Version
  • File
    • Name
  • Function
    • Name
  • Vulnerability
    • Name
    • Description
    • CVE
    • MS

Relationship Types

  • ExploitTargetRelatedObservable Edge

      Exploit Target (e.g. vulnerability) --> Observable (e.g. software)
    
  • Sub-Observable Edge

      Observable (e.g. software) --> Observable (e.g. file)
    
  • Software, File, Function, Vulnerability Vertices

      All properties of a given software, file, function, or vulnerability entity are stored on the same vertex
      
      Example: "... **MS15-035**, which addresses a **remote code execution** bug ..."
      "MS15-035" is extracted as a vulnerability MS property, and "remote code execution" is extracted as a vulnerability description property. This type of relationship indicates that both properties are describing the same vulnerability object.
    
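As a minimal sketch of the vertex case above, the two labelled spans from the example sentence can be folded into one vulnerability vertex. The helper name `merge_properties` and the property keys `ms` and `description` are assumptions made for illustration; the library's actual key names may differ.

```python
# Hypothetical helper: not part of the library's API.
def merge_properties(vertex_type, labelled_spans):
    """Collect (property, text) pairs that describe the same object into one vertex."""
    vertex = {"vertexType": vertex_type}
    for prop, text in labelled_spans:
        vertex[prop] = text
    return vertex

# Spans taken from the example sentence; the key names are assumed.
spans = [("ms", "MS15-035"), ("description", "remote code execution")]
vulnerability = merge_properties("vulnerability", spans)
print(vulnerability)
```

Both properties end up on a single vertex, so no edge is needed between them.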

Input

  • The output of the Entity-Extractor as an Annotation object, which holds the document's sentences and their words, along with each word's part-of-speech tag and cyber-domain label
  • The name of the document's source, as a String
  • The document's title, as a String
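
To make the shape of these three inputs concrete, here is a rough Python model of them. It is only an illustration: the class names `Token` and `AnnotatedDocument`, the label strings such as `sw.vendor`, and the `O` label for unlabelled words are all assumptions, and the real input is the Entity-Extractor's Annotation object rather than these classes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    word: str   # surface form of the word
    pos: str    # part-of-speech tag
    label: str  # cyber-domain label (assumed: "O" means no label)

@dataclass
class AnnotatedDocument:
    sentences: List[List[Token]]  # one list of tokens per sentence
    source: str                   # name of the document's source
    title: str                    # title of the document

def labelled_tokens(document):
    """Return every token that carries a cyber-domain label."""
    return [t for sentence in document.sentences for t in sentence if t.label != "O"]

doc = AnnotatedDocument(
    sentences=[[
        Token("Microsoft", "NNP", "sw.vendor"),
        Token("Windows", "NNP", "sw.product"),
        Token("XP", "NNP", "sw.product"),
        Token("crashed", "VBD", "O"),
    ]],
    source="CNN",
    title="Example title",
)
print([t.word for t in labelled_tokens(doc)])
```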

Current Process

  • A pre-trained Word2Vec model
  • Pre-trained SVM models, one for each relationship and each order in which the entities appear
  • Pre-generated feature maps, one for each relationship and each order in which the entities appear
  • NVD XML files, which are used to find examples of the relationships
  • For each annotated document:
    1. Use NVD files to find known examples of relationships in document
    2. Use Word2Vec model to encode each token of the document
    3. Use feature maps to generate feature vectors for each token of the document
    4. Use pre-trained SVM models with the document's feature vectors to predict relationships between cyber entities
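
Steps 2 to 4 above can be sketched in a few lines of plain Python. Everything here is a toy stand-in: the embedding table replaces the Word2Vec model, `feature_vector` replaces the feature maps, and a linear decision function with made-up weights replaces the pre-trained SVMs (whose actual kernel and features are not described on this page). Step 1, the NVD lookup, is left out.

```python
# Toy stand-in for the Word2Vec model: word -> vector. Values are made up.
EMBEDDINGS = {"MS15-035": [1.0, 0.0], "Windows": [0.0, 1.0]}
UNKNOWN = [0.0, 0.0]

def encode(tokens):
    """Step 2: look up a vector for each token, falling back to zeros."""
    return [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]

def feature_vector(first_vec, second_vec):
    """Step 3 (simplified): concatenate the entity vectors in order of appearance."""
    return first_vec + second_vec

def svm_decision(weights, bias, features):
    """Linear SVM decision value: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict_relation(weights, bias, features):
    """Step 4: a positive decision value means the relationship is predicted."""
    return svm_decision(weights, bias, features) > 0

# Hypothetical model for "vulnerability appears before software".
weights, bias = [1.0, 0.0, 0.0, 1.0], -1.5

vuln_vec, sw_vec = encode(["MS15-035", "Windows"])
print(predict_relation(weights, bias, feature_vector(vuln_vec, sw_vec)))
print(predict_relation(weights, bias, feature_vector(sw_vec, vuln_vec)))
```

The second call feeds the same entities in the opposite order and gets a different feature vector, which is one plausible reason for keeping a separate model per order of appearance.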

Note: Refer to the relation-bootstrap repo for more information on the process.

Output

  • A JSON-formatted subgraph of the vertices and edges is created, which loosely resembles the STIX data model

     {
     	"vertices": {
     		"1235": {
     			"name": "1235",
     			"vertexType": "software",
     			"product": "Windows XP",
     			"vendor": "Microsoft",
     			"source": "CNN"
     		},
     		...
     		"1240": {
     			"name": "file.php",
     			"vertexType": "file",
     			"source": "CNN"
     		}
     	},
     	"edges": [
     		{
     			"inVertID": "1237",
     			"outVertID": "1238",
     			"relation": "ExploitTargetRelatedObservable"
     		},
     		{
     			"inVertID": "1240",
     			"outVertID": "1239",
     			"relation": "Sub-Observable"
     		}
     	]
     }
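
A consumer of this output might walk the edges and resolve each endpoint against the vertex map, roughly as follows. The small subgraph below is a made-up, complete example in the same shape as the sample above (the IDs `1` and `2` are hypothetical). The sketch assumes each edge runs from `outVertID` to `inVertID`, which is consistent with the Sub-Observable sample, where the file vertex is the `inVertID`.

```python
import json

# Hypothetical subgraph in the same shape as the sample output.
subgraph_json = """
{
  "vertices": {
    "1": {"name": "1", "vertexType": "software", "product": "Windows XP",
          "vendor": "Microsoft", "source": "CNN"},
    "2": {"name": "file.php", "vertexType": "file", "source": "CNN"}
  },
  "edges": [
    {"inVertID": "2", "outVertID": "1", "relation": "Sub-Observable"}
  ]
}
"""

def describe_edges(subgraph):
    """Render each edge as 'source type -[relation]-> target type'."""
    vertices = subgraph["vertices"]
    lines = []
    for edge in subgraph["edges"]:
        src = vertices[edge["outVertID"]]  # assumed: edge leaves this vertex
        dst = vertices[edge["inVertID"]]   # assumed: edge enters this vertex
        lines.append(f'{src["vertexType"]} -[{edge["relation"]}]-> {dst["vertexType"]}')
    return lines

subgraph = json.loads(subgraph_json)
for line in describe_edges(subgraph):
    print(line)
```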
    