Stucco Architecture

Architecture diagram

Building

  • Dev-Setup: will set up the test and demonstration environment for Stucco
  • Demo: will set up the demonstration environment for Stucco using Packer

Collection

Description

The collectors pull data or process data streams and push the collected data (documents) into the message queue. Each type of collector is independent of others. The collectors can be implemented in any language.

Collectors can send messages with or without data. For messages without data, the collector adds the document to the document store and attaches the returned id to the message.

Collectors can either be stand-alone and run on any host, or be host-based and designed to collect data specific to that host.

Collector Types

Web collector

Web collectors pull a document via HTTP/HTTPS given a URL. Documents will be decompressed, but no other processing will occur.

Content format

Various (e.g. HTML, XML, CSV).

Scraping collector

Scrapers pull data embedded within a web page via HTTP/HTTPS given a URL and an HTML pattern.

Content format

HTML.

RSS collector

RSS collectors pull an RSS/ATOM feed via HTTP/HTTPS given a URL.

Content format

XML.

Twitter collector

Twitter collectors pull Tweet data via HTTP from the Twitter Search REST API given a user (@username), hashtag (#keyword), or search term.

Content format

JSON.

Netflow collector

Netflow collectors collect from Argus. The collector listens for Argus streams using the ra tool, converts them to XML, and pipes the flow data to the message queue as a string.

Content format

String.

Host-based collectors

Host-based collectors collect data from an individual host using agents.

Host-based collectors should be able to collect and forward:

  • System logs
  • Hone data
  • Installed packages
Content format

If we are writing the collector, JSON. If not, whatever format the agent uses.

State

Stand-alone collectors may require state, which should be stored with the scheduler (e.g. the last time a site was downloaded). Host-based collectors may also need to store state (e.g. when the last collection was run).

Post-Processing

Even after collection has taken place, the content may require additional handling. For example, the NVD source is tar'd and gzipped. We specifically provide a post-processing method that will untar and unzip the file before it is sent further along the pipeline. We support the following post-processing actions on the content:

  • unzip: uncompresses the content, first determining the compression type from the file extension (.gz, .bz2, etc.)
  • tar-unzip: untars the file's contents in addition to uncompressing them
  • removeHTML: applies the Boilerpipe process to the content to extract the base text, ignoring the boilerplate/template content of a web page. It also uses the Apache Tika library to extract the document's metadata. Recommended for use on all unstructured text sources.
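
As an illustration only (not the actual Stucco post-processing code), the following Go sketch shows the kind of work the unzip and tar-unzip actions perform, using only the standard library; the input file name is hypothetical.

package main

import (
    "archive/tar"
    "compress/gzip"
    "fmt"
    "io"
    "os"
)

// untarGzip strips the gzip compression from r, then walks the tar archive inside,
// printing each entry name. A real post-processor would hand each entry's content
// on to the rest of the pipeline instead of printing it.
func untarGzip(r io.Reader) error {
    gz, err := gzip.NewReader(r)
    if err != nil {
        return err
    }
    defer gz.Close()

    tr := tar.NewReader(gz)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            return nil // end of archive
        }
        if err != nil {
            return err
        }
        fmt.Println("extracted entry:", hdr.Name)
    }
}

func main() {
    f, err := os.Open("nvd-feed.tar.gz") // hypothetical file name
    if err != nil {
        panic(err)
    }
    defer f.Close()
    if err := untarGzip(f); err != nil {
        panic(err)
    }
}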

Input Transport Protocol

Input transport protocol will depend on the type of collector.

Input Format

Input format will depend on the type of collector.

Output Transport Protocol

Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ. See the concepts documentation for information about AMQP and RabbitMQ concepts. See the protocol documentation for more on AMQP. Examples below are in Go using the amqp package. Other libraries should implement similar interfaces.

The RabbitMQ exchange has an exchange-type of topic and an exchange-name of stucco.

The exchange declaration options should be:

"topic",    // type
true,       // durable
false,      // auto-deleted
false,      // internal
false,      // noWait
nil,        // arguments

The publish options should be:

err = ch.Publish(
    "stucco",    // publish to an exchange named stucco
    routingKey,  // routing to 0 or more queues, e.g. "stucco.in.unstructured.bugtraq"
    false,       // mandatory
    false,       // immediate
    msg,         // the amqp.Publishing message described below
)

The <routingKey> format should be: stucco.in.<data-type>.<source-name>.<data-name (optional)>, where:

  • data-type (required): the type of data, either 'structured' or 'unstructured'
  • source-name (required): the source of the collected data, such as cve, nvd, maxmind, cpe, argus, hone.
  • data-name (optional): the name of the data, such as the hostname of the sensor.
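
As an example of the format, using the Bugtraq source shown later in the Schedule Format section (which is unstructured data), a collector written in Go might build its routing key as in the following sketch; the snippet is illustrative only.

package main

import "fmt"

func main() {
    dataType := "unstructured" // required
    sourceName := "bugtraq"    // required
    routingKey := fmt.Sprintf("stucco.in.%s.%s", dataType, sourceName)
    fmt.Println(routingKey) // prints "stucco.in.unstructured.bugtraq"; a data-name segment is optional
}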

The message options should be:

    msg := amqp.Publishing{
        DeliveryMode:    amqp.Persistent,  // 1=non-persistent, 2=persistent
        Timestamp:       time.Now(),
        ContentType:     "text/plain",
        ContentEncoding: "",
        Priority:        1,                // 0-9
        Headers:         amqp.Table{"HasContent": true}, // application-specific boolean (see below)
        Body:            payload,          // []byte: the document itself, or its id if HasContent is false
    }

DeliveryMode should be 'persistent'.

Timestamp should be automatically filled out by your amqp client library. If not, the publisher should specify.

ContentType should be "text/xml" or "text/csv" or "application/json" or "text/plain" (i.e. collectorType from the output format). This is dependent on the data source.

ContentEncoding may be required if things are, for example, gzipped.

Priority is optional.

HasContent is an application-specific part of the message header that defines whether or not there is content as part of the message. It should be defined in the message header field table using a boolean: HasContent: true (if there is data content) or HasContent: false (if the document service has the content). The spout will use the document service accordingly. This is the only application-specific data needed.

Body is the payload, either the document itself or the id if HasContent is false.

The corresponding binding keys for the queue defined in the spout can use wildcards to determine which spout should handle which messages:

  • * (star) can substitute for exactly one word.
  • # (hash) can substitute for zero or more words.

For example, stucco.in.# would listen for all input.
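
For illustration (this is a sketch, not part of the Stucco codebase), a spout-side consumer using the same Go amqp package might declare and bind its queue as follows; the connection URL and queue name are assumptions.

package main

import "github.com/streadway/amqp"

func main() {
    // Connection URL and queue name are placeholders for this sketch.
    conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    ch, err := conn.Channel()
    if err != nil {
        panic(err)
    }

    q, err := ch.QueueDeclare("stucco-spout", true, false, false, false, nil)
    if err != nil {
        panic(err)
    }

    // Bind with a wildcard key so this spout receives all input messages.
    if err := ch.QueueBind(q.Name, "stucco.in.#", "stucco", false, nil); err != nil {
        panic(err)
    }

    msgs, err := ch.Consume(q.Name, "", false, false, false, false, nil) // manual ack
    if err != nil {
        panic(err)
    }
    for m := range msgs {
        // Process the message, then acknowledge so the queue can release it.
        m.Ack(false)
    }
}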

Output Format

There are two types of output messages: (1) messages with data and (2) messages without data that reference an ID in the document store.


Scheduler

Description

The Scheduler is a Java application that uses the Quartz Scheduler library for running the schedule. The Scheduler instantiates and runs collectors at the scheduled times. The schedule is specified in a configuration file.

From a narrow implementation perspective, that's all the Scheduler does. However, from a broader architectural perspective, it makes sense to discuss major aspects of collection control together. Accordingly, we discuss configuration options and redundancy control here, even though most of their actual implementation is part of the collectors.

Configuration

The schedule is maintained in the main Stucco configuration file, stucco.yml. In normal operation, the schedule is loaded into the etcd configuration service, and the Scheduler reads it from there. For development and testing purposes, the schedule can also be read directly from file.

Running

The Scheduler's main class is gov.pnnl.stucco.utilities.CollectorScheduler. It recognizes the following switches:

  • -section: tells the Scheduler which section of the configuration to use. It is currently a required switch and should be specified as "-section demo-load".
  • -file: tells the Scheduler to read the collector configuration from the given YAML file, typically stucco.yml.
  • -url: tells the Scheduler to read the collector configuration from the etcd service's URL, which will typically be http://10.10.10.100:4001/v2/keys/ (the actual IP may vary depending on your setup). Alternatively, inside the VM, you can use localhost instead of the IP.
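
For example, assuming the Scheduler classes are packaged in a jar named stucco-scheduler.jar (an illustrative name, not necessarily the actual artifact), an invocation against the demo configuration file might look like:

java -cp stucco-scheduler.jar gov.pnnl.stucco.utilities.CollectorScheduler -section demo-load -file stucco.yml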

Schedule Format

Each exogenous collector’s configuration contains information about how and when to collect a source. Example from a configuration file:

default:
…
  scheduler:
    collectors:
      -
        source-name: Bugtraq
        type: PSEUDO_RSS
        data-type: unstructured
        source-URI: http://www.securityfocus.com/vulnerabilities
        content-type: text/html
        crawl-delay: 2
        entry-regex: 'href="(/bid/\d+)"'
        tab-regex: 'href="(/bid/\d+/(info|discuss|exploit|solution|references))"'
        next-page-regex: 'href="(/cgi-bin/index\.cgi\?o[^"]+)">Next &gt;<'
        cron: 0 0 23 * * ?
        now-collect: all

source-name

The name of the source, used primarily as a key for RT.

type

The type key specifies the primary kind of collection for a source. Here's one way to categorize the types.

Generic Collectors

Collectors used to handle the most common cases:

  • RSS: An RSS feed
  • PSEUDO_RSS: A Web page acting like an RSS feed, potentially with multiple pages, multiple entries per page, and multiple subpages (tabs) per entry. This uses regular expressions to scrape the URLs it needs to traverse.
  • TABBED_ENTRY: A Web page with multiple subpages (tabs). In typical use, this will be a delegate for one of the above collectors, and won't be scheduled directly.
  • WEB: A single Web page. In typical use, this will be a delegate for one of the above collectors, and won't be scheduled directly.
Site-Specific Collectors

Collectors custom-developed for a specific source:

  • NVD: The National Vulnerability Database
  • BUGTRAQ: The Bugtraq pseudo-RSS feed. (Deprecated) Use PSEUDO_RSS.
  • SOPHOS: The Sophos RSS feed. (Deprecated) Use RSS with a tab-regex.
Disk-Based Collectors

Collectors used for test/debug, to "play back" previously-captured data:

  • FILE: A file on disk
  • FILEBYLINE: A file, treated as one document per line
  • DIRECTORY: A directory on disk
source-uri

The URI for a source.

crawl-delay

The minimum number of seconds to wait between requests to a site.

*-regex

The collectors use regular expressions (specifically Java regexes) to scrape additional links to traverse. There are currently keys for three kinds of links:

  • entry-regex: In a PSEUDO_RSS feed, this regex is used to identify the individual entries.
  • tab-regex: In an RSS or PSEUDO_RSS feed, this regex is used to identify the subpages (tabs) of a page.
  • next-page-regex: In a PSEUDO_RSS feed, this regex is used to identify the next page of entries.
cron

When to collect is specified in the form of a Quartz scheduler cron expression.

CAUTION: Quartz's first field is SECONDS, not MINUTES as in some crons. There are seven whitespace-delimited fields (six required, one optional):

s m h D M d [Y]

These are seconds, minutes, hours, day of month, month, day of week, and year.

  • Specify * to mean "every".
  • Exactly one of the D/d fields must be specified as ? to indicate it isn't used.

For example, the expression 0 0 23 * * ? in the configuration above fires at 23:00:00 every day. In addition, we support specifying a cron expression of now, to mean "immediately run once".

now-collect

The now-collect configuration key is intended as an improvement on the now cron option, offering more nuanced control over scheduler start-up behavior. This key can take the following values:

  • all: Collect as much as possible, skipping URLs already collected
  • new: Collect as much as possible, but stop once we find a URL that's already collected
  • none: Collect nothing; just let the regular schedule do it

Reducing Redundant Collection

Most of the Scheduler consists of fairly straightforward use of Quartz. The one area that is slightly more complicated is the logic used to try to prevent or at least reduce redundant collection and messaging. We’re trying to avoid collecting pages that haven’t changed since the last collection. Sometimes we may not have sufficient information to avoid such redundant collection, but we can still try to detect the redundancy and avoid re-messaging the content to the rest of Stucco.

Our strategy is to use built-in HTTP features to prevent redundant collection where possible, and to use internal bookkeeping to detect redundant collection when it does happen. We implement this strategy using the following tactics:

  • We use HTTP HEAD requests to see if GET requests are necessary. In some cases the HEAD request will be enough to tell that there is nothing new to collect.
  • We make both HTTP HEAD and GET requests conditional, using HTTP’s If-Modified-Since and If-None-Match request headers. If-Modified-Since checks against a timestamp. If-None-Match checks against a previously returned response header called an ETag (entity tag). An ETag is essentially an ID of some sort, often a checksum.
  • We record a SHA-1 checksum of collected content, so we can check it for a match the next time. This is necessary because not all sites support the conditional checks. For a feed, the checksum is computed over the set of feed URLs.

Because of the timing of the various checks, they are conducted within the collectors.

The internal bookkeeping is currently kept in the CollectorMetadata.db file. Each entry is a whitespace-delimited line containing URL, last collection time, SHA-1 checksum, and UUID.
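
To make the tactics above concrete, the sketch below shows in Go roughly how a conditional request and a content checksum fit together. It is illustrative only, not the Scheduler's actual Java code; the remembered header and checksum values are placeholders.

package main

import (
    "crypto/sha1"
    "encoding/hex"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Values remembered from the previous collection; placeholders for this sketch.
    lastModified := "Tue, 26 Sep 2017 00:00:00 GMT"
    lastETag := `"abc123"`
    lastChecksum := "" // SHA-1 hex digest recorded last time

    req, err := http.NewRequest("GET", "http://www.securityfocus.com/vulnerabilities", nil)
    if err != nil {
        panic(err)
    }
    // Conditional request: the server may answer 304 Not Modified and save the transfer.
    req.Header.Set("If-Modified-Since", lastModified)
    req.Header.Set("If-None-Match", lastETag)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusNotModified {
        fmt.Println("nothing new; skip collection")
        return
    }

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    // Even when the server re-sends content, a SHA-1 checksum catches unchanged pages.
    sum := sha1.Sum(body)
    if hex.EncodeToString(sum[:]) == lastChecksum {
        fmt.Println("content unchanged; do not re-message it")
        return
    }
    fmt.Println("new content collected:", len(body), "bytes")
}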

State

The Scheduler runs the schedule as expected, controlling when the collectors execute. Other aspects of collection control are less complete, and need improvements in the following areas:

  • Exception Handling. Minor exceptions during collection are generally ignored. However, no attempt is yet made to deal with more serious exceptions. In particular, no attempt is made to ensure that the metadata recording, document storage, and message sending are performed in a transactional manner. The Scheduler does have a shutdown hook so it can attempt to exit gracefully for planned shutdowns.
  • Collector Metadata Storage. This is currently implemented strictly as proof-of-principle. Metadata is stored to a flat file, requiring constant re-loading and re-writing of the entire file. We know this won't scale, and plan to migrate to an embedded database.
  • Leveraging robots.txt. The code does not currently read a site's robots.txt file. It should do so in order to determine the throttling setting, as well as know if it should avoid collection of some files. Currently, we can honor these in the configuration file by using the crawl-delay setting and by only specifying URLs that are fair game.

Message Queue (MQ)

Description

The message queue accepts input (documents) from the collectors and pushes the documents into the processing pipeline. The message queue is implemented with RabbitMQ, which implements the AMQP standard.

Configuration

The queue should hold messages until the processing pipeline acknowledges receipt.

Protocol

Input and output protocol is AMQP 0-9-1.

Format

The message queue should pass on the data as is from collectors.


RT

Description

RT is the Real-time processing component of Stucco.

The data it receives will be transformed into a subgraph, consistent with the STIX 1.X format, and then aligned with the knowledge graph.

Input Transport Protocol

Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ.

Input Format

There are two types of messages: (1) messages with data and (2) messages without data that reference an ID in the document store.

RT will send an acknowledgement to the queue when the messages are received, so that the queue can release these resources.

Output Transport Protocol

Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses a JDBC driver to execute SQL statements.

Output Format

Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses SQL statements.

RT Components

Message Queue

Description

The message queue component, a RabbitMQ consumer, pulls messages off the queue based on the routing key contained in the message.

Input Transport Protocol

Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ.

Input Format

See Collector's Output Format

Output Transport Protocol

Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ.

Output Format

JSON object with the following fields:

  • source (string) - the routing key
  • timestamp (long) - the timestamp indicating when the message was collected
  • contentIncl (boolean) - indicates if the data is included in the message
  • message (string) - the data, if included in the message; the document id to retrieve the data, otherwise
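
For illustration, the same fields could be modeled in Go as the struct below. The actual RT consumer is written in Java, so this is only a sketch of the shape of the message.

package rt

// QueueMessage sketches the JSON object handed to the rest of RT.
type QueueMessage struct {
    Source      string `json:"source"`      // the routing key
    Timestamp   int64  `json:"timestamp"`   // when the message was collected
    ContentIncl bool   `json:"contentIncl"` // true if the data is included in the message
    Message     string `json:"message"`     // the data, or the document id to retrieve it
}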

Entity Extraction

Description

The Entity Extractor obtains an unstructured document's content either from the message, or by requesting the document from the document-service. The document content is then annotated with cyber-domain concepts.

Input Format

Two Java Strings:

  • document title
  • document text content
Output Format

Annotated document object (https://nlp.stanford.edu/nlp/javadoc/javanlp/Annotation) with the following information:

  • Text: original raw text
  • Sentences: list of sentences
    • Sentence: map representing one sentence
      • Token: word within the sentence
      • POSTag: part-of-speech tag
      • CyberEntity: cyber domain label for the token
    • ParseTree: sentence structure as a tree

Relation Extraction

Description

The Relation Extractor discovers relationships between the concepts and constructs a subgraph of this knowledge.

Input Format

Java String representing the data source, and an Annotated document object (https://nlp.stanford.edu/nlp/javadoc/javanlp/Annotation) with the following information:

  • Text: original raw text
  • Sentences: list of sentences
    • Sentence: map representing one sentence
      • Token: word within the sentence
      • POSTag: part-of-speech tag
      • CyberEntity: cyber domain label for the token
    • ParseTree: sentence structure as a tree
Output Format

A JSON-formatted subgraph of the vertices and edges, which loosely resembles the STIX data model.

{
	"vertices": {
		"1235": {
			"name": "1235",
			"vertexType": "software",
			"product": "Windows XP",
			"vendor": "Microsoft",
			"source": "CNN"
		},
		...
		"1240": {
			"name": "file.php",
			"vertexType": "file",
			"source": "CNN"
		}
	},
	"edges": [
		{
			"inVertID": "1237",
			"outVertID": "1238",
			"relation": "ExploitTargetRelatedObservable"
		},
		{
			"inVertID": "1240",
			"outVertID": "1239",
			"relation": "Sub-Observable"
		}
	]
}

STIX Extraction

Description

The STIXExtractors component transforms a structured document into its corresponding STIX subgraph. It also handles the output of an unstructured document once it has been transformed into a structured subgraph.

Input Format

Java String representing the data

Output Format

A JSON-formatted subgraph of the vertices and edges, which loosely resembles the STIX data model.

{
	"vertices": {
		"1235": {
			"name": "1235",
			"vertexType": "software",
			"product": "Windows XP",
			"vendor": "Microsoft",
			"source": "CNN"
		},
		...
		"1240": {
			"name": "file.php",
			"vertexType": "file",
			"source": "CNN"
		}
	},
	"edges": [
		{
			"inVertID": "1237",
			"outVertID": "1238",
			"relation": "ExploitTargetRelatedObservable"
		},
		{
			"inVertID": "1240",
			"outVertID": "1239",
			"relation": "Sub-Observable"
		}
	]
}

Alignment

Description

The Alignment component aligns and merges the new subgraph into the full knowledge graph.

Input Format

A JSON-formatted subgraph of the vertices and edges, which loosely resembles the STIX data model.

{
	"vertices": {
		"1235": {
			"name": "1235",
			"vertexType": "software",
			"product": "Windows XP",
			"vendor": "Microsoft",
			"source": "CNN"
		},
		...
		"1240": {
			"name": "file.php",
			"vertexType": "file",
			"source": "CNN"
		}
	},
	"edges": [
		{
			"inVertID": "1237",
			"outVertID": "1238",
			"relation": "ExploitTargetRelatedObservable"
		},
		{
			"inVertID": "1240",
			"outVertID": "1239",
			"relation": "Sub-Observable"
		}
	]
}
Output Format

JSON-formatted subgraph.

Graph Database Connection

Description

The graph database connection is an interface with specific implementations for each supported database. This interface implements reads from, and writes to, the knowledge graph.

Input Format

JSON-formatted subgraph. (See Alignment output format.)

Output Transport Protocol

Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses a JDBC driver to execute SQL statements.

Output Format

Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses SQL statements.


Document Service

Description

The document-service stores and makes available the raw documents. The backend storage is on the local filesystem.

Commands

Add Document

Be sure to set the content-type HTTP header to the appropriate type when adding documents (e.g. content-type: application/json for JSON data, or content-type: application/pdf for PDF files).

Routes:

  • POST server:port/document - add a document and autogenerate an id
  • POST server:port/document/id - add a document with a specific id

Get Document

The accept-encoding can be set to gzip to compress the communication (i.e., accept-encoding: application/gzip).

The accept header can be one of the following: application/json, text/plain, or application/octet-stream. Use application/octet-stream for PDF files and other binary data.

Routes:

  • GET server:port/document/id - retrieve a document based on the specific id
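
A corresponding retrieval sketch, again with placeholder host, port, and document id:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    id := "example-id" // placeholder document id
    req, err := http.NewRequest("GET", "http://localhost:8118/document/"+id, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Accept", "application/json") // or text/plain, or application/octet-stream for binary data

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    doc, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(doc))
}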

Input Transport Protocol

HTTP.

Input format

See Collector's Output Format

Output Transport Protocol

HTTP.

Output format

JSON.


Query Service

Description

The Query Service provides a RESTful web service, which communicates with the Graph Database Connection API to allow the UI and any third-party applications to interface with the knowledge graph, implemented in PostgreSQL.

The API provides functions that facilitate common operations (e.g. get a node by ID).

Routes

  • host:port/api/search Returns a list of all nodes that match the search query.
  • host:port/api/vertex/vertexType=<vertType>&name=<vertName>&id=<vertID>
    Returns the node with the specified <vertName> or <vertID>.
  • host:port/api/inEdges/vertexType=<vertType>&name=<vertName>&id=<vertID>
    Returns the in-bound edges to the specified node.
  • host:port/api/outEdges/vertexType=<vertType>&name=<vertName>&id=<vertID>
    Returns the out-bound edges to the specified node.
  • host:port/api/count/vertices Returns a count of all nodes in the knowledge graph.
  • host:port/api/count/edges Returns a count of all edges in the knowledge graph.
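
As a minimal sketch (with a placeholder host and port), counting the vertices from Go might look like:

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Placeholder host and port for the query service.
    resp, err := http.Get("http://localhost:8000/api/count/vertices")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    count, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println("vertex count:", string(count))
}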

Transport Protocol

HTTP.

Transport Format

JSON.
