arch v2
- Dev-Setup: will set up the test and demonstration environment for Stucco
- Demo: will set up the demonstration environment for Stucco using Packer
The collectors pull data or process data streams and push the collected data (documents) into the message queue. Each type of collector is independent of others. The collectors can be implemented in any language.
Collectors can either send messages with data or messages without data. For messages without data, the collector will add the document to the document store and attach the returned id
to the message.
Collectors can either be stand-alone and run on any host, or be host-based and designed to collect data specific to that host.
Web collectors pull a document via HTTP/HTTPS given a URL. Documents will be decompressed, but no other processing will occur.
Various (e.g. HTML, XML, CSV).
Scrapers pull data embedded within a web page via HTTP/HTTPS given a URL and an HTML pattern.
HTML.
RSS collectors pull an RSS/ATOM feed via HTTP/HTTPS given a URL.
XML.
Twitter collectors pull Tweet data via HTTP from the Twitter Search REST API given a user (@username), hashtag (#keyword), or search term.
JSON.
Netflow collectors will collect from Argus. The collector listens for Argus streams using the ra tool, converts them to XML, and sends the flow data to the message queue as a string.
String.
Host-based collectors collect data from an individual host using agents.
Host-based collectors should be able to collect and forward:
- System logs
- Hone data
- Installed packages
If we are writing the collector, JSON. If not, whatever format the agent uses.
Stand-alone collectors may require state (state should be stored with the scheduler, such as the last time a site was downloaded). Host-based collectors may need to store state (e.g. when the last collection was run).
Even after collection has taken place, the content may require additional handling. For example, the NVD source is tar'd and gzipped, so we provide a post-processing method that untars and unzips the file before it is sent further along the pipeline. We've added the following post-processing actions on the content (a sketch of the technique follows the list):
- unzip: uncompresses the content, first determining the compression type from the file extension (.gz, .bz2, etc.)
- tar-unzip: untars the file's contents prior to uncompressing the content
- removeHTML: applies the Boilerpipe process to the content to extract the base text, ignoring the boilerplate/template content in a web page. It also uses the Apache Tika library to extract the document's metadata. Recommended for use on all unstructured text sources.
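As an illustration of the unzip and tar-unzip actions (not the actual post-processor implementation), a Go sketch of decompressing and untarring a gzipped tarball such as the NVD source might look like this; the function name and return shape are assumptions:

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"io"
	"io/ioutil"
)

// extractTarGz illustrates the unzip + tar-unzip post-processing steps:
// it gunzips the collected content and returns the bytes of each regular
// file inside the tarball. This is only a sketch of the technique.
func extractTarGz(content []byte) (map[string][]byte, error) {
	gz, err := gzip.NewReader(bytes.NewReader(content))
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	files := make(map[string][]byte)
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // end of archive
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories and special entries
		}
		data, err := ioutil.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		files[hdr.Name] = data
	}
	return files, nil
}
```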
Input transport protocol will depend on the type of collector.
Input format will depend on the type of collector.
Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ. See the concepts documentation for information about AMQP and RabbitMQ concepts. See the protocol documentation for more on AMQP. Examples below are in Go using the amqp package. Other libraries should implement similar interfaces.
The RabbitMQ exchange has an exchange-type of topic and an exchange-name of stucco.
The exchange declaration options should be:
"topic", // type
true, // durable
false, // auto-deleted
false, // internal
false, // noWait
nil, // arguments
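A minimal sketch of this declaration in Go, using the streadway/amqp package; the broker URL is a placeholder:

```go
package main

import "github.com/streadway/amqp"

func main() {
	// Connect to RabbitMQ; the URL is a placeholder for your broker.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		panic(err)
	}
	defer ch.Close()

	// Declare the 'stucco' topic exchange with the options listed above.
	err = ch.ExchangeDeclare(
		"stucco", // exchange-name
		"topic",  // type
		true,     // durable
		false,    // auto-deleted
		false,    // internal
		false,    // noWait
		nil,      // arguments
	)
	if err != nil {
		panic(err)
	}
}
```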
The publish options should be:
stucco, // publish to an exchange named stucco
<routingKey>, // routing to 0 or more queues
false, // mandatory
false, // immediate
The <routingKey> format should be: stucco.in.<data-type>.<source-name>.<data-name (optional)>, where:
- data-type (required): the type of data, either 'structured' or 'unstructured'
- source-name (required): the source of the collected data, such as cve, nvd, maxmind, cpe, argus, hone.
- data-name (optional): the name of the data, such as the hostname of the sensor.
The message options should be:
DeliveryMode: 2, // 1=non-persistent, 2=persistent
Timestamp: time.Now(),
ContentType: "text/plain",
ContentEncoding: "",
Priority: 1, // 0-9
HasContent: true, // boolean
Body: <payload>,
DeliveryMode should be 'persistent'.
Timestamp should be automatically filled out by your amqp client library. If not, the publisher should specify it.
ContentType should be "text/xml", "text/csv", "application/json", or "text/plain" (i.e. the collectorType from the output format). This is dependent on the data source.
ContentEncoding may be required if the content is, for example, gzipped.
Priority is optional.
HasContent is an application-specific part of the message header that defines whether or not there is content as part of the message. It should be defined in the message header field table as a boolean: HasContent: true (if there is data content) or HasContent: false (if the document service has the content). The spout will use the document service accordingly. This is the only application-specific data needed.
Body is the payload: either the document itself, or the id if HasContent is false.
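Putting the publish options, routing key format, and message options together, a Go sketch might look like the following; the function, routing key, and payload are illustrative, and the channel is assumed to be the one declared earlier:

```go
package main

import (
	"time"

	"github.com/streadway/amqp"
)

// publishDocument sends one collected document to the 'stucco' exchange.
// The routing key and payload passed in are illustrative.
func publishDocument(ch *amqp.Channel, routingKey string, payload []byte, hasContent bool) error {
	return ch.Publish(
		"stucco",   // publish to an exchange named stucco
		routingKey, // e.g. "stucco.in.unstructured.cve"
		false,      // mandatory
		false,      // immediate
		amqp.Publishing{
			DeliveryMode:    amqp.Persistent, // persistent, as recommended above
			Timestamp:       time.Now(),
			ContentType:     "text/plain",
			ContentEncoding: "",
			Priority:        1,
			Headers:         amqp.Table{"HasContent": hasContent}, // application-specific header
			Body:            payload, // the document itself, or the document-store id if HasContent is false
		},
	)
}
```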
The corresponding binding keys for the queue defined in the spout can use wildcards to determine which spout should handle which messages:
- * (star) can substitute for exactly one word.
- # (hash) can substitute for zero or more words.
For example, stucco.in.# would listen for all input.
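On the consuming side, a sketch of binding a queue with that wildcard key might look like this; the queue name is a placeholder and the real spout implementation may differ:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

// consumeAll binds a queue to the 'stucco' exchange with the wildcard
// binding key "stucco.in.#" and acknowledges each delivery.
func consumeAll(ch *amqp.Channel) error {
	q, err := ch.QueueDeclare("stucco-spout", true, false, false, false, nil)
	if err != nil {
		return err
	}
	// '#' matches zero or more words, so this receives all input messages.
	if err := ch.QueueBind(q.Name, "stucco.in.#", "stucco", false, nil); err != nil {
		return err
	}
	deliveries, err := ch.Consume(q.Name, "", false, false, false, false, nil)
	if err != nil {
		return err
	}
	for d := range deliveries {
		log.Printf("routing key %s, HasContent=%v", d.RoutingKey, d.Headers["HasContent"])
		d.Ack(false) // acknowledge so the queue can release the message
	}
	return nil
}
```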
There are two types of output messages: (1) messages with data and (2) messages without data that reference an ID in the document store.
The Scheduler is a Java application that uses the Quartz Scheduler library for running the schedule. The Scheduler instantiates and runs collectors at the scheduled times. The schedule is specified in a configuration file.
From a narrow implementation perspective, that's all the Scheduler does. However, from a broader architectural perspective, it makes sense to discuss major aspects of collection control together. Accordingly, we discuss configuration options and redundancy control here, even though most of their actual implementation is part of the collectors.
The schedule is maintained in the main Stucco configuration file, stucco.yml. In normal operation, the schedule is loaded into the etcd configuration service, and the Scheduler reads it from there. For development and testing purposes, the schedule can also be read directly from file.
The Scheduler's main class is gov.pnnl.stucco.utilities.CollectorScheduler. It recognizes the following switches:
- -section: This tells the Scheduler what section of the configuration to use. It is currently a required switch and should be specified as "-section demo-load".
- -file: This tells the Scheduler to read the collector configuration from the given YAML file, typically stucco.yml.
- -url: This tells the Scheduler to read the collector configuration from the etcd service's URL, which will typically be http://10.10.10.100:4001/v2/keys/ (the actual IP may vary depending on your setup). Alternatively, inside the VM, you can use localhost instead of the IP.
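For example, assuming the Scheduler classes and their dependencies are on the classpath (the classpath setup is omitted here), reading the schedule from a local file might look like: java gov.pnnl.stucco.utilities.CollectorScheduler -section demo-load -file stucco.yml.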
Each exogenous collector’s configuration contains information about how and when to collect a source. Example from a configuration file:
default:
  …
  scheduler:
    collectors:
      - source-name: Bugtraq
        type: PSEUDO_RSS
        data-type: unstructured
        source-URI: http://www.securityfocus.com/vulnerabilities
        content-type: text/html
        crawl-delay: 2
        entry-regex: 'href="(/bid/\d+)"'
        tab-regex: 'href="(/bid/\d+/(info|discuss|exploit|solution|references))"'
        next-page-regex: 'href="(/cgi-bin/index\.cgi\?o[^"]+)">Next ><'
        cron: 0 0 23 * * ?
        now-collect: all
source-name
The name of the source, used primarily as a key for RT.
type
The type key specifies the primary kind of collection for a source. Here's one way to categorize the types.
Collectors used to handle the most common cases:
- RSS: An RSS feed
- PSEUDO_RSS: A Web page acting like an RSS feed, potentially with multiple pages, multiple entries per page, and multiple subpages (tabs) per entry. This uses regular expressions to scrape the URLs it needs to traverse.
- TABBED_ENTRY: A Web page with multiple subpages (tabs). In typical use, this will be a delegate for one of the above collectors, and won't be scheduled directly.
- WEB: A single Web page. In typical use, this will be a delegate for one of the above collectors, and won't be scheduled directly.
Collectors custom-developed for a specific source:
- NVD: The National Vulnerability Database
- BUGTRAQ: The Bugtraq pseudo-RSS feed. (Deprecated) Use PSEUDO_RSS.
- SOPHOS: The Sophos RSS feed. (Deprecated) Use RSS with a tab-regex.
Collectors used for test/debug, to "play back" previously-captured data:
- FILE: A file on disk
- FILEBYLINE: A file, treated as one document per line
- DIRECTORY: A directory on disk
source-URI
The URI for a source.
crawl-delay
The minimum number of seconds to wait between requests to a site.
The collectors use regular expressions (specifically Java regexes) to scrape additional links to traverse. There are currently keys for three kinds of links:
- entry-regex: In a PSEUDO_RSS feed, this regex is used to identify the individual entries.
- tab-regex: In an RSS or PSEUDO_RSS feed, this regex is used to identify the subpages (tabs) of a page.
- next-page-regex: In a PSEUDO_RSS feed, this regex is used to identify the next page of entries.
When to collect is specified in the form of a Quartz scheduler cron expression.
CAUTION: Quartz's first field is SECONDS, not MINUTES like some crons. There are seven whitespace-delimited fields (six required, one optional):
s m h D M d [Y]
These are seconds, minutes, hours, day of month, month, day of week, and year
- Specify * to mean “every”
- Exactly one of the D/d fields must be specified as ? to indicate it isn't used
In addition, we support specifying a cron expression of now, to mean “immediately run once”.
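For example, the expression 0 0 23 * * ? from the configuration example above fires at second 0, minute 0, hour 23 (11:00 PM) every day of every month, with the day-of-week field marked as unused.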
now-collect
The now-collect configuration key is intended as an improvement on the now cron option, offering more nuanced control over scheduler start-up behavior. This key can take the following values:
- all: Collect as much as possible, skipping URLs already collected.
- new: Collect as much as possible, but stop once we find a URL that's already collected.
- none: Collect nothing; just let the regular schedule do it.
Most of the Scheduler consists of fairly straightforward use of Quartz. The one area that is slightly more complicated is the logic used to try to prevent or at least reduce redundant collection and messaging. We’re trying to avoid collecting pages that haven’t changed since the last collection. Sometimes we may not have sufficient information to avoid such redundant collection, but we can still try to detect the redundancy and avoid re-messaging the content to the rest of Stucco.
Our strategy is to use built-in HTTP features to prevent redundant collection where possible, and to use internal bookkeeping to detect redundant collection when it does happen. We implement this strategy using the following tactics:
- We use HTTP HEAD requests to see if GET requests are necessary. In some cases the HEAD request will be enough to tell that there is nothing new to collect.
- We make both HTTP HEAD and GET requests conditional, using HTTP’s If-Modified-Since and If-None-Match request headers. If-Modified-Since checks against a timestamp. If-None-Match checks against a previously returned response header called an ETag (entity tag). An ETag is essentially an ID of some sort, often a checksum.
- We record a SHA-1 checksum of collected content, so we can check it for a match the next time. This is necessary because not all sites run the conditional checks. For a feed, the checksum is performed on the set of feed URLs.
Because of the timing of the various checks, they are conducted within the collectors.
The internal bookkeeping is currently kept in the CollectorMetadata.db file. Each entry is a whitespace-delimited line containing URL, last collection time, SHA-1 checksum, and UUID.
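The collectors themselves are Java, but as an illustration of the tactics above, a Go sketch of a conditional request followed by a SHA-1 comparison might look like this; the URL, stored ETag, and stored checksum are placeholders:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"io/ioutil"
	"net/http"
)

// fetchIfChanged illustrates the redundancy checks described above:
// a conditional GET using If-None-Match, followed by a SHA-1 comparison
// against the checksum recorded at the last collection. It returns the
// body and whether the content should be treated as new.
func fetchIfChanged(url, storedETag, storedSHA1 string) ([]byte, bool, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, false, err
	}
	if storedETag != "" {
		req.Header.Set("If-None-Match", storedETag) // conditional request
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		return nil, false, nil // the site told us nothing changed
	}
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, false, err
	}
	// Not all sites honor the conditional headers, so also compare checksums.
	sum := sha1.Sum(body)
	if hex.EncodeToString(sum[:]) == storedSHA1 {
		return body, false, nil // collected, but redundant: don't re-message it
	}
	return body, true, nil
}
```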
The Scheduler runs the schedule as expected, controlling when the collectors execute. Other aspects of collection control are less complete, and need improvements in the following areas:
- Exception Handling. Minor exceptions during collection are generally ignored. However no attempt is yet made to deal with more serious exceptions. In particular, no attempt is made to ensure that the metadata recording, document storage, and message sending are performed in a transactional manner. The Scheduler does have a shutdown hook so it can attempt to exit gracefully for planned shutdowns.
- Collector Metadata Storage. This is currently implemented strictly as proof-of-principle. Metadata is stored to a flat file, requiring constant re-loading and re-writing of the entire file. We know this won't scale, and plan to migrate to an embedded database.
- Leveraging robots.txt. The code does not currently read a site's robots.txt file. It should do so in order to determine the throttling setting, as well as know if it should avoid collection of some files. Currently, we can honor these in the configuration file by using the crawl-delay setting and by only specifying URLs that are fair game.
The message queue accepts input (documents) from the collectors and pushes the documents into the processing pipeline. The message queue is implemented with RabbitMQ, which implements the AMQP standard.
The queue should hold messages until the processing pipeline acknowledges their receipt.
Input and output protocol is AMQP 0-9-1.
The message queue should pass on the data as is from collectors.
RT is the Real-time processing component of Stucco.
The data it receives will be transformed into a subgraph, consistent with the STIX 1.X format, and then aligned with the knowledge graph.
Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ.
There are two types of messages: (1) messages with data and (2) messages without data that reference an ID in the document store.
RT will send an acknowledgement to the queue when the messages are received, so that the queue can release these resources.
Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses a JDBC driver to execute SQL statements.
Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses SQL statements.
The RabbitMQ consumer pulls messages off the message queue based on the routing key contained in each message.
Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ.
Advanced Message Queuing Protocol (AMQP), as implemented in RabbitMQ.
JSON object with the following fields:
- source (string): the routing key
- timestamp (long): the timestamp indicating when the message was collected
- contentIncl (boolean): indicates whether the data is included in the message
- message (string): the data, if included in the message; otherwise, the document id used to retrieve the data
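As an illustration only, these fields map onto a struct like the following; the type and function names are assumptions, while the JSON keys come from the list above:

```go
package main

import "encoding/json"

// rtMessage mirrors the JSON fields listed above. The type itself is
// illustrative; only the JSON keys come from the specification.
type rtMessage struct {
	Source      string `json:"source"`      // the routing key
	Timestamp   int64  `json:"timestamp"`   // when the message was collected
	ContentIncl bool   `json:"contentIncl"` // is the data included in the message?
	Message     string `json:"message"`     // the data, or the document id
}

func parseRTMessage(body []byte) (rtMessage, error) {
	var m rtMessage
	err := json.Unmarshal(body, &m)
	return m, err
}
```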
The Entity Extractor obtains an unstructured document's content either from the message, or by requesting the document from the document-service. The document content is then annotated with cyber-domain concepts.
Two Java Strings:
- document title
- document text content
Annotated document object (https://nlp.stanford.edu/nlp/javadoc/javanlp/Annotation) with the following information:
- Text: original raw text
- Sentences: list of sentences
  - Sentence: map representing one sentence
    - Token: word within the sentence
    - POSTag: part-of-speech tag
    - CyberEntity: cyber domain label for the token
    - ParseTree: sentence structure as a tree
The Relation Extractor discovers relationships between the concepts and constructs a subgraph of this knowledge.
Java String representing the data source, and an Annotated document object (https://nlp.stanford.edu/nlp/javadoc/javanlp/Annotation) with the following information:
- Text: original raw text
- Sentences: list of sentences
  - Sentence: map representing one sentence
    - Token: word within the sentence
    - POSTag: part-of-speech tag
    - CyberEntity: cyber domain label for the token
    - ParseTree: sentence structure as a tree
A JSON-formatted subgraph of the vertices and edges, which loosely resembles the STIX data model.
{
"vertices": {
"1235": {
"name": "1235",
"vertexType": "software",
"product": "Windows XP",
"vendor": "Microsoft",
"source": "CNN"
},
...
"1240": {
"name": "file.php",
"vertexType": "file",
"source": "CNN"
}
},
"edges": [
{
"inVertID": "1237",
"outVertID": "1238",
"relation": "ExploitTargetRelatedObservable"
},
{
"inVertID": "1240",
"outVertID": "1239",
"relation": "Sub-Observable"
}
]
}
The STIXExtractors component transforms a structured document into its corresponding STIX subgraph. This component also handles the output of an unstructured document once it has been transformed into a structured subgraph.
Java String representing the data
A JSON-formatted subgraph of the vertices and edges, which loosely resembles the STIX data model.
{
"vertices": {
"1235": {
"name": "1235",
"vertexType": "software",
"product": "Windows XP",
"vendor": "Microsoft",
"source": "CNN"
},
...
"1240": {
"name": "file.php",
"vertexType": "file",
"source": "CNN"
}
},
"edges": [
{
"inVertID": "1237",
"outVertID": "1238",
"relation": "ExploitTargetRelatedObservable"
},
{
"inVertID": "1240",
"outVertID": "1239",
"relation": "Sub-Observable"
}
]
}
The Alignment component aligns and merges the new subgraph into the full knowledge graph.
A JSON-formatted subgraph of the vertices and edges, which loosely resembles the STIX data model.
{
"vertices": {
"1235": {
"name": "1235",
"vertexType": "software",
"product": "Windows XP",
"vendor": "Microsoft",
"source": "CNN"
},
...
"1240": {
"name": "file.php",
"vertexType": "file",
"source": "CNN"
}
},
"edges": [
{
"inVertID": "1237",
"outVertID": "1238",
"relation": "ExploitTargetRelatedObservable"
},
{
"inVertID": "1240",
"outVertID": "1239",
"relation": "Sub-Observable"
}
]
}
JSON-formatted subgraph.
The graph database connection is an interface with specific implementations for each supported database. This interface implements reads from, and writes to, the knowledge graph.
JSON-formatted subgraph. (See Alignment output format.)
Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses a JDBC driver to execute SQL statements.
Depends on the data storage technology. The current implementation is a PostgreSQL relational database, which uses SQL statements.
The document-service stores and makes available the raw documents. The backend storage is on the local filesystem.
Be sure to set the content-type HTTP header to the appropriate type when adding documents (e.g., content-type: application/json for JSON data or content-type: application/pdf for PDF files).
Routes:
- POST server:port/document - add a document and autogenerate an id
- POST server:port/document/id - add a document with a specific id
The accept-encoding header can be set to gzip to compress the communication (i.e., accept-encoding: application/gzip).
The accept header can be one of the following: application/json, text/plain, or application/octet-stream. Use application/octet-stream for PDF files and other binary data.
Routes:
- GET server:port/document/id - retrieve a document based on the specific id
HTTP.
HTTP.
JSON.
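A sketch of a client for these routes in Go; the host and port are placeholders, and the shape of the service's responses is not assumed:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
)

// The base URL is a placeholder for the document-service's server:port.
const docServiceURL = "http://localhost:8000"

// addDocument POSTs a JSON document, letting the service autogenerate an id.
// What exactly the service returns in the body is not assumed here.
func addDocument(doc []byte) ([]byte, error) {
	resp, err := http.Post(docServiceURL+"/document", "application/json", bytes.NewReader(doc))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return ioutil.ReadAll(resp.Body)
}

// getDocument retrieves a document by id, asking for JSON back.
func getDocument(id string) ([]byte, error) {
	req, err := http.NewRequest("GET", docServiceURL+"/document/"+id, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("document service returned %s", resp.Status)
	}
	return ioutil.ReadAll(resp.Body)
}
```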
The Query Service provides a RESTful web service, which communicates with the Graph Database Connection API to allow the UI and any third-party applications to interface with the knowledge graph, implemented in PostgreSQL.
The API will provide functions that facilitate common operations (e.g., get a node by ID).
- host:port/api/search - Returns a list of all nodes that match the search query.
- host:port/api/vertex/vertexType=<vertType>&name=<vertName>&id=<vertID> - Returns the node with the specified <vertName> or <vertID>.
- host:port/api/inEdges/vertexType=<vertType>&name=<vertName>&id=<vertID> - Returns the in-bound edges to the specified node.
- host:port/api/outEdges/vertexType=<vertType>&name=<vertName>&id=<vertID> - Returns the out-bound edges to the specified node.
- host:port/api/count/vertices - Returns a count of all nodes in the knowledge graph.
- host:port/api/count/edges - Returns a count of all edges in the knowledge graph.
HTTP.
JSON.
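As an illustration, a client call against one of the endpoints above might look like this in Go; the host, port, and response handling are assumptions:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

// The base URL is a placeholder for the Query Service's host:port.
const queryServiceURL = "http://localhost:8080"

// countVertices calls the /api/count/vertices endpoint and returns the
// raw JSON response; parsing is omitted because the exact response shape
// is not specified here.
func countVertices() (string, error) {
	resp, err := http.Get(queryServiceURL + "/api/count/vertices")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("query service returned %s", resp.Status)
	}
	body, err := ioutil.ReadAll(resp.Body)
	return string(body), err
}
```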