Flowtuple Formats

Flowtuple 4

The flowtuple4 format resulted from the realisation that the storage requirements of flowtuple3 files were not sustainable. The sheer quantity of flows being recorded, and the rate at which that number was growing, were identified as the primary causes of this problem.

In flowtuple4, we address this by identifying flows using destination IP networks rather than individual destination IP addresses, i.e. destination IP addresses are masked to a parent subnet and that subnet address is used when deriving flow keys. The reasoning is that, in the telescope context, individual destination addresses have little meaning (aside from the range of addresses that are probed as part of a particular event) so creating individual flow entries for each address seen costs a lot of storage without adding much useful information. Instead, we can track the number of unique destination addresses seen for each subnet-flow and include that information in the final flowtuple record; this means we can retain the key information while reducing our number of flow records significantly in some cases.

Example

Consider a scanning process that sends the same TCP SYN packet to port 22 on all addresses within the 10.0.100.0/24 subnet.

In flowtuple3, this would result in 256 flow entries with a lot of repeated information (only the first five are shown here):

<timestamp>,<scanner IP>,10.0.100.0,<src port>,22,6 ....
<timestamp>,<scanner IP>,10.0.100.1,<src port>,22,6 ....
<timestamp>,<scanner IP>,10.0.100.2,<src port>,22,6 ....
<timestamp>,<scanner IP>,10.0.100.3,<src port>,22,6 ....
<timestamp>,<scanner IP>,10.0.100.4,<src port>,22,6 ....

In flowtuple4, this would become a single flow entry:

<timestamp>,<scanner IP>,10.0.100.0,22,6,256 ....

Note two things in the example:

src port is no longer included in the flow key (more on this later)
the addition of 256, which is the number of unique destination IPs in the subnet that saw a packet matching this flow description. Seeing 256 tells us that the sender covered the whole subnet. If we saw a small number instead, then they did not (but the specific addresses are still unlikely to matter in most cases).
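
The example above can be reproduced with a short Python sketch. This is only an illustration of the masking idea, not the code used by corsaro: the /24 mask length, the function names and the placeholder scanner address 192.0.2.1 are all assumptions made for the example.

```python
import ipaddress
from collections import defaultdict

# Assumed mask length for this sketch; the real value is not stated above.
PREFIX_LEN = 24


def mask_to_subnet(dst_ip: str, prefix_len: int = PREFIX_LEN) -> str:
    """Return the network address of the subnet that contains dst_ip."""
    net = ipaddress.ip_network(f"{dst_ip}/{prefix_len}", strict=False)
    return str(net.network_address)


def aggregate(packets):
    """Group (src_ip, dst_ip, dst_port, protocol) tuples into subnet-flows."""
    flows = defaultdict(lambda: {"packet_cnt": 0, "dst_ips": set()})
    for src_ip, dst_ip, dst_port, protocol in packets:
        # The flow key uses the masked destination, not the full address.
        key = (src_ip, mask_to_subnet(dst_ip), dst_port, protocol)
        flows[key]["packet_cnt"] += 1
        flows[key]["dst_ips"].add(dst_ip)
    return {
        key: {"packet_cnt": f["packet_cnt"], "uniq_dst_ips": len(f["dst_ips"])}
        for key, f in flows.items()
    }


# One TCP SYN to port 22 on every address in 10.0.100.0/24.
# 192.0.2.1 is a placeholder for the scanner IP.
scan = [("192.0.2.1", f"10.0.100.{i}", 22, 6) for i in range(256)]
result = aggregate(scan)
```

With this input, `result` holds a single subnet-flow whose `uniq_dst_ips` value is 256, matching the single flowtuple4 entry shown above.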

Aggregation of other data fields

A similar principle applies to many of the other data fields that we were previously tracking in flowtuple3, such as TTL, source port or packet size. Specific values for these fields are only important if they occur all (or almost all) of the time across every packet that is seen for a subnet-flow. However, if the values are quite varied, then the individual values do not matter but the fact that the values vary is still worth recording.

With this in mind, we now have a number of data fields where we record three things:

the number of unique values seen for this data field (much like we do with unique destination IP addresses)
an array containing any data values which occur frequently for the flow
an array containing the frequency at which the values shown in the preceding array occurred.

Note that the indexing of the two arrays is consistent, i.e. the quantity at index 0 in the frequency array is the frequency of the value at index 0 in the data value array.
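
As a small illustration of the consistent indexing, the two arrays can be paired up by position. The helper name and the record values below are made up for the example; the field names follow the flowtuple4 schema.

```python
def common_values(record: dict, values_field: str, freqs_field: str) -> dict:
    """Pair each frequently seen value with its frequency, by array index."""
    values = record[values_field]
    freqs = record[freqs_field]
    if len(values) != len(freqs):
        raise ValueError("value and frequency arrays must be the same length")
    return dict(zip(values, freqs))


# Made-up record fragment, for illustration only.
record = {
    "uniq_ttls": 3,
    "common_ttls": [64, 128],
    "common_ttl_freqs": [40, 25],
}
ttl_freqs = common_values(record, "common_ttls", "common_ttl_freqs")
```

Here `ttl_freqs[64]` is 40 because both 64 and 40 sit at index 0 of their respective arrays.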

Required frequencies

To be considered a "frequently occurring" value, the value must be seen in at least 20% of packets that are matched to a given flow. However, for flows where the packet count is low, we increase this ratio to compensate for the low sample size.

The following table shows the minimum ratio that a value must reach to be considered frequent, based on the total number of packets in the flow:

Total Flow Packets	Minimum Ratio
1 - 4	1.0
5 - 6	0.5
7 - 14	0.33
15 +	0.2

If there are 4 or fewer packets in a flow, then all packets must have the same data field value for that value to be considered frequent. If there are 10 packets in a flow, then a data field value is frequent if at least 4 packets have it (10 * 0.33 == 3.3, rounded up). If a flow has 24 packets, the required number of matching packets is 5 (24 * 0.2 == 4.8, rounded up).
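
The table and worked examples above can be sketched in Python as follows. This is a minimal sketch rather than the actual implementation: the function names are hypothetical, the rounding follows the worked examples, and it assumes that the recorded "frequency" of a value is its packet count. Fractions are used instead of floats so that the rounding step is exact.

```python
import math
from collections import Counter
from fractions import Fraction

# (largest packet count in the band, minimum ratio), taken from the table.
THRESHOLDS = [
    (4, Fraction(1)),
    (6, Fraction(1, 2)),
    (14, Fraction(33, 100)),
]
DEFAULT_RATIO = Fraction(1, 5)  # flows with 15 or more packets


def min_ratio(total_packets: int) -> Fraction:
    """Look up the minimum ratio for a flow with total_packets packets."""
    for upper, ratio in THRESHOLDS:
        if total_packets <= upper:
            return ratio
    return DEFAULT_RATIO


def required_packets(total_packets: int) -> int:
    """Number of packets that must share a value for it to be frequent."""
    return math.ceil(total_packets * min_ratio(total_packets))


def frequent_values(observed: list) -> tuple:
    """Return (unique count, frequent values, their packet counts)."""
    counts = Counter(observed)
    needed = required_packets(len(observed))
    common = [(v, c) for v, c in counts.most_common() if c >= needed]
    values = [v for v, _ in common]
    freqs = [c for _, c in common]  # assumption: frequency == packet count
    return len(counts), values, freqs
```

For instance, `required_packets(10)` gives 4 and `required_packets(24)` gives 5, in line with the examples in the text.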

Remaining data fields

For a few fields, we decided that it was not worth the processing effort of recording the range of individual observed values and their frequencies, but we still wanted to record something in case that information was useful at some later point. For those fields, we simply report the value that was observed for the first packet that matched the flow description.

Finally, there are a handful of data fields that are derived by cross-referencing the source IP against other databases to add extra information (e.g. geo-location and ASN). As we are still dividing flows by individual source IPs, these fields are unchanged in flowtuple4.

Output file schema

For flowtuple4, we have retained the Avro format but we are now using the following schema:

Field Name	Type	Role	Notes
time	long	Key	Timestamp of the interval that this flowtuple belongs to
src_ip	long	Key	Source IP address, as an integer
dst_net	long	Key	The destination IP network, as an integer
dst_port	long	Key	The destination port for TCP and UDP flows, (type << 8) + code for ICMP flows
protocol	int	Key	The transport protocol used, e.g. 6 = TCP, 17 = UDP
packet_cnt	long	Counter	Number of packets seen in this interval that match this flow description
uniq_dst_ips	long	Counter	Number of unique destination IPs seen for this flow
uniq_pkt_sizes	long	Counter	Number of unique packet sizes seen for this flow
uniq_ttls	long	Counter	Number of unique IP TTLs seen for this flow
uniq_src_ports	long	Counter	Number of unique source ports seen for this flow (TCP and UDP only)
uniq_tcp_flags	long	Counter	Number of unique TCP flag combinations seen for this flow (TCP only)
first_syn_length	int	First	Only applies to TCP flows; the size of the TCP header in bytes (i.e. doff * 4) for the first observed packet
first_tcp_rwin	int	First	Only applies to TCP flows; the receive window announced in the first observed TCP SYN packet
common_pktsizes	array(long)	Observed Values	Array containing packet sizes that were frequently observed for this flow
common_pktsize_freqs	array(long)	Frequencies	Array containing frequencies for packet sizes listed in common_pktsizes array
common_ttls	array(long)	Observed Values	Array containing IP TTLs that were frequently observed for this flow
common_ttl_freqs	array(long)	Frequencies	Array containing frequencies for IP TTLs listed in common_ttls array
common_srcports	array(long)	Observed Values	Array containing TCP/UDP source ports that were frequently observed for this flow
common_srcport_freqs	array(long)	Frequencies	Array containing frequencies for source ports listed in common_srcports array
common_tcpflags	array(long)	Observed Values	Array containing TCP flag combinations that were frequently observed for this flow
common_tcpflag_freqs	array(long)	Frequencies	Array containing frequencies for TCP flags listed in common_tcpflags array
maxmind_continent	string	Derived from Source IP	Geo-location of the source IP address, according to Maxmind (continent level)
maxmind_country	string	Derived from Source IP	Geo-location of the source IP address, according to Maxmind (country level)
netacq_continent	string	Derived from Source IP	Geo-location of the source IP address, according to Netacq-Edge (continent level)
netacq_country	string	Derived from Source IP	Geo-location of the source IP address, according to Netacq-Edge (country level)
prefix2asn	long	Derived from Source IP	ASN that the source IP address belongs to, according to the `prefix2asn` dataset
spoofed_packet_cnt	long	Counter	Number of packets where the source IP address was inferred to be spoofed
masscan_packet_cnt	long	Counter	Number of packets that were inferred to be sent by the masscan tool
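
Two of the key fields are packed into integers and need decoding before they are human-readable. The sketch below shows one way to do that in Python. The helper names are hypothetical, and it assumes that the integer form of an address is the usual big-endian reading of the four octets.

```python
import ipaddress

ICMP = 1  # IANA protocol number for ICMP


def int_to_ip(value: int) -> str:
    """Convert an integer address (e.g. src_ip or dst_net) to dotted-quad."""
    return str(ipaddress.IPv4Address(value))


def decode_dst_port(dst_port: int, protocol: int) -> dict:
    """Split dst_port into ICMP type and code, or return it as a plain port."""
    if protocol == ICMP:
        # ICMP flows store (type << 8) + code in dst_port.
        return {"icmp_type": dst_port >> 8, "icmp_code": dst_port & 0xFF}
    return {"port": dst_port}
```

For example, `int_to_ip(167797760)` returns `"10.0.100.0"`, and an ICMP flow with a `dst_port` of 769 decodes to type 3, code 1.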

Flowtuple 3

The flowtuple3 format was created alongside corsaro3. The goal was to take the existing flowtuple2 record format and expand it to include additional fields that could be valuable to users of the flowtuple data.

flowtuple3 files are written to disk using the Avro format. Each flowtuple is an Avro record conforming to the following schema:

Field Name	Type	Is Key Field?	Notes
time	long	Yes	Timestamp of the interval that this flowtuple belongs to
src_ip	long	Yes	Source IP address, as an integer
dst_ip	long	Yes	Destination IP address, as an integer
src_port	int	Yes	For ICMP, this is the ICMP type
dst_port	int	Yes	For ICMP, this is the ICMP code
protocol	int	Yes	The transport protocol used, e.g. 6 = TCP, 17 = UDP
ttl	int	Yes	TTL from the IP header
tcp_flags	int	Yes	Only applies to TCP flows; the 8 bits of TCP flags as an integer (ignores the NS flag)
ip_len	int	Yes	The size of the observed packet, according to the IP header
tcp_synlen	int	No	Only applies to TCP flows; the size of the TCP header in bytes (i.e. doff * 4)
tcp_synwinlen	int	No	Only applies to TCP flows; the receive window announced in the TCP SYN packet
packet_cnt	long	No	Number of packets seen in this interval that match this flow description
is_spoofed	int	No	Flag indicating whether we thought the packets for this flow were spoofed
is_masscan	int	No	Flag indicating whether we thought the packets were generated by the `masscan` software
maxmind_continent	string	No	Geo-location of the source IP address, according to Maxmind (continent level)
maxmind_country	string	No	Geo-location of the source IP address, according to Maxmind (country level)
netacq_continent	string	No	Geo-location of the source IP address, according to Netacq-Edge (continent level)
netacq_country	string	No	Geo-location of the source IP address, according to Netacq-Edge (country level)
prefix2asn	long	No	ASN that the source IP address belongs to, according to the `prefix2asn` dataset

Each flowtuple3 file generated by corsaro3 represents an interval of a single minute.

Flowtuple3 files generated during or before April 2021 probably will not have valid Maxmind geo-location tags. They will just have ?? values to represent that the location is unknown.

The flowtuple plugin for corsaro3 will still generate files using the flowtuple3 format -- the ftconvert tool can be used to convert these files into the flowtuple4 format, if required.
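
To make the relationship between the two schemas concrete, the sketch below maps the key fields of a flowtuple3 record onto a flowtuple4 flow key. It is not the ftconvert implementation. The function name and the /24 mask length are assumptions, as is carrying the `time` value over unchanged.

```python
ICMP = 1  # IANA protocol number for ICMP
PREFIX_LEN = 24  # assumed mask length; the real value is not stated here


def ft3_to_ft4_key(rec: dict, prefix_len: int = PREFIX_LEN) -> tuple:
    """Derive (time, src_ip, dst_net, dst_port, protocol) from a flowtuple3 record."""
    # Mask the destination address down to its parent subnet.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    dst_net = rec["dst_ip"] & mask
    if rec["protocol"] == ICMP:
        # flowtuple3 keeps the ICMP type in src_port and the code in dst_port;
        # flowtuple4 combines them as (type << 8) + code.
        dst_port = (rec["src_port"] << 8) + rec["dst_port"]
    else:
        dst_port = rec["dst_port"]
    # src_port, ttl, tcp_flags and ip_len are no longer part of the key.
    return (rec["time"], rec["src_ip"], dst_net, dst_port, rec["protocol"])
```

Records that map to the same key would then be merged into one flowtuple4 record, with their packet counts summed and their per-field values fed into the unique-value counters.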

Many of our supporting tools and APIs for dealing with flowtuple data have been written to work with flowtuple3 data -- for instance, the pyspark API in stardust-tools is written for flowtuple3 Avro files.

Flowtuple 2

Flowtuple2 refers to the format that was used to encode flowtuple records captured using corsaro2. Flowtuple2 is a simple packed binary format that was designed by Alistair specifically for storing long-term flowtuple data from the telescope.

Flowtuple2 is very space-efficient (especially when the flowtuples are properly sorted and compressed) and is relatively simple to generate. However, it can only be read (or written) using software that is written to use our libcorsaro v2 library and thus is not easy to integrate with modern big-data processing engines (such as Spark/Hadoop). This means we had to do development work any time we wanted to perform new types of analysis on the collected data.

In flowtuple2, the reporting interval is set to 1 minute, i.e. all flows observed are written to the output file every minute. Output files are rotated every hour, so there are up to 60 intervals' worth of data in each file. More recent flowtuple2 files can therefore be quite large.

Because flowtuple2 collection has been running for a number of years, we currently have a large archive of historical flowtuple data in this format. The plan is to convert (some of) this data into the more processing-friendly flowtuple4 format, but until then we will still need to have software available that can read and process the flowtuple2 format.

Flowtuple2 file structure

The flowtuple2 file begins with a custom header and ends with a trailer which contain various pieces of meta-data about the data collection process. Between those, the record sets for each interval will appear in chronological order.

Each interval within a flowtuple2 file is bookended by a special record that indicates the start or end of the interval (as well as the interval timestamp). Within that, the flowtuples are divided into one of three classes, again each bookended with class start and end records. The classes are "backscatter", "ICMP request" and "other". Each flowtuple itself is a record within its corresponding class structure.

With all that in mind, the structure of a flowtuple2 file will look something like:

FLOWTUPLE HEADER
INTERVAL START 0 <interval timestamp 1>
  CLASS START BACKSCATTER <number of flowtuples>
    [backscatter flowtuple 1]
    [backscatter flowtuple 2]
    ...
    [backscatter flowtuple N]
  CLASS END BACKSCATTER
  CLASS START ICMPREQ <number of flowtuples>
    [icmpreq flowtuple 1]
    [icmpreq flowtuple 2]
    ...
    [icmpreq flowtuple N]
  CLASS END ICMPREQ
  CLASS START OTHER <number of flowtuples>
    [other flowtuple 1]
    [other flowtuple 2]
    ...
    [other flowtuple N]
  CLASS END OTHER
INTERVAL END 0
INTERVAL START 1 <interval timestamp 2>
... etc. etc.
INTERVAL END 59
FLOWTUPLE TRAILER

Flowtuple File Format

Thankfully, documentation of the binary file format already exists at https://www.caida.org/tools/measurement/corsaro/docs/formats.html

Reading flowtuple2 files

Using existing tools: you will need to install the v2 branch of corsaro to be able to read flowtuple2 files using the tools that already exist (such as cors2ascii). Your best bet will be to install corsaro2 as per https://www.caida.org/tools/measurement/corsaro/docs/quickstart.html and go from there.

Note that corsaro2 and corsaro3 cannot co-exist on the same host, which can be quite a nuisance.

There is also the libflowtuple library that was written by Mark Weiman to assist with large-scale processing of flowtuple2 files. This also includes a couple of basic tools, and can be used to write custom flowtuple2 processing code. libflowtuple can be installed on the same host as corsaro3 and play nicely, but doesn't quite have the same level of documentation and packaging as corsaro2 does (yet).
