forked from mapr-demos/gess

A generator for synthetic streams of financial transactions.

gess

A _ge_nerator for _s_ynthetic _s_treams of financial transactions (ATM withdrawals).

Usage

First, start gess like so:

$ ./gess.sh start 

Then, check if gess is working fine:

$ ./dummy_gess_sink.sh

Once active, gess will stream synthetic data about ATM withdrawals, in a line-oriented, JSON-formatted fashion on default port 6900 via UDP (which you can observe as the output of dummy_gess_sink.sh):

...
{
  'timestamp': '2013-11-08T10:58:19.668225',
  'atm' : 'Santander',
  'lat': '36.7220096',
  'lon': '-4.4186772',
  'amount': 100,
  'account_id': 'a335',
  'transaction_id': '636adacc-49d2-11e3-a3d1-a820664821e3'
}
...

Note 1: The average size of one transaction (interpreted as a string) is ca. 200-250 bytes. At a rate of roughly 10,000 transactions per second, gess typically emits some 2 MB/sec, which amounts to about 7 GB/h of transaction data.

Note 2: In the above example, which shows a withdrawal in Spain, the data has been re-formatted for readability. In fact, each transaction occupies a single line and is terminated by a \n.

Note 3: dummy_gess_sink.sh both echoes the received values on screen and logs them to a file named dummy_gess_sink.log.
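If you would rather consume the stream from Python than from dummy_gess_sink.sh, a minimal sketch could look like the following. It is not part of gess: the names parse_transaction and run_sink are made up for illustration, and it assumes that gess sends its datagrams to the local machine on the default port 6900. Because the sample output above uses single quotes, the parser tries strict JSON first and falls back to a Python literal.

```python
from __future__ import print_function

import ast
import json
import socket


def parse_transaction(line):
    """Turn one received line into a dict (hypothetical helper).

    Strict JSON is tried first. The fallback handles single-quoted
    records like the sample shown above, which is an assumption
    about the wire format.
    """
    line = line.strip()
    try:
        return json.loads(line)
    except ValueError:
        return ast.literal_eval(line)


def run_sink(host="", port=6900, max_messages=5):
    """Receive up to max_messages UDP datagrams and print each transaction."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))  # empty host: listen on all local interfaces
    try:
        for _ in range(max_messages):
            data, _addr = sock.recvfrom(4096)
            # One datagram may carry one or more newline-terminated records.
            for line in data.decode("utf-8").splitlines():
                if line.strip():
                    print(parse_transaction(line))
    finally:
        sock.close()
```

With gess running, calling run_sink() prints the first few transactions and then returns. Only one process can bind the port at a time, so stop dummy_gess_sink.sh first.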

Dependencies

  • Python 2.7+
  • For the data extraction part only (adding your own ATM locations via OSM dumps): imposm.parser, which in turn requires ProtoBuf to be installed.

Data

Default setting (Spanish ATM locations)

We aim for quality synthetic data. To this end, the default data used for the ATM locations is that of Spain obtained from the OpenStreetMap project. To be more precise, the default data are the geo-coordinates of 822 ATMs in Spain which have been downloaded via the POI export service.

The withdrawal amounts are taken from a fixed set of values (20, 50, 100, 200, 300, 400) and the rest of the data (transaction ID/timestamp) is arbitrary.

Note that fraudulent transactions (consecutive withdrawals in different locations within a short time frame) are marked: their transaction_id consists of xxx followed by the transaction_id of the original transaction. This is a convenience that simplifies debugging on the command line and can otherwise be ignored.
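The marker can be checked on the consumer side with a few lines of Python. This is a sketch under the assumption that the xxx prefix is followed directly by the original ID, without a separator; the function names are hypothetical and not part of gess.

```python
FRAUD_PREFIX = "xxx"  # marker described above; "no separator" is an assumption


def is_marked_fraudulent(transaction):
    """Return True if the transaction dict carries the fraud marker."""
    tid = str(transaction.get("transaction_id", ""))
    return tid.startswith(FRAUD_PREFIX)


def original_transaction_id(transaction):
    """Return the ID of the original transaction, or None if unmarked."""
    tid = str(transaction.get("transaction_id", ""))
    if tid.startswith(FRAUD_PREFIX):
        return tid[len(FRAUD_PREFIX):]
    return None
```

Looking up the returned ID among previously seen transactions gives the pair of withdrawals that make up one simulated fraud case.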

Extending ATM locations

If you want to add new ATM locations, then you need to do the following:

  1. Choose a geographic area and download the respective .osm dump from sites such as Metro Extracts.
  2. Then, run data/extract_atms.py, which takes the ATM-tagged nodes in OSM/XML format and extracts and converts them into the CSV format used internally by gess.
  3. Add the generated ATM location data file in CSV format to gess.conf so that gess picks it up at startup.
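To illustrate what the extraction in step 2 does conceptually, here is a standard-library sketch that pulls ATM nodes out of an OSM XML file. It is not the actual data/extract_atms.py (which relies on imposm.parser), it handles only OSM XML, and the CSV column layout (lat, lon, label) is an assumption rather than the format gess expects. It relies on the OSM convention of tagging such nodes with amenity=atm.

```python
import csv
import xml.etree.ElementTree as ET


def extract_atms(osm_source):
    """Yield (lat, lon, label) for every node tagged amenity=atm.

    osm_source is a filename or a binary file object with OSM XML.
    """
    for _event, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag != "node":
            continue
        # Collect the node's <tag k="..." v="..."/> children into a dict.
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        if tags.get("amenity") == "atm":
            label = tags.get("operator") or tags.get("name") or ""
            yield elem.get("lat"), elem.get("lon"), label
        elem.clear()  # drop the node's children to limit memory use


def write_csv(rows, out):
    """Write the extracted rows to an open text file (hypothetical layout)."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row)
```

Streaming with iterparse avoids loading the whole dump into memory, which matters for files in the gigabyte range.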

To give you an idea in terms of performance: on my laptop (a MBP with 16 GB RAM) it takes approximately 3 min to extract 416 ATM locations from the San Francisco Bay Area OSM file. This OSM file contains some 198,000 nodes with a raw, unzipped file size of 1.45 GB.

Understanding the runtime statistics

In parallel to the data streaming, gess writes runtime statistics every 10 sec to the log file gess.tsv, in a TSV format that looks as follows (slightly re-formatted for readability):

timestamp            num_fintrans tp_fintrans num_bytes tp_bytes
2014-02-03T05:56:59  101          10          23        2
2014-02-03T05:57:09  102          10          23        2
2014-02-03T05:57:19  99           9           22        2
2014-02-03T05:57:29  97           9           22        2
2014-02-03T05:57:39  106          10          24        2
2014-02-03T05:57:49  108          10          25        2
...

The columns have the following semantics:

  • num_fintrans … financial transactions emitted in sample interval (in thousands)
  • tp_fintrans … throughput of financial transactions (in thousands/second) in sample interval
  • num_bytes … number of bytes emitted (in MB) in sample interval
  • tp_bytes … throughput of bytes (in MB/sec) in sample interval

So, for example, the first non-header line states that:

  • Some 101,000 financial transactions were emitted, in the sample interval ...
  • ... with a throughput of 10,000 transactions per sec.
  • And further, that 23 MB have been emitted ...
  • ... with a throughput of 2 MB/sec in the sample interval.
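The unit conversions above can be applied programmatically. The sketch below reads statistics rows and scales them to plain counts. It assumes that gess.tsv starts with the header row shown above, that fields are separated by tabs, and that all values are integers; the function names are hypothetical.

```python
import csv


def read_stats(tsv_lines):
    """Yield one dict per statistics row, with counts scaled to plain units.

    tsv_lines is any iterable of lines, e.g. an open gess.tsv file.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        yield {
            "timestamp": row["timestamp"],
            # num_fintrans and tp_fintrans are reported in thousands.
            "transactions": int(row["num_fintrans"]) * 1000,
            "transactions_per_sec": int(row["tp_fintrans"]) * 1000,
            # num_bytes and tp_bytes are reported in MB.
            "megabytes": int(row["num_bytes"]),
            "megabytes_per_sec": int(row["tp_bytes"]),
        }


def avg_bytes_per_transaction(stat, bytes_per_mb=1000 * 1000):
    """Average transaction size; 1 MB = 10**6 bytes is an assumption."""
    return stat["megabytes"] * bytes_per_mb / float(stat["transactions"])
```

For the first sample row (23 MB, 101,000 transactions) this gives roughly 228 bytes per transaction, which is consistent with the 200-250 bytes stated in Note 1.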

License

Apache License, Version 2.0.
