Design
This page describes the design of the RandomDataset library.
The library is composed of the following main modules:
This contains the entry point for the command line utility called generate_dataset. This routine uses the click library to expose it as a program which parses command line arguments and can produce helpful information with the --help flag.
There is also a test routine, print_csv_test, which creates a simple schema, generates data from it, and prints the results to stdout.
The main application loop, as implemented in generate_dataset, reads the specified schema file, which is expected to yield a list of generators. Each generator is expected to produce one dataset. The generators are visited in order and each one writes to the specified destination path, which either fills in a single database file or produces multiple files with that destination as a prefix. The accompanying diagram illustrates this flow.
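The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: StubGenerator, its write method, run_generators, and the prefix_name.txt naming scheme are all hypothetical and are not taken from the library.

```python
import os
import tempfile
from typing import Iterable, List


class StubGenerator:
    """Hypothetical stand-in for a generator returned by the schema parser."""

    def __init__(self, name: str, lines: List[str]):
        self.name = name
        self.lines = lines

    def write(self, dest_prefix: str) -> str:
        # Assumed naming scheme: one file per dataset, sharing the prefix.
        path = f"{dest_prefix}_{self.name}.txt"
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(self.lines) + "\n")
        return path


def run_generators(generators: Iterable[StubGenerator], dest_prefix: str) -> List[str]:
    """Visit each generator in order and let it write to the destination."""
    written = []
    for generator in generators:
        written.append(generator.write(dest_prefix))
    return written


# Usage: two generators produce two files that share the "out" prefix.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "out")
    paths = run_generators(
        [StubGenerator("a", ["1", "2"]), StubGenerator("b", ["x"])], prefix
    )
    written_names = [os.path.basename(p) for p in paths]
    with open(paths[0], encoding="utf-8") as handle:
        first_contents = handle.read()
```

Keeping the loop ignorant of the output format means a new kind of generator can be added without touching the entry point.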
This contains the routine parse_schema, which reads a YAML schema file and instantiates the list of objects it specifies. The schema file is expected to define a list of DataGenerator objects, which become the routine's return value. PyYAML is used to parse the schema file.
Fields represent the columns of a dataset: the Dataset object queries them to generate one or more values. What sort of data is generated depends entirely on which fields are selected in the schema file.
The base FieldGen class is inherited by specialised types that produce specific sorts of data. For example, IntFieldGen generates random integers, and AlphaNameGen produces a first or last name chosen at random from an internal list.
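The inheritance pattern described above might look something like the sketch below. The class names come from this page, but the constructor arguments, the generate method, the seed parameter, and the NAMES list are assumptions made for illustration and may not match the library's real interface.

```python
import random
from typing import Any, Optional


class FieldGen:
    """Base class: a named column that produces one value per call."""

    def __init__(self, name: str, seed: Optional[int] = None):
        self.name = name
        # A private RNG keeps each field reproducible when a seed is given.
        self.rng = random.Random(seed)

    def generate(self) -> Any:
        # Hypothetical method name; subclasses decide what data is produced.
        raise NotImplementedError


class IntFieldGen(FieldGen):
    """Random integers in the inclusive range [low, high]."""

    def __init__(self, name: str, low: int, high: int, seed: Optional[int] = None):
        super().__init__(name, seed)
        self.low = low
        self.high = high

    def generate(self) -> int:
        return self.rng.randint(self.low, self.high)


class AlphaNameGen(FieldGen):
    """A name chosen at random from an internal list (placeholder values)."""

    NAMES = ["Alice", "Bob", "Carol", "Dave"]

    def generate(self) -> str:
        return self.rng.choice(self.NAMES)


# Usage: draw a few values from each field type.
age_field = IntFieldGen("age", 18, 65, seed=1)
ages = [age_field.generate() for _ in range(5)]
first_name = AlphaNameGen("first_name", seed=1).generate()
```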
The Dataset class represents a set of fields (columns) and provides methods for accessing data by row or column. Generators use datasets as the source of data to write to a destination. If fields need to share data amongst themselves when generating, as linked fields do, they share it through the dataset's shared storage mechanism.
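As a rough sketch of row and column access plus shared storage between linked fields, consider the following. Everything about the interface here is assumed: fields are modelled as plain callables that receive a shared dictionary, and the row and column method names, as well as the choice to clear shared storage for each row, are hypothetical.

```python
import itertools
from typing import Any, Callable, Dict, List

# In this sketch a field is any callable that receives the dataset's
# shared storage and returns one value (an assumption, not the real API).
FieldFn = Callable[[Dict[str, Any]], Any]


class Dataset:
    """A set of named fields, materialised into rows on construction."""

    def __init__(self, fields: Dict[str, FieldFn], num_rows: int):
        self.fields = fields
        self.shared: Dict[str, Any] = {}
        self.rows: List[Dict[str, Any]] = []
        for _ in range(num_rows):
            # Shared values are scoped to a single row in this sketch.
            self.shared.clear()
            # Fields run in insertion order, so later ones can read
            # what earlier ones stored.
            self.rows.append({name: fn(self.shared) for name, fn in fields.items()})

    def row(self, index: int) -> Dict[str, Any]:
        return self.rows[index]

    def column(self, name: str) -> List[Any]:
        return [row[name] for row in self.rows]


counter = itertools.count(1)


def user_id(shared: Dict[str, Any]) -> int:
    # Publish the generated id so that a linked field can reuse it.
    shared["user_id"] = next(counter)
    return shared["user_id"]


def email(shared: Dict[str, Any]) -> str:
    # Linked field: derives its value from the id stored above.
    return f"user{shared['user_id']}@example.com"


data = Dataset({"user_id": user_id, "email": email}, num_rows=3)
emails = data.column("email")
second_row = data.row(1)
```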
Generators implement the data generation component of the library through subclasses of the DataGenerator class. Methods provided by this class handle writing data to a stream or file, but rely on subclasses to implement the write_stream method, which defines what form of data is written.
One subclass, CSVGenerator, is provided, which generates comma-separated tables of data. Each dataset is written to its own file.
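The split between the base class and its CSV subclass could be sketched along these lines. The names DataGenerator, CSVGenerator, and write_stream come from this page. The constructor arguments, the write_file helper, and the use of the standard csv module are assumptions for illustration only.

```python
import csv
import io
from typing import List, Sequence, TextIO


class DataGenerator:
    """Base class: owns file handling and defers the output format."""

    def __init__(self, header: Sequence[str], rows: Sequence[Sequence[object]]):
        # Hypothetical constructor: the real class draws rows from a Dataset.
        self.header: List[str] = list(header)
        self.rows: List[List[object]] = [list(row) for row in rows]

    def write_stream(self, stream: TextIO) -> None:
        # Subclasses define what form of data is written.
        raise NotImplementedError

    def write_file(self, path: str) -> None:
        # Hypothetical helper; newline="" lets the csv module control line endings.
        with open(path, "w", newline="", encoding="utf-8") as handle:
            self.write_stream(handle)


class CSVGenerator(DataGenerator):
    """Writes the header and rows as comma-separated text."""

    def write_stream(self, stream: TextIO) -> None:
        writer = csv.writer(stream, lineterminator="\n")
        writer.writerow(self.header)
        writer.writerows(self.rows)


# Usage: write to an in-memory stream instead of a file.
buffer = io.StringIO()
CSVGenerator(["id", "name"], [[1, "Ann"], [2, "Bo, Jr."]]).write_stream(buffer)
csv_text = buffer.getvalue()
```

Because only write_stream is format-specific, supporting another output format would mean adding one more subclass rather than changing the file-handling code.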