Eric Kerfoot edited this page Mar 22, 2023 · 8 revisions

This page describes the design of the RandomDataset library.

Modules Overview

[Diagram: RandomDataset modules]

The library is composed of the following main modules:

application

This contains the entry point for the command line utility, generate_dataset. The routine uses the click library to parse command line arguments and to print usage information when given the --help flag.

There is also a test routine print_csv_test which creates a simple schema, generates data using it, then prints the results to stdout.

The main application loop, as implemented in generate_dataset, reads the specified schema file, which is expected to yield a list of generators. Each generator is expected to produce one dataset. The generators are visited in order and each is used to write to the specified destination path. Depending on the generator, this either fills in a single database file or produces multiple files that use the destination as a prefix. This is illustrated here:

[Diagram: generate_dataset]
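
The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's code: StubGenerator, its write method, and run_generators are hypothetical names, and the generator list is built by hand where the real application would read it from the schema file.

```python
import os
import tempfile


class StubGenerator:
    """Hypothetical stand-in for a generator object read from a schema."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def write(self, dest_prefix):
        # Assumed behaviour: each generator writes its dataset to a file
        # whose path starts with the destination prefix.
        path = f"{dest_prefix}{self.name}.csv"
        with open(path, "w", newline="") as out:
            for row in self.rows:
                out.write(",".join(str(v) for v in row) + "\n")
        return path


def run_generators(generators, dest_prefix):
    """Visit each generator in order and have it write to the destination."""
    return [gen.write(dest_prefix) for gen in generators]


def demo():
    # In the real application this list would come from the schema file.
    generators = [
        StubGenerator("people", [("id", "name"), (1, "Ada")]),
        StubGenerator("visits", [("id", "day"), (1, 3)]),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "out_")
        paths = run_generators(generators, prefix)
        return [os.path.basename(p) for p in paths]
```

Calling demo() writes one file per generator under a temporary directory, which corresponds to the "multiple files with that destination as prefix" case.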

schemaparser

This contains the routine parse_schema, which reads a YAML schema file and instantiates the list of objects it specifies. The schema file is expected to define a list of DataGenerator objects, so that list is the routine's return value. PyYAML is used to parse the schema file.
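
One way the instantiation step might work is sketched below. The schema layout is an assumption made for illustration: the "typename" key, the REGISTRY mapping, instantiate_schema, and the constructor arguments are all hypothetical, and the actual schema format may differ. To keep the sketch free of dependencies, it starts from the plain lists and dictionaries that PyYAML's yaml.safe_load returns for a YAML list of mappings, not from a file.

```python
class DataGenerator:
    """Hypothetical placeholder for the library's generator base class."""

    def __init__(self, name, fields):
        self.name = name
        self.fields = fields


# Hypothetical registry mapping type names in the schema to classes.
REGISTRY = {"DataGenerator": DataGenerator}


def instantiate_schema(parsed):
    """Build objects from data shaped like the result of yaml.safe_load.

    Assumes each entry is a mapping with a "typename" key naming the class
    and the remaining keys giving constructor arguments.
    """
    objects = []
    for entry in parsed:
        kwargs = dict(entry)  # copy so the caller's data is not modified
        cls = REGISTRY[kwargs.pop("typename")]
        objects.append(cls(**kwargs))
    return objects


# Equivalent to what yaml.safe_load would return for a two-item YAML list.
parsed_schema = [
    {"typename": "DataGenerator", "name": "people", "fields": ["id", "name"]},
    {"typename": "DataGenerator", "name": "visits", "fields": ["id", "day"]},
]
generators = instantiate_schema(parsed_schema)
```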

fields

Fields represent the columns of datasets: the Dataset object queries them to generate one or more values. The sort of data generated depends entirely on which fields are selected in the schema file.

The base FieldGen class is inherited by specialised types to produce specific sorts of data. For example, IntFieldGen generates random integers, and AlphaNameGen produces a first or last name chosen at random from an internal list.

[Diagram: fields]
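
The inheritance pattern can be illustrated with a minimal sketch. Only the class names FieldGen, IntFieldGen, and AlphaNameGen come from this page. The constructor arguments, the generate and next_value methods, and the short name list are assumptions made for the example and may not match the library.

```python
import random


class FieldGen:
    """Hypothetical base class: a named column that yields values on demand."""

    def __init__(self, name, seed=None):
        self.name = name
        self.rng = random.Random(seed)

    def generate(self, n):
        """Return a list of n values by calling the subclass hook."""
        return [self.next_value() for _ in range(n)]

    def next_value(self):
        raise NotImplementedError


class IntFieldGen(FieldGen):
    """Random integers in the inclusive range [vmin, vmax]."""

    def __init__(self, name, vmin=0, vmax=100, seed=None):
        super().__init__(name, seed)
        self.vmin, self.vmax = vmin, vmax

    def next_value(self):
        return self.rng.randint(self.vmin, self.vmax)


class AlphaNameGen(FieldGen):
    """A name chosen at random from an internal list."""

    NAMES = ["Alice", "Bob", "Carol", "Dave"]  # placeholder list

    def next_value(self):
        return self.rng.choice(self.NAMES)


ages = IntFieldGen("age", vmin=18, vmax=90, seed=0).generate(5)
names = AlphaNameGen("first_name", seed=0).generate(5)
```

Each subclass overrides only the hook that produces a single value, so the base class can own the logic for producing many values at once.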

dataset

The Dataset class represents a set of fields (columns) and provides methods for accessing data by row or column. Generators use datasets as the source of data to write to a destination. If fields need to share data while generating, as linked fields do, they do so through the dataset's shared storage mechanism.
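
A rough sketch of these ideas follows. The method names get_row and get_column, the shared dictionary, and the two field classes are hypothetical choices for illustration. The page does not specify how the real Dataset exposes rows, columns, or shared storage.

```python
class IdField:
    """Hypothetical field producing sequential ids and sharing them."""

    name = "id"

    def generate(self, n, shared):
        values = list(range(1, n + 1))
        shared[self.name] = values  # publish for linked fields
        return values


class LinkedLabelField:
    """Hypothetical field that derives its values from the shared ids."""

    name = "label"

    def generate(self, n, shared):
        return [f"subject-{i}" for i in shared["id"]]


class Dataset:
    """Minimal sketch: a set of fields with row and column access."""

    def __init__(self, name, fields, n_rows):
        self.name = name
        self.n_rows = n_rows
        self.shared = {}  # storage visible to every field while generating
        # Fields are generated in order, so a linked field must come after
        # the field it depends on.
        self.columns = {f.name: f.generate(n_rows, self.shared) for f in fields}

    def get_column(self, name):
        return self.columns[name]

    def get_row(self, index):
        return tuple(col[index] for col in self.columns.values())


ds = Dataset("subjects", [IdField(), LinkedLabelField()], n_rows=3)
```

Here ds.get_column("id") gives the whole id column, and ds.get_row(1) gives the second row across both columns.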

generators

Generators implement the data generation component of the library through subclasses of the DataGenerator class. Methods provided by this class handle writing data to a stream or file, but rely on subclasses to implement the write_stream method, which defines the form in which data is written.

One subclass, CSVGenerator, is provided which generates comma-separated tables of data. Each dataset is written to its own file.
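
The split between the base class and its subclasses might look something like the sketch below. Only DataGenerator, CSVGenerator, and write_stream are names taken from this page. The constructor arguments, write_file, and write_string are assumptions, and the real generator presumably draws its data from a Dataset where this sketch uses a plain header and row list.

```python
import csv
import io


class DataGenerator:
    """Sketch of a base class: handles destinations, defers the format."""

    def __init__(self, header, rows):
        self.header = header
        self.rows = rows

    def write_stream(self, stream):
        raise NotImplementedError("subclasses define the output format")

    def write_file(self, path):
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="") as out:
            self.write_stream(out)

    def write_string(self):
        buffer = io.StringIO()
        self.write_stream(buffer)
        return buffer.getvalue()


class CSVGenerator(DataGenerator):
    """Writes the header and rows as comma-separated values."""

    def write_stream(self, stream):
        writer = csv.writer(stream, lineterminator="\n")
        writer.writerow(self.header)
        writer.writerows(self.rows)


text = CSVGenerator(["id", "name"], [(1, "Ada"), (2, "Bob")]).write_string()
```

Because the destination handling lives in the base class, supporting another output format would only require a new subclass with its own write_stream.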
