Requirements
This document sets out the requirements for RandomDataset.
The purpose of this document is to describe what RandomDataset does and how it is intended to be used. It is aimed at both developers and users. RandomDataset is a utility for generating tabular databases with randomised contents, suitable for testing database software and for demonstrations. It can be used programmatically through its API or on the command line through the included utility program.
A database is composed of multiple datasets. Each dataset is a table of values in which each row represents one instance of a data item and each column, or field, defines a data element that every item has. Databases can be stored in a variety of formats, the simplest being text-based files such as comma-separated values (CSV). Database software often needs data for testing, but real-world data shouldn't be used because of privacy and data protection concerns. A method for generating randomised datasets would be an effective and flexible aid for testing these systems.
RandomDataset generates databases by reading a schema file that describes the datasets and their fields, then producing output data in the selected formats. It is both a command line tool that produces output files from a schema file, and a library that can be used programmatically to build a database and produce output from it. Field values are generated in one of three ways: randomly, such as simple numbers or random strings; by random selection from sets of stored data items, such as personal names; or as sequences, such as ID numbers that start at a value and increment each time one is requested. Fields can also be linked so that values appearing in one dataset are selected to fill fields in another, which is useful for linking instances in one dataset to the IDs of instances in another.
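The generation strategies described above can be illustrated with a minimal Python sketch. This is not RandomDataset's API: the helper names (`random_string`, `make_dataset`), the `FIRST_NAMES` list and the `people`/`orders` datasets are all hypothetical, and Python itself is an assumption about the implementation language. The sketch only shows how random values, values drawn from a stored set, sequential IDs and linked fields could fit together.

```python
import random
import string
from itertools import count

# Hypothetical stored data items to select from (not part of RandomDataset).
FIRST_NAMES = ["Alice", "Bob", "Chandra", "Dmitri"]


def random_string(rng, length=8):
    """Return a random lowercase string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_dataset(n_rows, field_generators):
    """Build a list of rows; each field is filled by calling its generator."""
    return [
        {name: generate() for name, generate in field_generators.items()}
        for _ in range(n_rows)
    ]


rng = random.Random(42)  # seeded so that runs are repeatable

# First dataset: a sequential ID, a name chosen from a stored set, a random number.
person_ids = count(1)
people = make_dataset(3, {
    "id": lambda: next(person_ids),
    "name": lambda: rng.choice(FIRST_NAMES),
    "age": lambda: rng.randint(18, 90),
})

# Second dataset: person_id is linked to IDs that already exist in `people`.
known_person_ids = [person["id"] for person in people]
order_ids = count(1000)
orders = make_dataset(5, {
    "order_id": lambda: next(order_ids),
    "person_id": lambda: rng.choice(known_person_ids),
    "code": lambda: random_string(rng, 6),
})
```

Because the linked field draws only from IDs already generated for the first dataset, every row in `orders` refers to a valid row in `people`, which is the property a linked field is meant to guarantee.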
RandomDataset must have a command line interface for generating data from a user-provided schema file. This schema, defined in YAML format, describes which datasets are present, their fields, and which output formats to generate.
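As a rough illustration of this requirement, the sketch below shows one possible shape for such a command line entry point, using Python's standard `argparse` module. The program name, the positional `schema` argument and the `--output-dir` option are assumptions made for this example and are not taken from RandomDataset itself. Loading the YAML schema and generating data are deliberately left out.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a hypothetical argument parser for a schema-driven generator."""
    parser = argparse.ArgumentParser(
        prog="randomdataset-sketch",  # hypothetical program name
        description="Generate randomised datasets from a YAML schema file.",
    )
    # Path to the user-provided schema file (hypothetical argument name).
    parser.add_argument("schema", type=Path, help="path to the YAML schema file")
    # Directory to write generated files into (hypothetical option).
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("."),
        help="directory in which to write the generated files",
    )
    return parser


# Parse an explicit argument list rather than sys.argv so the sketch is repeatable.
args = build_parser().parse_args(["schema.yaml", "--output-dir", "out"])
```

Keeping the schema as the only required argument reflects the requirement that the schema file, rather than command line options, describes the datasets, fields and formats.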