TraP_trans_tools Script Archive

This folder contains the files required to analyse the transient parameters for a large number of sources. The tools developed are not TraP-specific; however, two executable scripts (process_TraP.py and train_TraP.py) demonstrate how to use them with outputs from TraP databases. The techniques implemented in these scripts, and how to interpret the outputs, are described in Rowlinson et al. (In Prep).

Acknowledgements

This work utilises a number of Python libraries, including the Matplotlib plotting library (Hunter 2007, Computing in Science & Engineering, 9, 90-95) and astroML (Vanderplas et al. 2012, in Conference on Intelligent Data Understanding, 47–54). Finally, the authors acknowledge training in machine learning strategies received from the Stanford Machine Learning course available on Coursera (https://www.coursera.org/course/ml).

Contents

dump_trans_data_v1
Tools to extract required data from a TraP database (TraP release 1). Outputs the transient parameters and individual extracted sources.
format_TraP_data
Tools to convert data extracted by dump_trans_data_v1 into the format required by the other scripts.
generic_tools
General tools to help with processing data.
plotting_tools
Tools to generate the scatter/histogram and diagnostic plots to visualise datasets.
process_TraP
An example executable script that fully processes data from TraP (from reading database to creating plots).
train_anomaly_detect
Tools to train and use an anomaly detection algorithm (machine learning).
train_logistic_regression
Tools to train and use a logistic regression algorithm (machine learning).
train_TraP
An example executable script that trains and applies machine learning algorithms using the outputs from process_TraP.py.
examples
A folder containing example training data for train_TraP.py.

Requirements

  • tkp (The LOFAR Transients Pipeline, TraP)
  • numpy
  • scikit-learn
  • astroML
  • os
  • pylab
  • csv
  • scipy
  • math
  • matplotlib
  • random

dump_trans_data_v1.py

Requirements: tkp (TraP installed and initiated), csv

  • dump_trans

    Inputs: database name, dataset id, engine (monetdb or postgres), host, port, username, password

    Info: queries the database to extract the transient and extracted source tables. Files saved in working directory as "ds_${dataset id}_transients.csv" and "ds_${dataset id}_sources.csv"

  • dump_list_of_dicts_to_csv

    Inputs: data, output filename

    Info: writes the data out to a file in the csv format.
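
For reference, dump_list_of_dicts_to_csv can be sketched with the standard csv module as below; this assumes the query results are a list of dictionaries sharing the same keys, and the repository's implementation may order the columns differently.

    import csv

    def dump_list_of_dicts_to_csv(data, outname):
        # 'data' is assumed to be a non-empty list of dicts with identical keys,
        # e.g. the rows returned by the database query in dump_trans.
        with open(outname, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
            writer.writeheader()
            writer.writerows(data)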

format_TraP_data.py

Requirements: dump_trans_data_v1, generic_tools, os, numpy

  • get_data

    Inputs: database, dataset_id, release, host, port, user, password

    Info: Calls dump_trans_data_v1.py to obtain the data and put it into a CSV file

  • read_src_lc

    Inputs: sources data, lightcurves

    Info: sorts the lightcurves for the unique sources in the extracted-sources data; lightcurves is a True/False option controlling whether the lightcurves are output.

    Outputs: unique frequencies, source lightcurves

  • collate_trans_data

    Inputs: unique source lightcurves, frequencies, transient sources

    Info: sorts the data into the required information for later analysis. Each unique source has its transient parameters, maximum flux, ratio between maximum flux and average flux and observing frequency.

    Outputs: transient data for each source

  • format_data

    Inputs: database, dataset_id, release, host, port, username, password, lightcurves

    Info: uses dump_trans_data to extract data, extracts the transient data for each unique source and saves the output to "ds_${dataset id}_trans_data.csv" for use in later analysis.
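
As an illustration of the per-source quantities described for collate_trans_data, a hypothetical helper (not a function in the repository) that derives the maximum flux and the max/average flux ratio from a single lightcurve might look like:

    import numpy as np

    def max_flux_and_ratio(fluxes):
        # Summarise one lightcurve: the peak flux and the ratio of the
        # peak flux to the mean flux (hypothetical helper for illustration).
        fluxes = np.asarray(fluxes, dtype=float)
        max_flux = fluxes.max()
        return max_flux, max_flux / fluxes.mean()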

generic_tools.py

Requirements: numpy, scipy

  • extract_data

    Inputs: filename

    Info: reads the data in a given file into an array.

    Outputs: array of data

  • get_frequencies

    Inputs: transient data

    Info: finds all the unique values in the "frequency" column. This column typically contains observing frequencies, but is also used to identify different types of transients in the machine learning code.

    Outputs: the unique frequencies

  • get_sigcut

    Inputs: data, sigma

    Info: fits the 1D data with a Gaussian distribution and finds the threshold associated with a given sigma.

    Outputs: sigma threshold, Gaussian fit parameters, range fitted over

  • precision_and_recall

    Inputs: Number of true positive, false positive and false negative identifications

    Info: This calculates the precision (probability that a source is correctly identified) and recall (probability that all sources have been identified) for given results. Required for assessing quality of machine learning results.

    Outputs: precision, recall

  • label_data

    Inputs: data, label1, label2

    Info: Inserts label1 into the frequency column, typically a string giving the type of transient, and appends a new column containing either 1 or 0 to represent transient and stable sources respectively.

    Outputs: labelled data
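
A minimal sketch of get_sigcut and precision_and_recall, assuming get_sigcut fits the data directly with scipy.stats.norm and returning only the threshold (the repository's version also returns the fit parameters and the fitted range):

    import numpy as np
    from scipy.stats import norm

    def get_sigcut(data, sigma):
        # Fit a Gaussian to the 1D data and return the value lying
        # 'sigma' standard deviations above the fitted mean.
        mu, std = norm.fit(np.asarray(data, dtype=float))
        return mu + sigma * std

    def precision_and_recall(tp, fp, fn):
        # precision = TP / (TP + FP); recall = TP / (TP + FN)
        precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
        recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
        return precision, recall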

plotting_tools.py

Requirements: numpy, scipy, matplotlib, math, pylab, astroML

  • make_colours

    Inputs: unique frequencies

    Info: assigns a colour from a colourmap (jet) to each unique frequency for plotting.

    Outputs: colours

  • create_scatter_hist

    Inputs: data for plotting, sigma thresholds for x-axis and y-axis, parameters from Gaussian fit for x and y axes, range used for fitting Gaussian distributions, dataset id, unique frequencies

    Info: creates a plot showing the two transient parameters (typically Eta_nu and V_nu) with histograms and fitted Gaussian distributions. If the thresholds are not equal to 0, it also plots dashed lines representing the thresholds used on the transient parameters. Plot saved as "ds${dataset id}_scatter_hist.png"; an example plot is provided.

    Outputs: a list of transient id numbers

  • create_diagnostic

    Inputs: data for plotting, thresholds for x-axis and y-axis, unique frequencies, dataset id

    Info: creates a scatter plot showing four transient parameters (typically Eta_nu, V_nu, the maximum flux and the ratio between the maximum and average flux). If the thresholds are not equal to 0, it also plots dashed lines representing the thresholds used on the transient parameters. Plot saved as "ds${dataset id}_diagnostic_plots.png"; an example plot is provided.
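
make_colours can be sketched as below (a minimal version; the exact colourmap handling in the repository may differ):

    import numpy as np
    import matplotlib.pyplot as plt

    def make_colours(frequencies):
        # One evenly spaced colour from the 'jet' colourmap per unique frequency.
        cmap = plt.get_cmap('jet')
        return [cmap(x) for x in np.linspace(0, 1, len(frequencies))]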

process_TraP.py

Requirements: format_TraP_data, plotting_tools, generic_tools, numpy, sys

An example executable script for processing TraP data. Usage:

python process_TraP.py <database> <username> <password> <dataset_id> <release> <host> <port> <sigma1> <sigma2> <lightcurves>

<database>: name of TraP database containing data

<username>: your database username

<password>: your database password

<dataset_id>: the dataset id that is to be processed

<release>: TraP release and engine, options p and m (postgres and monetdb respectively)

<host>: The machine hosting the database

<port>: The port number for the machine hosting the database

<sigma1>: sigma threshold for use in determining threshold on Eta_nu

<sigma2>: sigma threshold for use in determining threshold on V_nu

<lightcurves>: "True/False" option to output the lightcurves for each unique source

This script will extract data from the database, identify unique sources and obtain their lightcurves, sort the transient parameters and create the various diagnostic plots.
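
For example, a hypothetical invocation for dataset 42 in a PostgreSQL-backed TraP database, with 2-sigma thresholds on both parameters and no lightcurve output (all values below are placeholders), would be:

python process_TraP.py my_trap_db my_user my_pass 42 p localhost 5432 2 2 F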

train_anomaly_detect.py

Requirements: generic_tools, numpy, multiprocessing, scipy, operator, matplotlib, pylab

  • trial_data

    Inputs: data, sigma1, sigma2

    Info: tries out a given pair of thresholds on the labelled data. It calculates the true positives, false positives, true negatives and false negatives. These are then used to calculate the precision and recall.

    Outputs: sigma1, sigma2, precision, recall

  • multiple_trials

    Inputs: data

    Info: Runs trial_data using different sigma values; sigma1 and sigma2 both range from 0 to 4 sigma with 500 bins. This uses a multiprocessing pool with 4 processes. The results are appended to a file, "sigma_data.txt".

  • tests

    Inputs: list containing x sigma data, y sigma data, precisions, recalls, training data, transient x data, transient y data, stable x data, stable y data, required precision, required recall

    Info: Using the gridded precision and recall values for different combinations of sigma, find the best match to the input required precision and recall. Then use observed training data to measure the obtained precision and recall values. These should be roughly equal.

    Outputs: list containing: required precision, required recall, obtained precision, obtained recall

  • check_method_works

    Inputs: list containing x sigma data, y sigma data, precisions, recalls, training data, above threshold sigma

    Info: Runs the test function multiple times for a wide range of input precisions and recalls. Plots a figure to show the performance.

  • find_best_sigmas

    Inputs: required precision, required recall, sigma data

    Info: Creates a 2000x2000 grid from the data in "sigma_data.txt" using cubic interpolation between the trialled data points. The parameter space that gives the required precision and recall is identified, and an F-score is then used to find the optimal balance of precision and recall within that parameter space. A plot illustrating the precision and recall parameter space is output; an example is provided.

    Outputs: best sigma threshold for Eta_nu, best sigma threshold for V_nu
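
The core of trial_data can be sketched as follows, assuming each row of the labelled data holds the two (log-scaled) transient parameters followed by a 1/0 transient/stable label; the actual column layout and threshold fitting in the repository may differ:

    import numpy as np
    from scipy.stats import norm

    def trial_data(data, sigma1, sigma2):
        # Apply a pair of sigma thresholds, fitted to the stable population,
        # and score the resulting classification.
        eta, v, labels = data[:, 0], data[:, 1], data[:, 2]
        stable = labels == 0
        mu_e, std_e = norm.fit(eta[stable])
        mu_v, std_v = norm.fit(v[stable])
        flagged = (eta > mu_e + sigma1 * std_e) & (v > mu_v + sigma2 * std_v)
        tp = int(np.sum(flagged & ~stable))
        fp = int(np.sum(flagged & stable))
        fn = int(np.sum(~flagged & ~stable))
        precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
        recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
        return sigma1, sigma2, precision, recall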

train_logistic_regression.py

Requirements: generic_tools, numpy, scipy, random, matplotlib, pylab

  • shuffle_datasets

    Inputs: data

    Info: Ensures that your data is randomised so that the training, validation and testing datasets do not contain too many of one kind of source.

    Outputs: shuffled data

  • create_datasets

    Inputs: data, number of training datapoints, number of validation datapoints

    Info: splits the data array into three parts with the required numbers of training and validation datapoints; the remaining data form the testing dataset.

    Outputs: training, validation and testing datasets

  • create_X_y_arrays

    Inputs: data

    Info: Splits the data into the parameters and labels (as required for the machine learning algorithm).

    Outputs: parameters and labels

  • sigmoid

    Inputs: value

    Info: calculates the sigmoid of a given value (1/(1+e^(-z)))

    Outputs: sigmoid(value)

  • reg_cost_func

    Inputs: theta, X, y, lda

    Info: Calculates the regularised cost function for a given model (theta) and dataset. The lda (lambda) parameter regularises it, i.e. controls the weighting given to multiple parameters.

    Outputs: cost of the model

  • quadratic_features

    Inputs: data

    Info: doubles the number of parameters in the model by appending their squares, i.e. [x1, x2] becomes [x1, x2, x1^2, x2^2].

    Outputs: quadratic data

  • learning_curve

    Inputs: Xtrain, ytrain, Xvalid, yvalid, lda, options for scipy.optimise

    Info: finds the optimal model for a given training set and calculates the training and validation errors for that model. The training set starts with 1 datapoint and is incremented by 1 until the full training set is used. This test can check that the model is converging.

    Outputs: training and validation errors, theta

  • check_error

    Inputs: X, y, theta

    Info: measures the classification error for a given dataset and model.

    Outputs: error

  • validation_curve

    Inputs: Xtrain, ytrain, Xvalid, yvalid, options for scipy.optimise

    Info: Uses a range of lambda values (1e-5 to 1e5) as input to the training algorithm to check that the model is not overfitting the data; this can also be used to choose the optimal lambda value.

    Outputs: training and validation errors, lambda values, optimal lambda

  • plotLC

    Inputs: error_train, error_val, fname, xlog (True/False), ylog (True/False), xlabel

    Info: A plotting algorithm used to create figures showing the training and validation errors. Example learning and validation curves are provided.

  • classify_data

    Inputs: X, y, theta

    Info: Classifies a given dataset and compares the predictions with the known labels to identify the true positives, false positives, true negatives and false negatives.

    Outputs: tp, fp, fn, tn, classified data

  • predict

    Inputs: X, theta

    Info: Predicts the classification of new, unknown data.

    Outputs: predicted classifications
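
The key pieces of the logistic regression machinery described above (sigmoid, reg_cost_func, quadratic_features and predict) can be sketched as below; this assumes X already carries a leading column of ones for the bias term, and the exact conventions in the repository may differ:

    import numpy as np

    def sigmoid(z):
        # sigmoid(z) = 1 / (1 + e^(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def reg_cost_func(theta, X, y, lda):
        # Regularised logistic-regression cost; the bias term theta[0]
        # is conventionally excluded from the regularisation.
        m = float(len(y))
        h = sigmoid(X.dot(theta))
        cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
        return cost + (lda / (2.0 * m)) * np.sum(theta[1:] ** 2)

    def quadratic_features(X):
        # [x1, x2] -> [x1, x2, x1^2, x2^2]
        return np.hstack([X, X ** 2])

    def predict(X, theta):
        # Classify as transient (1) when the model probability exceeds 0.5.
        return (sigmoid(X.dot(theta)) >= 0.5).astype(int)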

train_sigma_margin.py

Requirements: numpy, matplotlib, pylab, generic_tools

  • sort_data

    Inputs: The dataset to be used

    Info: finds the best and worst expected detection significances for each source and extracts the detection threshold

    Outputs: The best and worst thresholds and the detection threshold

  • find_sigma_margin

    Inputs: best significance observed data, worst significance observed data, best significances simulated data, worst significances simulated data, detection threshold

    Info: Find the precision, recall and F-score for a range of different margins applied to the best and worst significances

    Outputs: best plot data, worst plot data

  • plot_hist

    Inputs: Observed data, simulated data, detection threshold, label for figure name

    Info: Creates histograms of the input data

  • plot_diagnostic

    Inputs: best plot data, worst plot data

    Info: Create a diagnostic plot illustrating the precision, recall and F-score as a function of the sigma margin. Identify the optimal margins.

    Outputs: Optimal best sigma margin, Optimal worst sigma margin
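
The margin scan in find_sigma_margin can be illustrated with a hypothetical helper (the name, arguments and data layout below are assumptions, not the repository's API): for each trial margin, sources whose expected significance exceeds the detection threshold plus the margin are flagged, and the precision, recall and F-score are computed by treating the simulated transients as the positives.

    import numpy as np

    def scan_margins(observed_sigmas, simulated_sigmas, detection_threshold, margins):
        # Hypothetical illustration: flag sources above (threshold + margin) and
        # score each margin with precision, recall and the F1-score.
        results = []
        for margin in margins:
            cut = detection_threshold + margin
            tp = int(np.sum(np.asarray(simulated_sigmas) > cut))
            fn = int(np.sum(np.asarray(simulated_sigmas) <= cut))
            fp = int(np.sum(np.asarray(observed_sigmas) > cut))
            precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
            recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
            fscore = (2.0 * precision * recall / (precision + recall)
                      if (precision + recall) else 0.0)
            results.append((margin, precision, recall, fscore))
        return results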

train_TraP.py

Requirements: train_anomaly_detect, train_logistic_regression, plotting_tools, generic_tools, glob, sys, numpy

An example executable script for processing TraP data. Usage:

python train_TraP.py <precision threshold> <recall threshold> <lda> <anomaly> <logistic> <trans> <tests>

<precision threshold>: required precision of transient identification (1 - False Detection Rate). A probability in the range 0-1

<recall threshold>: required recall, i.e. the probability that all transients are found (0-1)

<lda>: the lambda value to be used in the logistic regression algorithm

<anomaly>: train anomaly detection algorithm? T/F (if F give 0 for both the precision and recall thresholds)

<logistic>: train logistic regression algorithm? T/F (if F give 0 for lda)

<trans>: train the transient detection algorithm? T/F

<tests>: run the test scripts for the anomaly detection and logistic regression algorithms? T/F

This script uses pre-processed datasets in the format output by format_TraP_data.format_data. The stable sources are in a file named "stable_trans_data.txt". Transient sources are in files named "sim_${transient type}_trans_data.txt", where the transient type is a short string describing the type of transient source (used for labelling sources in the diagnostic plots instead of the frequency parameter). The script trains both the anomaly detection algorithm and the logistic regression algorithm, outputting diagnostic plots. The anomaly detection algorithm outputs the best transient search thresholds for use in e.g. TraP, while the logistic regression algorithm outputs an equation that can classify sources. Each method reports its precision and recall. Additionally, both output text files listing the candidate transient and variable sources identified. Example training files and output plots are provided with the repository (see the examples folder).
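
For example, a hypothetical invocation that trains both the anomaly detection and logistic regression algorithms with a required precision and recall of 0.95, a lambda of 0.01, no transient detection training and the test scripts enabled (all values are placeholders) would be:

python train_TraP.py 0.95 0.95 0.01 T T F T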
