Repository for the paper "Odyssey: A Journey in the Land of Distributed Data Series Similarity Search", Manos Chatzakis, Panagiota Fatourou, Eleftherios Kosmas, Themis Palpanas, Botao Peng. VLDB 2023
Odyssey is a modular distributed similarity search framework, appropriate for big data series collections.
To compile the program from scratch, use
cd ads/
mkdir results
mkdir perf
chmod u+x configure
./configure
make
Or easier,
cd ads/run_automation/
chmod u+x compile.sh
./compile.sh
The main parameters of Odyssey are listed below:
mpirun ../bin/ads --dataset [<string>]
--dataset-size [<int>]
--function-type [<int> (default 9990)]
--in-memory
--queries [<string>]
--queries-size [<int>]
--timeseries-size [<int>]
--index-workers [<int>]
--query-workers [<int>]
--query-mode
--basis-function-filename [<string>]
--th-div-factor [<int>]
--workstealing [<int> (0 or 1)]
--node-groups [<int>]
--preprocess [<int> (0 or 1)]
Odyssey consists of a set of modules that the user selects to use to create the index and start answering queries.
The function below is used for chunk-splitting of the dataset among the available nodes.
/**
* @brief Splits the data into chunks, each node group gets assigned a chunk and builds the index.
*
* @param ifilename Dataset path
* @param ts_num Total data seties
* @param index iSAX index parameter
* @param parallelism_in_subtree Flag for index creation mode
* @param node_groups Node groups array
* @param total_node_groups Total number of node groups
* @param total_nodes_per_nodegroup Total number of nodes in each group
* @return float* Raw float array
*/
float *index_creation_node_groups_chatzakis(const char *ifilename, long int ts_num, isax_index *index, const char parallelism_in_subtree,
pdr_node_group *node_groups, int total_node_groups, int total_nodes_per_nodegroup);
The index preprocessing function below is used to distribute the data to each node before index creation.
/**
* @brief Density aware distribution of the data
*
* @param ifilename Dataset path
* @param ts_num Total data seties
* @param index iSAX index parameter
* @param parallelism_in_subtree Flag for index creation mode
* @param node_groups Node groups array
* @param total_node_groups Total number of node groups
* @param total_nodes_per_nodegroup Total number of nodes in each group
* @return float* Raw float array
*/
float *index_creation_node_groups_chatzakis_botao_prepro(const char *ifilename, long int ts_num, isax_index *index,
const char parallelism_in_subtree, pdr_node_group *node_groups,
int total_node_groups, int total_nodes_per_nodegroup);
Static Scheduling
/**
* @brief Static query scheduling
*
* @param ifilename Query file path
* @param q_num Total queries
* @param index Data series index
* @param minimum_distance The minimum starting distance (usually set to INF)
* @param basis_function_filename Priority queue size prediction parameters (only when workstealing is used)
* @param search_function Exact search function
* @param ws_search_function Workstealing exact search function (only when workstealing is used)
* @return float* An array holding the answer for each query
*/
float *isax_query_answering_pdr_baseline_chatzakis(const char *ifilename, int q_num, isax_index *index, float minimum_distance,
char *basis_function_filename,
query_result (*search_function)(search_function_params args),
query_result (*ws_search_function)(ws_search_function_params ws_args));
Static Query-Prediction Scheduling (Greedy)
/**
* @brief Dynamic Query Scheduling
*
* @param ifilename Query file path
* @param q_num Total queries
* @param index Data series index
* @param minimum_distance The minimum starting distance (usually set to INF)
* @param sort Sort on not the queries (only for predictions)
* @param estimations_filename File Containing the prediction for each query (only for predictions)
* @param basis_function_filename Priority queue size prediction parameters (only when workstealing is used)
* @param field_to_cmp Sort the queries based on initial BSF or estimation file
* @param search_function Exact search function
* @param approximate_search_func Approximate search function (only for predictions and initial-bsf mode)
* @param ws_search_function Workstealing exact search function (only when workstealing is used)
* @return float* An array holding the answer for each query
*/
float *isax_query_answering_pdr_greedy_chatzakis(const char *ifilename, int q_num, isax_index *index, float minimum_distance,
char *estimations_filename, char *basis_function_filename, char sort,
greedy_alg_comparing_field field_to_cmp,
query_result (*search_function)(search_function_params args),
query_result (*approximate_search_func)(ts_type *, ts_type *, isax_index *index),
query_result (*ws_search_function)(ws_search_function_params ws_args))
Dynamic Scheduling
/**
* @brief Dynamic Query Scheduling
*
* @param ifilename Query file path
* @param q_num Total queries
* @param index Data series index
* @param minimum_distance The minimum starting distance (usually set to INF)
* @param sort Sort on not the queries (only for predictions)
* @param estimations_filename File Containing the prediction for each query (only for predictions)
* @param basis_function_filename Priority queue size prediction parameters (only when workstealing is used)
* @param mode Coordinator module mode: Module, Thread or Coordinator
* @param field_to_cmp Sort the queries based on initial BSF or estimation file
* @param search_function Exact search function
* @param approximate_search_func Approximate search function (only for predictions and initial-bsf mode)
* @param ws_search_function Workstealing exact search function (only when workstealing is used)
* @return float* An array holding the answer for each query
*/
float *isax_query_answering_pdr_dynamic_chatzakis(const char *ifilename, int q_num, isax_index *index, float minimum_distance, char sort,
char *estimations_filename, char *basis_function_filename,
dynamic_modes mode,
greedy_alg_comparing_field field_to_cmp,
query_result (*search_function)(search_function_params args),
query_result (*approximate_search_func)(ts_type *, ts_type *, isax_index *index),
query_result (*ws_search_function)(ws_search_function_params ws_args));
The exact search routines use an input argument struct:
/**
* @brief Arguments needed for exact search routines
*
*/
typedef struct search_function_params
{
// Mandatory
int query_id;
ts_type *ts;
ts_type *paa;
isax_index *index;
node_list *nodelist;
float minimum_distance;
// Non-Mandatory
double (*estimation_func)(double);
communication_module_data *comm_data; // might cause include probs
float *shared_bsf_results;
} search_function_params;
/**
* @brief Arguments needed for workstealing routines of Odyssey
*
*/
typedef struct ws_search_function_params
{
// Mandatory
int query_id;
ts_type *ts;
ts_type *paa;
isax_index *index;
node_list *nodelist;
float minimum_distance;
float bsf;
int *batch_ids;
double (*estimation_func)(double);
communication_module_data *comm_data;
float *shared_bsf_results;
} ws_search_function_params;
The code currently supports the exact search routine of MESSI and the new, workstealing-supporting algorithm of Odyssey.
/**
* @brief MESSI exact search
*
* @param args Exact search arguments
* @return query_result Contains the float result of the query
*/
query_result exact_search_ParISnew_inmemory_hybrid_ekosmas(search_function_params args);
/**
* @brief Odyssey exact search
*
* @param args Exact search arguments
* @return query_result Contains the float result of the query
*/
query_result exact_search_workstealing_chatzakis(search_function_params args);
/**
* @brief Odyssey workstealing (helper node) algorithm
*
* @param ws_args Workstealing algorithms
* @return query_result Contains the float result of the query
*/
query_result exact_search_workstealing_helper_single_batch_chatzakis(ws_search_function_params ws_args);
To add new exact search routines, you need to define a function with the following prototype
query_result name_of_function(search_function_params args){
...
}
For the args parameter, you may
- Initialize the mandatory fields, indicated above
- Add any new field needed, as optional field
Then, you can provide the function as a parameter to any query scheduling function presented in the previous section.
Odyssey contains two generators: A random walk dataset generator (Each data point in the data series is produced as xi+1=N(xi,1), where N(0,1) is a standard normal distribution) and a query generator (Each query is extracted from the dataset and then noise is added).
To compile the generator:
cd dataset/dataset/
#gcc main.c -o generator -lm -lgsl -lgslcblas -lreadline
make
Usage Example:
# Generate 1000 z-normalized data series of length 256:
./generator --size 1000 --length 256 --z-normalize > /path/to/save/file.bin
Example/Usage scripts are provided in the corresponding directory.
To compile the generator
cd ads/run_automation/query_analysis/query_generation/
gcc query_generator.c -o query_generator -lm
Usage Example:
# Noise is usually set to 0.0-0.1
./query_generator $DATASET_PATH $QUERIES_PATH $QUERIES_NUMBER $DATASET_LEN $DATASET_SIZE $NOISE
Example/Usage scripts are provided in the corresponding directory.
We are evaluating the performance of Odyssey using the datasets listed below.
- Seismic: Contains seismic instrument recordings from stations around the world
- Astro: Represents celestial objects
- Deep: Contains deep embeddings originating from a convoltional neural network
- Sift: Contains image (sift) descriptors
We contucted our experimental evaluation with Odyssey algorithms using automated python scripts. Those scripts are configured to schedule specific types of experiments selecting and evaluating different modules of the code.
cd ads/run_automation/
python3 scheduler.py # after configuration
The results of the runs made by scheduler.py are saved in folders with strict hierarchy, that is later exploited by statistics.py script to produce the experimental results of the different runs.
cd ads/run_automation/
python3 statistics.py # after configuration
- Manos Chatzakis, University of Crete and ICS-FORTH ([email protected]), Currently at EPFL.
- Panagiota Fatourou, University of Crete, ICS-FORTH, University of Paris and LIPADE ([email protected])
- Eleftherios Kosmas, University of Crete ([email protected])
- Themis Palpanas, University of Paris and LIPADE ([email protected])
- Botao Peng, Chinese Academy of Sciences ([email protected])
Odyssey is developed under Platon research project.