RapidMiner Integration
WInte.r can learn matching rules using WEKA, as described in the section Learning Matching Rules. In addition, tools such as RapidMiner can train supervised machine learning models, including classifiers that can be used for data matching. To let users work with the tools of their choice, WInte.r can generate and export training data so that matching rule models can be learned outside of WInte.r. This documentation shows how to use RapidMiner to learn matching rule models and import them back into WInte.r using the WEKA or PMML format.
An example identity resolution process using RapidMiner can be found in the movies use case. A corresponding RapidMiner repository containing sample processes is also provided, so the whole process can be run locally by adding that repository to your own RapidMiner repositories.
Each matching rule provides a method to export the training data that is used to train the corresponding matching rule model. A model trained externally therefore uses the same training data as a model trained in WInte.r.
Inputs to the export method are the two data sets that are being integrated, a goldstandard containing matches for those two data sets, and the target file for the training data. Calling this method internally triggers the feature generation, which executes the matching rule's comparators to calculate similarities. Hence, the comparators must be added to the matching rule before the export is called. Each line of the training data CSV file then contains the calculated similarity value of every comparator that was added to the matching rule. As a matching rule may contain multiple comparators, the header of each comparator's column is prefixed with the comparator's position followed by the comparator's name. For example, the MovieTitleComparatorEqual, which is the first comparator of the matching rule, has the header '[0] MovieTitleComparatorEqual'. Additionally, each line has a column 'label', which states whether the line describes a match. The label is inferred from the provided goldstandard: matching pairs from the goldstandard result in a "1", whereas non-matching pairs result in a "0".
[0] MovieTitleComparatorEqual | [1] MovieDateComparator2Years | ... | label |
---|---|---|---|
1 | 0.5 | ... | 1 |
0 | 0.5 | ... | 1 |
... | ... | ... | ... |
// Export training data
matchingRule.exportTrainingData(dataAcademyAwards, dataActors, gsTest,
    new File("usecase/movie/output/optimisation/academy_awards_2_actors_features.csv"));
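The header convention and the label encoding described above can be illustrated with a small, self-contained sketch. It is not WInte.r's implementation: the class `FeatureHeaderSketch` and its methods `buildHeader` and `labelFor` are hypothetical names introduced only for illustration, and the comparator names are taken from the example table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of WInte.r: illustrates the column naming only.
class FeatureHeaderSketch {

    // Builds the header row: "[position] ComparatorName" per comparator, then "label".
    static List<String> buildHeader(List<String> comparatorNames) {
        List<String> header = new ArrayList<>();
        for (int i = 0; i < comparatorNames.size(); i++) {
            header.add("[" + i + "] " + comparatorNames.get(i));
        }
        header.add("label");
        return header;
    }

    // Pairs listed as matches in the goldstandard map to "1", all others to "0".
    static String labelFor(boolean isMatchInGoldstandard) {
        return isMatchInGoldstandard ? "1" : "0";
    }

    public static void main(String[] args) {
        List<String> header = buildHeader(
                Arrays.asList("MovieTitleComparatorEqual", "MovieDateComparator2Years"));
        // prints: [0] MovieTitleComparatorEqual | [1] MovieDateComparator2Years | label
        System.out.println(String.join(" | ", header));
    }
}
```

Because the position is part of the column name, adding comparators in a different order produces a different schema, even if the same comparators are used.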
The screenshot below shows an example RapidMiner process that trains a matching rule model using a decision tree learner. It consists of three parts: first, the CSV file containing the training data is loaded; then the decision tree model is learned; finally, the trained model is exported in the PMML format.
When importing the training data into RapidMiner, a couple of rules have to be followed to ensure that the learned model is compatible with WInte.r.
In the import wizard of the Read CSV operator, select the file that was written by the training data export in WInte.r: </your path to winte.r>/usecase/movie/Rapidminer/data/optimisation/academy_awards_2_actors_features.csv
During data formatting, do not rename or delete any column, so that the data schema stays as provided by WInte.r. Change the type of the column 'label' to binominal and set the role of this column to label. Afterwards, complete the data import by clicking the Finish button.
WInte.r supports the WEKA and PMML model formats for importing matching models. Thus, any model that is exported from RapidMiner has to be in one of these two formats. As the RapidMiner WEKA extension mainly contains operators that are already available in WInte.r, as described in Learning Matching Rules, this documentation focuses on PMML models.
To export PMML models from RapidMiner, the RapidMiner PMML extension has to be installed. This extension provides the Write PMML operator to export machine learning models in the PMML format.
The list of models that can be exported by the Write PMML operator is limited and can be found in the operator's description. We tested the decision tree and the linear regression. Both accept the training data provided by the Read CSV operator as described above.
The trained model has to be passed to the Write PMML operator. Apart from the model, the user has to specify the location where the model is written: </your path to winte.r>/usecase/movie/Rapidminer/models/matchingModel.pmml.
Afterwards, the RapidMiner process can be executed to train and export a matching rule model.
The trained matching rule model can be loaded via the readModel(File file) method of the WekaMatchingRule.
// import matching rule model
matchingRule.readModel(new File("usecase/movie/input/model/matchingRule/matchingModel.pmml"));
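Before handing an exported file to readModel, it can be useful to confirm that it is at least well-formed XML with a PMML root element. The following is a minimal sketch using only the JDK's XML parser; the class `PmmlSanityCheck` and the method `looksLikePmml` are hypothetical names, not part of WInte.r, and the check says nothing about whether WInte.r can actually use the model.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of WInte.r.
class PmmlSanityCheck {

    // Returns true if the stream holds well-formed XML whose root element is <PMML>.
    static boolean looksLikePmml(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // PMML files usually declare a namespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(in);
            Element root = doc.getDocumentElement();
            return "PMML".equals(root.getLocalName());
        } catch (Exception e) {
            // Unreadable or malformed input is treated as "not PMML".
            return false;
        }
    }

    public static void main(String[] args) {
        // Tiny stand-in document; a real export contains a full model definition.
        String xml = "<PMML version=\"4.2\" xmlns=\"http://www.dmg.org/PMML-4_2\"><Header/></PMML>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(looksLikePmml(in));
    }
}
```

For a real file, a `java.io.FileInputStream` opened on the exported model would be passed instead of the in-memory stream.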
The trained matching model can be applied to any matching task that follows the same schema as defined for the training data.
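One way to read "same schema" is that the feature columns of the new task carry the same names, in the same order, as the columns of the training data. The sketch below checks exactly that on plain lists of column names. It is an illustration under that assumption: `SchemaCheckSketch`, `featureColumns` and `sameSchema` are hypothetical names, and the source does not state whether column order matters to WInte.r, so the check is deliberately strict.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of WInte.r.
class SchemaCheckSketch {

    // Returns the feature columns of a header, i.e. every column except "label".
    static List<String> featureColumns(List<String> header) {
        List<String> features = new ArrayList<>(header);
        features.remove("label");
        return features;
    }

    // True if both headers list the same feature columns in the same order.
    static boolean sameSchema(List<String> trainingHeader, List<String> taskHeader) {
        return featureColumns(trainingHeader).equals(featureColumns(taskHeader));
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList(
                "[0] MovieTitleComparatorEqual", "[1] MovieDateComparator2Years", "label");
        // Same comparators added in the opposite order: positions differ, so names differ.
        List<String> swapped = Arrays.asList(
                "[0] MovieDateComparator2Years", "[1] MovieTitleComparatorEqual");
        System.out.println(sameSchema(training, training)); // prints: true
        System.out.println(sameSchema(training, swapped));  // prints: false
    }
}
```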