Add support for stats::formula objects #35

Open
schlichtanders opened this issue Dec 5, 2017 · 5 comments

@schlichtanders

The package's README (https://github.com/jpmml/r2pmml#model-formulae) says that one can use R formula syntax to define ordinary arithmetic processing of the data when using GLM and similar models.

Are such formulae also supported independently of LM/GLM? I mean, to create simple models that involve nothing but simple arithmetic.

If so, can you provide example code? If not, can it be supported in general?

@vruusmann
Member

In R there are two approaches for declaring label and features:

  1. "Matrix interface": model(x = features, y = label)
  2. "Formula interface": model(label ~ features, data = data)

Arithmetic operations are supported only with the "formula interface", because this way they become part of the model object (e.g. they can be serialized/deserialized in RDS data format). However, support for the "formula interface" varies considerably between R packages: it is best supported by several built-in packages (e.g. the stats package, which provides the glm() and lm() functions), reasonably supported by several others (e.g. the earth and randomForest packages), and not supported at all by many more.

You need to check the documentation of your target R package/function to see whether it supports the "formula interface".
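As a rough illustration of the difference between the two interfaces, here is a minimal sketch using only the stats package. The toy data and variable names are made up, and lm.fit() merely stands in for a generic "matrix interface" function.

```r
# Toy data, made up for illustration only
set.seed(42)
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
df$y <- 2 * (df$x1 + df$x2) + rnorm(20, sd = 0.1)

# Formula interface: the arithmetic is written inside the formula
fit_formula <- lm(y ~ I(x1 + x2), data = df)

# Matrix interface: the arithmetic has to be done by hand beforehand
x <- cbind(intercept = 1, sum = df$x1 + df$x2)
fit_matrix <- lm.fit(x = x, y = df$y)

# The formula-based model object remembers the transformation ...
term_labels <- attr(terms(fit_formula), "term.labels")
print(term_labels)

# ... whereas the matrix-based fit carries no terms at all,
# only coefficients for the precomputed columns
has_terms <- !is.null(fit_matrix[["terms"]])
print(has_terms)
print(names(fit_matrix$coefficients))
```

Both fits estimate the same coefficients, but only the first one carries enough information for a converter to reconstruct the `x1 + x2` step.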

> If possible, can you provide example code?

See the following presentation:
https://www.slideshare.net/VilluRuusmann/converting-r-to-pmml-82182483

There are many in-formula feature engineering examples starting from slide 13.

@schlichtanders
Author

schlichtanders commented Dec 5, 2017 via email

@vruusmann
Member

> These are exactly the things which I would like to have WITHOUT wrapping it into a linear model

You mean taking a stats::formula object, and converting it into a PMML fragment?

formula = as.formula(...)
r2pmml(formula, "formula.pmml")

What will happen to those PMML fragments afterwards? Do you want to copy-paste them manually somewhere else?

The PMML thinking is that formula objects cannot exist in isolation. They have to be associated with a model object or, alternatively, be converted to some sort of function definition (typically a DerivedField element).

However, it would be possible to teach the r2pmml package to take notice of stats::formula objects, and emit a partial result in this case (i.e. the result wouldn't be a complete PMML document, but a fragment of it).
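For context, on the R side a formula object can already be created and applied to data without fitting any model. The sketch below uses only the stats package; the data frame is made up for illustration, and nothing here involves r2pmml.

```r
# Made-up input data
df <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))

# A standalone, one-sided formula: just the arithmetic, no label, no model
f <- as.formula(~ I(x1 + x2))
print(class(f))

# model.frame() evaluates the formula terms against the data
mf <- model.frame(f, data = df)
print(names(mf))

# Strip the "AsIs" wrapper that I() adds to get a plain numeric vector
derived <- as.numeric(mf[[1]])
print(derived)  # 11 22 33
```

This is the R-side behaviour that a partial conversion would need to reproduce as a DerivedField.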

@schlichtanders
Author

Thank you very much for the explanations and for paraphrasing my thoughts.
Now I really feel understood.

Thanks a lot!

@vruusmann
Member

vruusmann commented Dec 7, 2017

Suppose you create a stats::formula object like this:

#library("r2pmml")
formula = as.formula(y ~ I(x1 + x2))
#r2pmml(formula, "formula.pmml")

A formula object could be translated to a singleton DerivedField element. However, this element cannot exist in isolation; there must be accompanying DataField elements that define its input fields, and the DerivedField itself must declare its output field (name, data and operational types, etc).

A corresponding PMML fragment might look like this:

<PMML>
  <DataDictionary>
    <DataField name="x1" dataType="double" optype="continuous"/>
    <DataField name="x2" dataType="double" optype="continuous"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="y" dataType="double" optype="continuous">
      <Apply function="+">
        <FieldRef field="x1"/>
        <FieldRef field="x2"/>
      </Apply>
    </DerivedField>
  </TransformationDictionary>
</PMML>

This kind of "partial conversion" can be very helpful if you're trying to convert a piece of R (or Python) code into PMML. It will be very easy to copy the above DataField and DerivedField elements and paste them into some other PMML document (that needs to be enhanced with more feature engineering logic).
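To make the idea concrete, here is a rough sketch in plain R of how such a fragment could be derived from the formula object. It is not how r2pmml works internally; expr_to_pmml() is a hypothetical helper that handles only field names, numeric constants, I(), parentheses and the binary operators + - * /, and it assumes every field is a continuous double.

```r
# Hypothetical helper: translate an R expression into PMML-like XML text
expr_to_pmml <- function(expr, indent = "") {
  if (is.name(expr)) {
    return(sprintf('%s<FieldRef field="%s"/>', indent, as.character(expr)))
  }
  if (is.numeric(expr)) {
    return(sprintf('%s<Constant dataType="double">%s</Constant>',
                   indent, format(expr)))
  }
  fun <- as.character(expr[[1]])
  if (fun %in% c("I", "(")) {
    # I() and parentheses only group their argument, so unwrap them
    return(expr_to_pmml(expr[[2]], indent))
  }
  # Anything other than a binary arithmetic operator is rejected
  stopifnot(fun %in% c("+", "-", "*", "/"), length(expr) == 3)
  args <- vapply(as.list(expr)[-1], expr_to_pmml, character(1),
                 indent = paste0(indent, "  "))
  paste(c(sprintf('%s<Apply function="%s">', indent, fun),
          args,
          sprintf("%s</Apply>", indent)),
        collapse = "\n")
}

formula <- as.formula(y ~ I(x1 + x2))
label <- as.character(formula[[2]])  # left-hand side: name of the derived field
inputs <- all.vars(formula[[3]])     # right-hand side: names of the input fields

# Assumption: all fields are continuous doubles
data_fields <- sprintf(
  '<DataField name="%s" dataType="double" optype="continuous"/>', inputs)
derived_field <- paste(
  sprintf('<DerivedField name="%s" dataType="double" optype="continuous">',
          label),
  expr_to_pmml(formula[[3]], indent = "  "),
  "</DerivedField>",
  sep = "\n")

cat(data_fields, derived_field, sep = "\n")
```

The printed DataField and DerivedField elements still have to be placed into the DataDictionary and TransformationDictionary of a target PMML document by hand.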

@vruusmann reopened this on Dec 7, 2017
@vruusmann changed the title from "Independent Preprocessing with model Formulae" to "Add support for stats::formula objects" on Dec 7, 2017