* Added new method compatible to itemMatrix to check if the item coding is compatible between two objects.
* c() now produces a warning if two itemMatrices with different itemCoding are combined.
* encode and recode accept now for itemLabels also objects with an itemLabels method.
* recode is now also available for association (itemsets and rules).
* recode: parameter match is now deprecated.
* Fixed some TYPOs.
* Added item hierarchy and item coding to vignette.

mhahsler · May 15, 2021 · 7d6cd0c
1 parent 863d5d6
commit 7d6cd0c
Showing 23 changed files with 379 additions and 125 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: arules
-Version: 1.6-7
-Date: 2021-03-12
+Version: 1.6-7.1
+Date: 2021-xx-xx
 Title: Mining Association Rules and Frequent Itemsets
 Authors@R: c(person("Michael", "Hahsler", role = c("aut", "cre", "cph"),
 		email = "[email protected]"),

diff --git a/NAMESPACE b/NAMESPACE
@@ -72,6 +72,7 @@ exportMethods(
   "aggregate",
   "abbreviate",
   "addComplement",
+  "compatible",
   "coverage",
   "crossTable",
   "c", 

diff --git a/NEWS.md b/NEWS.md
@@ -1,6 +1,14 @@
 # arules 1.6-7.1 (xx/xx/2021)
 
+## New Feature
+* Added new method compatible to itemMatrix to check if the item coding  is compatible 
+  between two objects.
+* c() now produces a warning if two itemMatrices with different itemCoding are combined.
+* encode and recode accept now for itemLabels also objects with an itemLabels method.
+* recode is now also available for association (itemsets and rules). 
 
+## Changes
+* recode: parameter match is now deprecated.
 
 ## Bug Fixes
 * fixed addAggregate problem with character (reported by javiercoh).
@@ -69,7 +77,7 @@
 * discretizeDF now reports which column produces the problem.
 
 ## Changes
-* transactions: numeric columns are now discretized during coersion using discretizeDF (with a warning).
+* transactions: numeric columns are now discretized during coercion using discretizeDF (with a warning).
 
 ## Bug Fixes
 * The spurious warning for reaching maxlen in apriori is now removed (reported by Ryan J. Cole).
@@ -104,7 +112,7 @@
 # arules 1.5-5 (01/09/2018)
 
 ## New Features
-* Added (absolut support) "count" as an interest measure. 
+* Added (absolute support) "count" as an interest measure. 
 * itemLabels can now be assigned for rules and itemsets. 
 
 ## Bug Fixes
@@ -132,7 +140,7 @@
 
 ## Bug Fixes
 * Improved PROTECT placement in C source code.
-* itemMeasures for single rules/itemssets now returns a proper data.frame 
+* itemMeasures for single rules/itemsets now returns a proper data.frame 
    (reported by lordbitin).
 * itemMeasures: Added missing parentheses in kappa calculation and fixed
     equation for least contradiction (reported by Feng Chen). 
@@ -252,7 +260,7 @@
 * subset extraction: added checks, handles now NAs and recycles for logical.
 * read.transactions gained arguments skip and quote and some defaults for
   read and write (uses now quotes and no rownames by default) have changed.
-* itemMatrix: coersion from matrix checks now for 0-1 matrix with a warning.
+* itemMatrix: coercion from matrix checks now for 0-1 matrix with a warning.
 * APRIORI and ECLAT report now absolute minimum support.
 * APRIORI: out-of-memory while rule building does now result in an error and
   not a memory fault.

diff --git a/R/AllGenerics.R b/R/AllGenerics.R
@@ -45,6 +45,9 @@ setGeneric("DATAFRAME",
 setGeneric("addComplement",
     function(x, labels, complementLabels=NULL) standardGeneric("addComplement"))
 
+setGeneric("compatible",
+  function(x, y) standardGeneric("compatible"))
+
 setGeneric("coverage",
     function(x, transactions = NULL, reuse = TRUE) standardGeneric("coverage"))
 

diff --git a/R/itemCoding.R b/R/itemCoding.R
@@ -36,7 +36,9 @@ setMethod("encode", signature(x = "character"),
         ## regular encoding
         r <- which(itemLabels %in% x)
         if (length(r) < length(x))
-            stop("Unknown item label(s) in ", deparse(x))
+            warning("The following item labels are not available in itemLabels: ",
+                paste(setdiff(x, itemLabels), collapse = ", "), 
+                "\nItems with missing labels are dropped!", call. = FALSE)
         r
     }
 )
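
The hunk above changes `encode` for character input from stopping on unknown labels to warning and dropping them. A minimal base-R sketch of that behavior follows. The helper `encode_labels` and the example labels are hypothetical and are not part of arules; the sketch also tests for unknown labels directly instead of comparing lengths, which is a simplification.

```r
# Hypothetical helper that mirrors the logic of the hunk above; not arules code.
encode_labels <- function(x, item_labels) {
  # Item IDs are positions in item_labels, returned in increasing order.
  ids <- which(item_labels %in% x)
  unknown <- setdiff(x, item_labels)
  if (length(unknown) > 0)
    warning("The following item labels are not available in itemLabels: ",
            paste(unknown, collapse = ", "),
            "\nItems with missing labels are dropped!", call. = FALSE)
  ids
}

item_labels <- c("bread", "butter", "milk")

# All labels are known: the result holds their positions in item_labels.
known_ids <- encode_labels(c("milk", "bread"), item_labels)

# One label is unknown: a warning is raised and that item is dropped.
dropped_ids <- suppressWarnings(encode_labels(c("milk", "beer"), item_labels))
```

Because unknown labels no longer stop execution, callers that need strict checking would have to test for missing labels themselves or turn the warning into an error.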
@@ -47,27 +49,32 @@ setMethod("encode", signature(x = "numeric"),
         if (itemMatrix == TRUE) 
             return(encode(list(x), itemLabels, itemMatrix == TRUE))
 
-
         ## handle empty sets
         if (length(x)==0) return(integer(0))
 
         ## regular encoding
         r <- range(x)
         if (r[1] < 1 || r[2] > length(itemLabels))
-            stop("Invalid range in ", deparse(x))
+            stop("Invalid item ID in ", deparse(x), call. = FALSE)
+
+        ## deal with numeric
         if (!is.integer(x)) {
-            if (!all.equal(x, (i <- as.integer(x))))
-                stop("Invalid numeric values in ", deparse(x))
-            i
-        } else
-            x
+            if (any(x %% 1 != 0))
+                stop("Invalid item ID (needs to be integer) in ", deparse(x), call. = FALSE)
+            x <- as.integer(x)
+        }
+        x
     }
 )
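
The numeric hunk above replaces `!all.equal(x, as.integer(x))` with a modulo test. One likely reason is that `all.equal` returns a character description, not `FALSE`, when the values differ, and negating a string with `!` is an error in R. The base-R sketch below illustrates the difference; the helper `is_whole_number` is hypothetical and only mirrors the new check.

```r
# Hypothetical helper; mirrors the modulo check used in the hunk above.
is_whole_number <- function(x) !any(x %% 1 != 0)

whole <- c(1, 2, 3)        # doubles that hold integer values
fractional <- c(1.5, 2)    # contains a non-integer value

# all.equal() returns TRUE on success but a character description on failure,
# so its result cannot be negated directly without wrapping it in isTRUE().
old_check_ok  <- all.equal(whole, as.integer(whole))
old_check_bad <- all.equal(fractional, as.integer(fractional))

# The modulo test always yields a single logical value.
whole_ok <- is_whole_number(whole)
fractional_ok <- is_whole_number(fractional)
```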
 
 ## NOTES this is less error prone than creating ngCMatrix
 ##       directly in internal code.
 setMethod("encode", signature(x = "list"),
     function(x, itemLabels, itemMatrix = TRUE) {
+        if(is(itemLabels, "itemMatrix") || 
+                is(itemLabels, "association")) itemLabels <- itemLabels(itemLabels)
+
+        # this calls encode for character
         i <- lapply(x, encode, itemLabels, itemMatrix = FALSE)
         if (itemMatrix == FALSE) 
             return(i)
@@ -99,22 +106,32 @@ setMethod("encode", signature(x = "list"),
 ## recode to make compatible
 setMethod("recode", signature(x = "itemMatrix"),
     function(x, itemLabels = NULL, match = NULL) {
+
+        ### FIXME: Deprecated
+        if(!is.null(match)) message("recode: parameter 'match' is deprecated. Use 'itemLabels' instead.")
+
         if(!is.null(itemLabels) && !is.null(match))
             stop("'match' and 'itemLabels' cannot both be specified")
         if(is.null(itemLabels)) 
             if(is.null(match))  stop("Either 'match' or 'itemLabels' has to be specified")
         else itemLabels <- itemLabels(match)            
+        ### END 
 
+        if(is(itemLabels, "itemMatrix") || 
+                is(itemLabels, "association")) itemLabels <- itemLabels(itemLabels)
+
+	    ## nothing to do
+	    if(identical(itemLabels(x), itemLabels)) return(x)
+
         k <- match(itemLabels(x), itemLabels)
         if (any(is.na(k)))
-            stop ("All item labels in x must be contained in ",
-                  "'itemLabels' or 'match'.")
+            stop ("All item labels in x must be contained in 'itemLabels'.", call. = FALSE)
 
         ## recode items
         if (any(k != seq(length(k))))
             x@data <- .Call(R_recode_ngCMatrix, x@data, k)
 
-        ## enlarge
+        ## enlarge matrix for additional items
         if (x@data@Dim[1] <  length(itemLabels))
             x@data@Dim[1] <- length(itemLabels)
 
@@ -129,4 +146,27 @@ setMethod("recode", signature(x = "itemMatrix"),
     }
 )	
 
+setMethod("recode", signature(x = "itemsets"),
+    function(x, itemLabels = NULL, match = NULL) {
+        x@items <- recode(x@items, itemLabels, match)
+        x
+    }
+)
+
+setMethod("recode", signature(x = "rules"),
+    function(x, itemLabels = NULL, match = NULL) {
+        x@lhs <- recode(x@lhs, itemLabels, match)
+        x@rhs <- recode(x@rhs, itemLabels, match)
+        x
+    }
+)   
+
+setMethod("compatible", signature(x = "itemMatrix"),
+    function(x, y) identical(itemLabels(x), itemLabels(y))
+)
+
+setMethod("compatible", signature(x = "associations"),
+    function(x, y) identical(itemLabels(x), itemLabels(y))
+)
+
 ###
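
The new `compatible` methods compare item label vectors with `identical`, and `recode` maps old item IDs to new ones through `match`. The base-R sketch below walks through that mapping on plain vectors. All variable names and labels are invented for illustration, and the sketch does not call arules or its C code.

```r
# Two item codings: the position of a label in the vector is its item ID.
labels_x <- c("a", "b", "c")
labels_y <- c("b", "c", "d")

# Same test as the new compatible() methods: identical label vectors.
is_compatible <- identical(labels_x, labels_y)

# A common coding that contains every label of both objects.
common <- union(labels_x, labels_y)

# k[i] is the new ID of the item that has ID i in y's coding.
k <- match(labels_y, common)

# An itemset from y stored as item IDs in y's coding: {b, d}.
itemset_y <- c(1L, 3L)
recoded <- k[itemset_y]

# The labels are unchanged even though the IDs differ.
labels_before <- labels_y[itemset_y]
labels_after <- common[recoded]
```

Here the target coding contains every label of `labels_y`, so `match` returns no `NA`; in the diff above, `recode` stops with an error when a label of `x` is missing from `itemLabels`.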
diff --git a/R/itemMatrix.R b/R/itemMatrix.R
@@ -133,12 +133,9 @@ setAs("itemMatrix", "list",
 
 setMethod("LIST", signature(from = "itemMatrix"),
   function(from, decode = TRUE) {
-    if (decode) {
-      to <- .Call(R_asList_ngCMatrix, from@data, itemLabels(from))
-      names(to) <- itemsetInfo(from)[["itemsetID"]]
-      to
-    } else
-      .Call(R_asList_ngCMatrix, from@data, NULL)
+    l <- .Call(R_asList_ngCMatrix, from@data, if(decode) itemLabels(from) else NULL)
+    if(decode) names(l) <- itemsetInfo(from)[["itemsetID"]]
+    l
   }
 )
 
@@ -331,21 +328,30 @@ setMethod("c", signature(x = "itemMatrix"),
     for (y in args) {
       if (!is(y, "itemMatrix"))
         stop("can only combine itemMatrix")
+
       x@itemsetInfo <- .combineMeta(x, y, "itemsetInfo")
-      k <- match(itemLabels(y), itemLabels(x))
-      n <- which(is.na(k))
-      if (length(n)) {
-        k[n] <- x@data@Dim[1] + seq(length(n))
-        x@data@Dim[1] <- x@data@Dim[1] + length(n)
-        x@itemInfo <- rbind(x@itemInfo, 
-          y@itemInfo[n,, drop = FALSE])
+
+      if(!compatible(x, y)) {
+        warning("Item coding not compatible, recoding item matrices.")
+
+        # expand x if y has additional items
+        k <- match(itemLabels(y), itemLabels(x))
+        n <- which(is.na(k))
+        if (length(n)) {
+          k[n] <- x@data@Dim[1] + seq(length(n))
+          x@data@Dim[1] <- x@data@Dim[1] + length(n)
+          x@itemInfo <- rbind(x@itemInfo, 
+            y@itemInfo[n,, drop = FALSE])
+        }
+
+        # recode y to match x
+        if (any(k != seq_len(length(k))))
+          y@data <- .Call(R_recode_ngCMatrix, y@data, k)
+        if (y@data@Dim[1] <  x@data@Dim[1])
+          y@data@Dim[1] <- x@data@Dim[1]
       }
-      if (any(k != seq_len(length(k))))
-        y@data <- .Call(R_recode_ngCMatrix, y@data, k)
-      if (y@data@Dim[1] <  x@data@Dim[1])
-        y@data@Dim[1] <- x@data@Dim[1]
 
-      ## this is faste than x@data <- cbind(x@data, y@data)
+      ## this is faster than x@data <- cbind(x@data, y@data)
       x@data <- .Call(R_cbind_ngCMatrix, x@data, y@data)
     }
     validObject(x, complete = TRUE)
@@ -396,16 +402,22 @@ setMethod("unique", signature(x = "itemMatrix"),
 ## and uses more efficient prefix tree C code
 setMethod("match", signature(x = "itemMatrix", table = "itemMatrix"),
   function(x, table, nomatch = NA_integer_, incomparables = NULL) {
-    k <- match(itemLabels(x), itemLabels(table))
-    n <- which(is.na(k))
-    if (length(n)) {
-      k[n] <- table@data@Dim[1] + seq(length(n))
-      table@data@Dim[1] <- table@data@Dim[1] + length(n)
+
+    if(!compatible(x, table)) {
+      warning("Item coding not compatible, recoding item matrices first.")
+
+      k <- match(itemLabels(x), itemLabels(table))
+      n <- which(is.na(k))
+      if (length(n)) {
+        k[n] <- table@data@Dim[1] + seq(length(n))
+        table@data@Dim[1] <- table@data@Dim[1] + length(n)
+      }
+      if (any(k != seq_len(length(k))))
+        x@data <- .Call(R_recode_ngCMatrix, x@data, k)
+      if (x@data@Dim[1] <  table@data@Dim[1])
+        x@data@Dim[1] <- table@data@Dim[1]
     }
-    if (any(k != seq_len(length(k))))
-      x@data <- .Call(R_recode_ngCMatrix, x@data, k)
-    if (x@data@Dim[1] <  table@data@Dim[1])
-      x@data@Dim[1] <- table@data@Dim[1]
+
     i <- .Call(R_pnindex, table@data, x@data, FALSE)
     match(i, seq_len(length(table)), nomatch = nomatch, 
       incomparables = incomparables)

diff --git a/R/transactions.R b/R/transactions.R
@@ -59,7 +59,7 @@ setMethod("LIST", signature(from = "transactions"),
   function(from, decode = TRUE) {
     l <- LIST(as(from, "itemMatrix"), decode)
     if(decode) names(l) <- transactionInfo(from)$transactionID
-  l  
+    l  
   })
 
 setAs("data.frame", "transactions",

diff --git a/README.md b/README.md
@@ -164,5 +164,4 @@ Questions should be posted on [stackoverflow and tagged with arules](https://sta
 
 * Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta. [The arules R-package ecosystem: Analyzing interesting patterns from large transaction datasets.](https://jmlr.csail.mit.edu/papers/v12/hahsler11a.html) _Journal of Machine Learning Research,_ 12:1977-1981, 2011.
 * Michael Hahsler, Bettina Gr&uuml;n and Kurt Hornik. [arules - A Computational Environment for Mining Association Rules and Frequent Item Sets.](https://dx.doi.org/10.18637/jss.v014.i15) _Journal of Statistical Software,_ 14(15), 2005.
-* Hahsler, Michael. 
-[A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules](https://michael.hahsler.net/research/association_rules/measures.html), 2015, URL: https://michael.hahsler.net/research/association_rules/measures.html.
+* Hahsler, Michael. [A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules](https://michael.hahsler.net/research/association_rules/measures.html), 2015, URL: https://michael.hahsler.net/research/association_rules/measures.html.
diff --git a/man/Mushroom.Rd b/man/Mushroom.Rd
@@ -9,7 +9,7 @@
   It contains 
   information about 8124 mushrooms (transactions). 
   4208 (51.8\%) are edible and 3916 (48.2\%)
-  are poisonous. The data contains 22 nomoinal features plus the class attribure 
+  are poisonous. The data contains 22 nominal features plus the class attribute 
   (edible or not). These features were translated into 114 items. 
 
 }

diff --git a/man/apriori.Rd b/man/apriori.Rd
@@ -16,7 +16,7 @@ apriori(data, parameter = NULL, appearance = NULL, control = NULL)
     \code{\linkS4class{transactions}} or any data structure
     which can be coerced into
     \code{\linkS4class{transactions}} (e.g., a binary
-    matrix or data.frame).}
+    matrix or a data.frame).}
   \item{parameter}{object of class
     \code{\linkS4class{APparameter}} or named list.
     The default behavior is to mine rules with minimum support of 0.1,
@@ -33,7 +33,10 @@ apriori(data, parameter = NULL, appearance = NULL, control = NULL)
     algorithm (item sorting, report progress (verbose), etc.)}
 }
 \details{
-  \bold{Automatic conversion to transactions.}
+  \bold{Warning about automatic conversion of matrices or data.frames to transactions.}
+  It is preferred to coerce data to transactions manually before calling \code{apriori} to have control over item coding. This is especially important when you are working with multiple datasets or several subsets of the same dataset. To read about item coding, see
+  \code{\link{itemCoding}}.
+
   If a data.frame is specified as \code{x}, then the data is automatically converted 
   into transactions by discretizing numeric data using \code{discretizeDF} and then 
   coercion to transactions. The discretization may fail if the data is not well behaved.
@@ -99,6 +102,10 @@ apriori(data, parameter = NULL, appearance = NULL, control = NULL)
 \author{Michael Hahsler and Bettina Gruen}
 \examples{
 data("Adult")
+## Note: Adult is already a transactions dataset. If you are using a data.frame, then 
+##       you should first coerce it to transactions using:
+##                    yourTrans <- as(yourData, "transactions")
+
 ## Mine association rules.
 rules <- apriori(Adult, 
 	parameter = list(supp = 0.5, conf = 0.9, target = "rules"))
@@ -108,6 +115,7 @@ summary(rules)
   \code{\link{APparameter-class}},
   \code{\link{APcontrol-class}},
   \code{\link{APappearance-class}},
+  \code{\link{itemCoding}},
   \code{\link{transactions-class}},
   \code{\link{itemsets-class}},
   \code{\link{rules-class}}
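
The documentation change above recommends coercing data to transactions before calling `apriori`, so that the item coding is under the caller's control. A minimal sketch of that workflow follows, assuming the arules package is installed. The data frame, its column names, and the thresholds are made up for illustration.

```r
library(arules)

# Toy data invented for this example; all columns are factors.
weather_df <- data.frame(
  outlook = factor(c("sunny", "rainy", "sunny", "sunny")),
  play    = factor(c("yes", "no", "yes", "no"))
)

# Coerce explicitly so the item coding can be inspected before mining.
trans <- as(weather_df, "transactions")
item_coding <- itemLabels(trans)

# Mine rules on the transactions object rather than on the raw data frame.
rules <- apriori(trans,
  parameter = list(supp = 0.25, conf = 0.9, target = "rules"),
  control = list(verbose = FALSE))
```

Keeping `item_coding` around makes it possible to encode or recode other subsets against the same labels later.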

diff --git a/man/associations-class.Rd b/man/associations-class.Rd
@@ -29,7 +29,7 @@
   associations.
 }
 \details{
-The implementations of \code{associations} store itemsets (e.g., the LHS and RHS of a rule) as objects of class \code{\link{itemMatrix}} (i.e., sparse binary matrices). Quality measures (e.g., support) are stored in a data.frame accessable via method \code{quality}.
+The implementations of \code{associations} store itemsets (e.g., the LHS and RHS of a rule) as objects of class \code{\link{itemMatrix}} (i.e., sparse binary matrices). Quality measures (e.g., support) are stored in a data.frame accessible via method \code{quality}.
 
 Associations can store multisets with duplicated
 elements. Duplicated elements can result from combining several sets of associations. 

diff --git a/man/crossTable.Rd b/man/crossTable.Rd
@@ -14,9 +14,9 @@ crossTable(x, ...)
 \arguments{
   \item{x}{ object to be cross-tabulated 
     (\code{transactions} or \code{itemMatrix}).}
-  \item{measure}{ measure to return. Default is co-occurence counts. } 
+  \item{measure}{ measure to return. Default is co-occurrence counts. } 
   \item{sort}{ sort the items by support. } 
-  \item{...}{ aditional arguments. } 
+  \item{...}{ additional arguments. } 
 }
 \value{
   A symmetric matrix of n time n, where n is the number of items times 

diff --git a/man/discretize.Rd b/man/discretize.Rd
@@ -54,7 +54,7 @@ Discretize calculates breaks between intervals using various methods and then us
 Discretization may fail for several reasons. Some reasons are
 \itemize{
 \item A variable contains only a single value. In this case, the variable should be dropped or directly converted into a factor with a single level (see \code{\link{factor}}).
-\item Some caclulated breaks are not unique. This can happen for method frequency with very skewed data (e.g., a large portion of the values is 0). In this case, non-unique breaks are dropped with a warning. It would be probably better to look at the histogram of the data and decide on breaks for the method fixed.
+\item Some calculated breaks are not unique. This can happen for method frequency with very skewed data (e.g., a large portion of the values is 0). In this case, non-unique breaks are dropped with a warning. It would be probably better to look at the histogram of the data and decide on breaks for the method fixed.
 }
 
 \code{discretize} only implements unsupervised discretization. See 

diff --git a/man/is.superset.Rd b/man/is.superset.Rd
@@ -22,7 +22,7 @@ is.superset(x, y = NULL, proper = FALSE, sparse = TRUE, ...)
     the super or subset structure within set \code{x} is calculated.}
   \item{proper}{a logical indicating if all or just proper super or subsets.}
   \item{sparse}{a logical indicating if a sparse (ngCMatrix) rather than a 
-  dense logical matrix sgould be returned. Sparse computation 
+  dense logical matrix should be returned. Sparse computation 
   preserves a significant amount of memory and is much faster for large sets.}
   \item{\dots}{ currently unused.}
 }