Commit

* Added new method compatible to itemMatrix to check if the item coding is compatible between two objects.

* c() now produces a warning if two itemMatrices with different itemCoding are combined.
* encode and recode accept now for itemLabels also objects with an itemLabels method.
* recode is now also available for association (itemsets and rules).
* recode: parameter match is now deprecated.
* Fixed some TYPOs.
* Added item hierarchy and item coding to vignette.
mhahsler committed May 15, 2021
1 parent 863d5d6 commit 7d6cd0c
Showing 23 changed files with 379 additions and 125 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: arules
Version: 1.6-7
Date: 2021-03-12
Version: 1.6-7.1
Date: 2021-xx-xx
Title: Mining Association Rules and Frequent Itemsets
Authors@R: c(person("Michael", "Hahsler", role = c("aut", "cre", "cph"),
email = "[email protected]"),
1 change: 1 addition & 0 deletions NAMESPACE
@@ -72,6 +72,7 @@ exportMethods(
"aggregate",
"abbreviate",
"addComplement",
"compatible",
"coverage",
"crossTable",
"c",
16 changes: 12 additions & 4 deletions NEWS.md
@@ -1,6 +1,14 @@
# arules 1.6-7.1 (xx/xx/2021)

## New Feature
* Added new method compatible to itemMatrix to check if the item coding is compatible
between two objects.
* c() now produces a warning if two itemMatrices with different itemCoding are combined.
* encode and recode accept now for itemLabels also objects with an itemLabels method.
* recode is now also available for association (itemsets and rules).

## Changes
* recode: parameter match is now deprecated.

## Bug Fixes
* fixed addAggregate problem with character (reported by javiercoh).
@@ -69,7 +77,7 @@
* discretizeDF now reports which column produces the problem.

## Changes
* transactions: numeric columns are now discretized during coersion using discretizeDF (with a warning).
* transactions: numeric columns are now discretized during coercion using discretizeDF (with a warning).

## Bug Fixes
* The spurious warning for reaching maxlen in apriori is now removed (reported by Ryan J. Cole).
@@ -104,7 +112,7 @@
# arules 1.5-5 (01/09/2018)

## New Features
* Added (absolut support) "count" as an interest measure.
* Added (absolute support) "count" as an interest measure.
* itemLabels can now be assigned for rules and itemsets.

## Bug Fixes
@@ -132,7 +140,7 @@

## Bug Fixes
* Improved PROTECT placement in C source code.
* itemMeasures for single rules/itemssets now returns a proper data.frame
* itemMeasures for single rules/itemsets now returns a proper data.frame
(reported by lordbitin).
* itemMeasures: Added missing parentheses in kappa calculation and fixed
equation for least contradiction (reported by Feng Chen).
@@ -252,7 +260,7 @@
* subset extraction: added checks, handles now NAs and recycles for logical.
* read.transactions gained arguments skip and quote and some defaults for
read and write (uses now quotes and no rownames by default) have changed.
* itemMatrix: coersion from matrix checks now for 0-1 matrix with a warning.
* itemMatrix: coercion from matrix checks now for 0-1 matrix with a warning.
* APRIORI and ECLAT report now absolute minimum support.
* APRIORI: out-of-memory while rule building does now result in an error and
not a memory fault.
3 changes: 3 additions & 0 deletions R/AllGenerics.R
@@ -45,6 +45,9 @@ setGeneric("DATAFRAME",
setGeneric("addComplement",
function(x, labels, complementLabels=NULL) standardGeneric("addComplement"))

setGeneric("compatible",
function(x, y) standardGeneric("compatible"))

setGeneric("coverage",
function(x, transactions = NULL, reuse = TRUE) standardGeneric("coverage"))

62 changes: 51 additions & 11 deletions R/itemCoding.R
@@ -36,7 +36,9 @@ setMethod("encode", signature(x = "character"),
## regular encoding
r <- which(itemLabels %in% x)
if (length(r) < length(x))
stop("Unknown item label(s) in ", deparse(x))
warning("The following item labels are not available in itemLabels: ",
paste(setdiff(x, itemLabels), collapse = ", "),
"\nItems with missing labels are dropped!", call. = FALSE)
r
}
)
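The change from `stop()` to `warning()` in the character method can be illustrated without arules. This is a sketch only: `encode_labels` is a hypothetical helper that copies the logic of the hunk above.

```r
# Hypothetical helper mirroring the updated character branch of encode().
encode_labels <- function(x, itemLabels) {
  ids <- which(itemLabels %in% x)
  if (length(ids) < length(x))
    warning("The following item labels are not available in itemLabels: ",
      paste(setdiff(x, itemLabels), collapse = ", "),
      "\nItems with missing labels are dropped!", call. = FALSE)
  ids
}

labels <- c("beer", "bread", "milk")

# IDs come back in the order of itemLabels, not in the order of x.
known <- encode_labels(c("milk", "beer"), labels)

# An unknown label is now dropped with a warning instead of raising an error.
partial <- suppressWarnings(encode_labels(c("milk", "wine"), labels))
```

Note that the length comparison also triggers when `x` contains duplicated labels, since `which()` reports each matching position once.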
@@ -47,27 +49,32 @@ setMethod("encode", signature(x = "numeric"),
if (itemMatrix == TRUE)
return(encode(list(x), itemLabels, itemMatrix == TRUE))


## handle empty sets
if (length(x)==0) return(integer(0))

## regular encoding
r <- range(x)
if (r[1] < 1 || r[2] > length(itemLabels))
stop("Invalid range in ", deparse(x))
stop("Invalid item ID in ", deparse(x), call. = FALSE)

## deal with numeric
if (!is.integer(x)) {
if (!all.equal(x, (i <- as.integer(x))))
stop("Invalid numeric values in ", deparse(x))
i
} else
x
if (any(x %% 1 != 0))
stop("Invalid item ID (needs to be integer) in ", deparse(x), call. = FALSE)
x <- as.integer(x)
}
x
}
)
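The replaced integer check is worth a closer look: `all.equal()` returns a character description rather than `FALSE` when the values differ, so negating its result does not work as a test. A base R sketch of the new validation follows; `as_item_ids` and `n_items` are hypothetical names used only for illustration.

```r
# Hypothetical helper following the numeric branch of encode() shown above.
as_item_ids <- function(x, n_items) {
  # handle empty sets
  if (length(x) == 0) return(integer(0))

  # IDs must fall into 1..n_items
  r <- range(x)
  if (r[1] < 1 || r[2] > n_items)
    stop("Invalid item ID in ", deparse(x), call. = FALSE)

  # doubles are accepted only if they are whole numbers
  if (!is.integer(x)) {
    if (any(x %% 1 != 0))
      stop("Invalid item ID (needs to be integer) in ", deparse(x), call. = FALSE)
    x <- as.integer(x)
  }
  x
}

ids <- as_item_ids(c(1, 3), n_items = 3)

# On a mismatch all.equal() yields a message string, not FALSE.
cmp <- all.equal(c(1, 2.5), as.integer(c(1, 2.5)))
is.character(cmp)
```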

## NOTES this is less error prone than creating ngCMatrix
## directly in internal code.
setMethod("encode", signature(x = "list"),
function(x, itemLabels, itemMatrix = TRUE) {
if(is(itemLabels, "itemMatrix") ||
is(itemLabels, "association")) itemLabels <- itemLabels(itemLabels)

# this calls encode for character
i <- lapply(x, encode, itemLabels, itemMatrix = FALSE)
if (itemMatrix == FALSE)
return(i)
@@ -99,22 +106,32 @@ setMethod("encode", signature(x = "list"),
## recode to make compatible
setMethod("recode", signature(x = "itemMatrix"),
function(x, itemLabels = NULL, match = NULL) {

### FIXME: Deprecated
if(!is.null(match)) message("recode: parameter 'match' is deprecated. Use 'itemLabels' instead.")

if(!is.null(itemLabels) && !is.null(match))
stop("'match' and 'itemLabels' cannot both be specified")
if(is.null(itemLabels))
if(is.null(match)) stop("Either 'match' or 'itemLabels' has to be specified")
else itemLabels <- itemLabels(match)
### END

if(is(itemLabels, "itemMatrix") ||
is(itemLabels, "association")) itemLabels <- itemLabels(itemLabels)

## nothing to do
if(identical(itemLabels(x), itemLabels)) return(x)

k <- match(itemLabels(x), itemLabels)
if (any(is.na(k)))
stop ("All item labels in x must be contained in ",
"'itemLabels' or 'match'.")
stop ("All item labels in x must be contained in 'itemLabels'.", call. = FALSE)

## recode items
if (any(k != seq(length(k))))
x@data <- .Call(R_recode_ngCMatrix, x@data, k)

## enlarge
## enlarge matrix for additional items
if (x@data@Dim[1] < length(itemLabels))
x@data@Dim[1] <- length(itemLabels)

@@ -129,4 +146,27 @@ setMethod("recode", signature(x = "itemMatrix"),
}
)

setMethod("recode", signature(x = "itemsets"),
function(x, itemLabels = NULL, match = NULL) {
x@items <- recode(x@items, itemLabels, match)
x
}
)

setMethod("recode", signature(x = "rules"),
function(x, itemLabels = NULL, match = NULL) {
x@lhs <- recode(x@lhs, itemLabels, match)
x@rhs <- recode(x@rhs, itemLabels, match)
x
}
)

setMethod("compatible", signature(x = "itemMatrix"),
function(x, y) identical(itemLabels(x), itemLabels(y))
)

setMethod("compatible", signature(x = "associations"),
function(x, y) identical(itemLabels(x), itemLabels(y))
)

###
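The core of `recode` is a `match()` from the old label vector into the new one, followed by a remapping of item IDs. The sketch below shows that idea on plain lists of integer IDs; `recode_ids` is a hypothetical helper, and the sparse matrix handling done in C by the package is deliberately left out.

```r
# Hypothetical helper: re-express item IDs under a new label vector.
recode_ids <- function(sets, old_labels, new_labels) {
  # nothing to do
  if (identical(old_labels, new_labels)) return(sets)

  k <- match(old_labels, new_labels)
  if (anyNA(k))
    stop("All item labels in x must be contained in 'itemLabels'.", call. = FALSE)

  # position i in the old coding becomes position k[i] in the new coding
  lapply(sets, function(ids) sort(k[ids]))
}

old_labels <- c("bread", "milk")
new_labels <- c("beer", "bread", "milk")
sets <- list(c(1L, 2L), 2L)

recoded <- recode_ids(sets, old_labels, new_labels)

# Decoding with the matching label vector gives the same items before and after.
before <- lapply(sets, function(ids) old_labels[ids])
after  <- lapply(recoded, function(ids) new_labels[ids])
```

The new label vector may contain additional items, but every old label has to be present in it, which is the condition the error message in the diff refers to.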
66 changes: 39 additions & 27 deletions R/itemMatrix.R
@@ -133,12 +133,9 @@ setAs("itemMatrix", "list",

setMethod("LIST", signature(from = "itemMatrix"),
function(from, decode = TRUE) {
if (decode) {
to <- .Call(R_asList_ngCMatrix, from@data, itemLabels(from))
names(to) <- itemsetInfo(from)[["itemsetID"]]
to
} else
.Call(R_asList_ngCMatrix, from@data, NULL)
l <- .Call(R_asList_ngCMatrix, from@data, if(decode) itemLabels(from) else NULL)
if(decode) names(l) <- itemsetInfo(from)[["itemsetID"]]
l
}
)

@@ -331,21 +328,30 @@ setMethod("c", signature(x = "itemMatrix"),
for (y in args) {
if (!is(y, "itemMatrix"))
stop("can only combine itemMatrix")

x@itemsetInfo <- .combineMeta(x, y, "itemsetInfo")
k <- match(itemLabels(y), itemLabels(x))
n <- which(is.na(k))
if (length(n)) {
k[n] <- x@data@Dim[1] + seq(length(n))
x@data@Dim[1] <- x@data@Dim[1] + length(n)
x@itemInfo <- rbind(x@itemInfo,
y@itemInfo[n,, drop = FALSE])

if(!compatible(x, y)) {
warning("Item coding not compatible, recoding item matrices.")

# expand x if y has additional items
k <- match(itemLabels(y), itemLabels(x))
n <- which(is.na(k))
if (length(n)) {
k[n] <- x@data@Dim[1] + seq(length(n))
x@data@Dim[1] <- x@data@Dim[1] + length(n)
x@itemInfo <- rbind(x@itemInfo,
y@itemInfo[n,, drop = FALSE])
}

# recode y to match x
if (any(k != seq_len(length(k))))
y@data <- .Call(R_recode_ngCMatrix, y@data, k)
if (y@data@Dim[1] < x@data@Dim[1])
y@data@Dim[1] <- x@data@Dim[1]
}
if (any(k != seq_len(length(k))))
y@data <- .Call(R_recode_ngCMatrix, y@data, k)
if (y@data@Dim[1] < x@data@Dim[1])
y@data@Dim[1] <- x@data@Dim[1]

## this is faste than x@data <- cbind(x@data, y@data)
## this is faster than x@data <- cbind(x@data, y@data)
x@data <- .Call(R_cbind_ngCMatrix, x@data, y@data)
}
validObject(x, complete = TRUE)
@@ -396,16 +402,22 @@ setMethod("unique", signature(x = "itemMatrix"),
## and uses more efficient prefix tree C code
setMethod("match", signature(x = "itemMatrix", table = "itemMatrix"),
function(x, table, nomatch = NA_integer_, incomparables = NULL) {
k <- match(itemLabels(x), itemLabels(table))
n <- which(is.na(k))
if (length(n)) {
k[n] <- table@data@Dim[1] + seq(length(n))
table@data@Dim[1] <- table@data@Dim[1] + length(n)

if(!compatible(x, table)) {
warning("Item coding not compatible, recoding item matrices first.")

k <- match(itemLabels(x), itemLabels(table))
n <- which(is.na(k))
if (length(n)) {
k[n] <- table@data@Dim[1] + seq(length(n))
table@data@Dim[1] <- table@data@Dim[1] + length(n)
}
if (any(k != seq_len(length(k))))
x@data <- .Call(R_recode_ngCMatrix, x@data, k)
if (x@data@Dim[1] < table@data@Dim[1])
x@data@Dim[1] <- table@data@Dim[1]
}
if (any(k != seq_len(length(k))))
x@data <- .Call(R_recode_ngCMatrix, x@data, k)
if (x@data@Dim[1] < table@data@Dim[1])
x@data@Dim[1] <- table@data@Dim[1]

i <- .Call(R_pnindex, table@data, x@data, FALSE)
match(i, seq_len(length(table)), nomatch = nomatch,
incomparables = incomparables)
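Both `c()` and `match()` now recode only when the two codings differ. The label bookkeeping in the `c()` hunk can be sketched on plain character vectors as follows; `merge_coding` is a hypothetical helper, and the returned `map` plays the role of `k` in the diff.

```r
# Hypothetical helper: keep x's label order and append items that only y has.
merge_coding <- function(x_labels, y_labels) {
  if (identical(x_labels, y_labels))
    return(list(labels = x_labels, map = seq_along(y_labels)))

  warning("Item coding not compatible, recoding item matrices.")

  # expand x if y has additional items
  k <- match(y_labels, x_labels)
  n <- which(is.na(k))
  if (length(n)) {
    k[n] <- length(x_labels) + seq_along(n)
    x_labels <- c(x_labels, y_labels[n])
  }

  # map[i] is the new ID of y's i-th item
  list(labels = x_labels, map = k)
}

res <- suppressWarnings(merge_coding(c("bread", "milk"), c("milk", "beer")))
res$labels
res$map
```

When the label vectors are already identical, the helper returns early without a warning, which corresponds to the branch that the `compatible()` guard now skips.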
2 changes: 1 addition & 1 deletion R/transactions.R
@@ -59,7 +59,7 @@ setMethod("LIST", signature(from = "transactions"),
function(from, decode = TRUE) {
l <- LIST(as(from, "itemMatrix"), decode)
if(decode) names(l) <- transactionInfo(from)$transactionID
l
l
})

setAs("data.frame", "transactions",
3 changes: 1 addition & 2 deletions README.md
@@ -164,5 +164,4 @@ Questions should be posted on [stackoverflow and tagged with arules](https://sta

* Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta. [The arules R-package ecosystem: Analyzing interesting patterns from large transaction datasets.](https://jmlr.csail.mit.edu/papers/v12/hahsler11a.html) _Journal of Machine Learning Research,_ 12:1977-1981, 2011.
* Michael Hahsler, Bettina Gr&uuml;n and Kurt Hornik. [arules - A Computational Environment for Mining Association Rules and Frequent Item Sets.](https://dx.doi.org/10.18637/jss.v014.i15) _Journal of Statistical Software,_ 14(15), 2005.
* Hahsler, Michael.
[A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules](https://michael.hahsler.net/research/association_rules/measures.html), 2015, URL: https://michael.hahsler.net/research/association_rules/measures.html.
* Hahsler, Michael. [A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules](https://michael.hahsler.net/research/association_rules/measures.html), 2015, URL: https://michael.hahsler.net/research/association_rules/measures.html.
2 changes: 1 addition & 1 deletion man/Mushroom.Rd
@@ -9,7 +9,7 @@
It contains
information about 8124 mushrooms (transactions).
4208 (51.8\%) are edible and 3916 (48.2\%)
are poisonous. The data contains 22 nomoinal features plus the class attribure
are poisonous. The data contains 22 nominal features plus the class attribute
(edible or not). These features were translated into 114 items.

}
12 changes: 10 additions & 2 deletions man/apriori.Rd
@@ -16,7 +16,7 @@ apriori(data, parameter = NULL, appearance = NULL, control = NULL)
\code{\linkS4class{transactions}} or any data structure
which can be coerced into
\code{\linkS4class{transactions}} (e.g., a binary
matrix or data.frame).}
matrix or a data.frame).}
\item{parameter}{object of class
\code{\linkS4class{APparameter}} or named list.
The default behavior is to mine rules with minimum support of 0.1,
@@ -33,7 +33,10 @@ apriori(data, parameter = NULL, appearance = NULL, control = NULL)
algorithm (item sorting, report progress (verbose), etc.)}
}
\details{
\bold{Automatic conversion to transactions.}
\bold{Warning about automatic conversion of matrices or data.frames to transactions.}
It is preferred to coerce data to transactions manually before calling \code{apriori} to have control over item coding. This is especially important when you are working with multiple datasets or several subsets of the same dataset. To read about item coding, see
\code{\link{itemCoding}}.

If a data.frame is specified as \code{data}, then the data is automatically converted
into transactions by discretizing numeric data using \code{discretizeDF} and then
coercion to transactions. The discretization may fail if the data is not well behaved.
@@ -99,6 +102,10 @@ apriori(data, parameter = NULL, appearance = NULL, control = NULL)
\author{Michael Hahsler and Bettina Gruen}
\examples{
data("Adult")
## Note: Adult is already a transactions dataset. If you are using a data.frame, then
## you should first coerce it to transactions using:
## yourTrans <- as(yourData, "transactions")
## Mine association rules.
rules <- apriori(Adult,
parameter = list(supp = 0.5, conf = 0.9, target = "rules"))
@@ -108,6 +115,7 @@ summary(rules)
\code{\link{APparameter-class}},
\code{\link{APcontrol-class}},
\code{\link{APappearance-class}},
\code{\link{itemCoding}},
\code{\link{transactions-class}},
\code{\link{itemsets-class}},
\code{\link{rules-class}}
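The warning added to the details section matters most when subsets of one dataset are converted separately. The following base R sketch shows why. It assumes equal-frequency binning into three intervals as a stand-in for the automatic discretization; `bin3` and the `age` column are hypothetical and do not use arules.

```r
# Assumed stand-in for automatic discretization: three equal-frequency bins.
bin3 <- function(x)
  cut(x, breaks = quantile(x, probs = 0:3 / 3), include.lowest = TRUE)

df <- data.frame(age = c(21:29, 41:49))

# Discretizing each subset on its own yields different interval labels,
# and therefore different items.
young <- bin3(df$age[1:9])
old   <- bin3(df$age[10:18])
identical(levels(young), levels(old))

# Discretizing once on the full data gives both subsets the same labels.
all_bins <- bin3(df$age)
identical(levels(all_bins[1:9]), levels(all_bins[10:18]))
```

Converting the full data once and subsetting afterwards keeps the item labels aligned, which is the kind of control over item coding that the manual coercion is meant to provide.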
2 changes: 1 addition & 1 deletion man/associations-class.Rd
@@ -29,7 +29,7 @@
associations.
}
\details{
The implementations of \code{associations} store itemsets (e.g., the LHS and RHS of a rule) as objects of class \code{\link{itemMatrix}} (i.e., sparse binary matrices). Quality measures (e.g., support) are stored in a data.frame accessable via method \code{quality}.
The implementations of \code{associations} store itemsets (e.g., the LHS and RHS of a rule) as objects of class \code{\link{itemMatrix}} (i.e., sparse binary matrices). Quality measures (e.g., support) are stored in a data.frame accessible via method \code{quality}.

Associations can store multisets with duplicated
elements. Duplicated elements can result from combining several sets of associations.
4 changes: 2 additions & 2 deletions man/crossTable.Rd
@@ -14,9 +14,9 @@ crossTable(x, ...)
\arguments{
\item{x}{ object to be cross-tabulated
(\code{transactions} or \code{itemMatrix}).}
\item{measure}{ measure to return. Default is co-occurence counts. }
\item{measure}{ measure to return. Default is co-occurrence counts. }
\item{sort}{ sort the items by support. }
\item{...}{ aditional arguments. }
\item{...}{ additional arguments. }
}
\value{
A symmetric matrix of n times n, where n is the number of items times
2 changes: 1 addition & 1 deletion man/discretize.Rd
@@ -54,7 +54,7 @@ Discretize calculates breaks between intervals using various methods and then us
Discretization may fail for several reasons. Some reasons are
\itemize{
\item A variable contains only a single value. In this case, the variable should be dropped or directly converted into a factor with a single level (see \code{\link{factor}}).
\item Some caclulated breaks are not unique. This can happen for method frequency with very skewed data (e.g., a large portion of the values is 0). In this case, non-unique breaks are dropped with a warning. It would be probably better to look at the histogram of the data and decide on breaks for the method fixed.
\item Some calculated breaks are not unique. This can happen for method frequency with very skewed data (e.g., a large portion of the values is 0). In this case, non-unique breaks are dropped with a warning. It would be probably better to look at the histogram of the data and decide on breaks for the method fixed.
}

\code{discretize} only implements unsupervised discretization. See
2 changes: 1 addition & 1 deletion man/is.superset.Rd
@@ -22,7 +22,7 @@ is.superset(x, y = NULL, proper = FALSE, sparse = TRUE, ...)
the super or subset structure within set \code{x} is calculated.}
\item{proper}{a logical indicating if all or just proper super or subsets.}
\item{sparse}{a logical indicating if a sparse (ngCMatrix) rather than a
dense logical matrix sgould be returned. Sparse computation
dense logical matrix should be returned. Sparse computation
preserves a significant amount of memory and is much faster for large sets.}
\item{\dots}{ currently unused.}
}