Title: | Nearest Neighbor Observation Imputation and Evaluation Tools |
---|---|
Description: | Performs nearest neighbor-based imputation using one or more alternative approaches to processing multivariate data. These include methods based on canonical correlation analysis, canonical correspondence analysis, and a multivariate adaptation of the random forest classification and regression techniques of Leo Breiman and Adele Cutler. Additional methods are also offered. The package includes functions for comparing the results from running alternative techniques, detecting imputation targets that are notably distant from reference observations, detecting and correcting for bias, bootstrapping and building ensemble imputations, and mapping results. |
Authors: | Jeffrey S. Evans [aut, cre] , Nicholas L. Crookston [aut], Andrew O. Finley [aut], John Coulston (Sunil Arya and David Mount for ANN) [ctb, com] |
Maintainer: | Jeffrey S. Evans <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0-35 |
Built: | 2025-01-07 06:15:34 UTC |
Source: | https://github.com/jeffreyevans/yaimpute |
Given a set of reference data points, ann constructs a kd-tree or box-decomposition tree (bd-tree) for efficient k-nearest neighbor searches.
ann(ref, target, k=1, eps=0.0, tree.type="kd", search.type="standard",
    bucket.size=1, split.rule="sl_midpt", shrink.rule="simple",
    verbose=TRUE, ...)
ref |
an n x d matrix containing the reference point set. Each row corresponds to a point in d-dimensional space. |
target |
an m x d matrix containing the target points for which the k nearest reference points are found. |
k |
defines the number of nearest neighbors to find. The default
is k=1. |
eps |
the error bound for approximate searches; the default of eps=0.0 requests an exact search. |
tree.type |
the data structures kd-tree or bd-tree as quoted key words kd and bd, respectively. A brute force search can be specified with the quoted key word brute. If brute is specified, then all subsequent arguments are ignored. The default is the kd-tree. |
search.type |
either standard or priority search in the kd-tree or bd-tree, specified by quoted key words standard and priority, respectively. The default is the standard search. |
bucket.size |
the maximum number of reference points in the leaf nodes. The default is 1. |
split.rule |
is the strategy for the recursive splitting of those nodes with more points than the bucket size. The splitting rule applies to both the kd-tree and bd-tree. Splitting rule options are the quoted key words standard, midpt, fair, sl_midpt, sl_fair, and suggest. The default is sl_midpt.
See supporting documentation, reference below, for a thorough description and discussion of these splitting rules. |
shrink.rule |
applies only to the bd-tree and is an additional
strategy (beyond the splitting rule) for the recursive partitioning
of nodes. This argument is ignored if tree.type is specified as kd or brute. Shrinking rule options are the quoted key words none, simple, and centroid. The default is simple.
See supporting documentation, reference below, for a thorough description and discussion of these shrinking rules. |
verbose |
if TRUE, search progress is printed to the screen. |
... |
currently no additional arguments. |
The ann
function calls portions of the Approximate Nearest
Neighbor Library, written by David M. Mount. All of the ann
function arguments are detailed in the ANN Programming Manual
found at https://www.cs.umd.edu/~mount/ANN/.
An object of class ann, which is a list with some or all of the following tags:
knnIndexDist |
an m x 2k matrix. Each row corresponds to a target point; the first k columns hold the indices (row numbers in ref) of the k nearest reference points, and the next k columns hold the corresponding distances. |
searchTime |
total search time, not including data structure construction, etc. |
k |
as defined in the ann function call. |
eps |
as defined in the ann function call. |
tree.type |
as defined in the ann function call. |
search.type |
as defined in the ann function call. |
bucket.size |
as defined in the ann function call. |
split.rule |
as defined in the ann function call. |
shrink.rule |
as defined in the ann function call. |
Andrew O. Finley [email protected]
## Make a couple of bivariate normal classes
rmvn <- function(n, mu=0, V = matrix(1)) {
  p <- length(mu)
  if(any(is.na(match(dim(V),p)))) stop("Dimension problem!")
  D <- chol(V)
  matrix(rnorm(n*p), ncol=p) %*% D + rep(mu,rep(n,p))
}

m <- 10000

## Class 1.
mu.1 <- c(20, 40)
V.1 <- matrix(c(-5,1,0,5),2,2); V.1 <- V.1%*%t(V.1)
c.1 <- cbind(rmvn(m, mu.1, V.1), rep(1, m))

## Class 2.
mu.2 <- c(30, 60)
V.2 <- matrix(c(4,2,0,2),2,2); V.2 <- V.2%*%t(V.2)
c.2 <- cbind(rmvn(m, mu.2, V.2), rep(2, m))

## Class 3.
mu.3 <- c(15, 60)
V.3 <- matrix(c(5,5,0,5),2,2); V.3 <- V.3%*%t(V.3)
c.3 <- cbind(rmvn(m, mu.3, V.3), rep(3, m))

c.all <- rbind(c.1, c.2, c.3)
max.x <- max(c.all[,1]); min.x <- min(c.all[,1])
max.y <- max(c.all[,2]); min.y <- min(c.all[,2])

## Check them out.
plot(c.1[,1], c.1[,2], xlim=c(min.x, max.x), ylim=c(min.y, max.y),
     pch=19, cex=0.5, col="blue", xlab="Variable 1", ylab="Variable 2")
points(c.2[,1], c.2[,2], pch=19, cex=0.5, col="green")
points(c.3[,1], c.3[,2], pch=19, cex=0.5, col="red")

## Take a reference sample.
n <- 2000
ref <- c.all[sample(1:nrow(c.all), n),]

## Compare search times
k <- 10

## Do a simple brute force search.
brute <- ann(ref=ref[,1:2], target=c.all[,1:2],
             tree.type="brute", k=k, verbose=FALSE)
print(brute$searchTime)

## Do an exact kd-tree search.
kd.exact <- ann(ref=ref[,1:2], target=c.all[,1:2],
                tree.type="kd", k=k, verbose=FALSE)
print(kd.exact$searchTime)

## Do an approximate kd-tree search.
kd.approx <- ann(ref=ref[,1:2], target=c.all[,1:2],
                 tree.type="kd", k=k, eps=100, verbose=FALSE)
print(kd.approx$searchTime)

## Takes too long to calculate for this many targets.
## Compare overall accuracy of the exact vs. approximate search
##knn.mode <- function(knn.indx, ref){
##  x <- ref[knn.indx,]
##  as.numeric(names(sort(as.matrix(table(x))[,1],
##             decreasing=TRUE))[1])
##}
Some of the nearest neighbors found using yai or newtargets are removed using this function. This is possible when there are several reference observations for each target, as is the case with k>1. The function removes neighbor reference observations for a given target if the reference and target are in (a) the same group or (b) different groups, depending on the method used. Group membership is identified for reference and target observations using two vectors, refGroups for references and trgGroups for targets. If the group membership code is the same for a reference and a target, they are in the same group, while different codes mean a lack of common group membership.
applyMask(object,refGroups=NULL, trgGroups=NULL, method="removeWhenCommon", k=1)
object |
an object of class yai. |
refGroups |
a vector, with length equal to the number of reference observations, of codes that indicate group membership. |
trgGroups |
a vector, with length equal to the number of target observations,
of codes that indicate group membership. The data type and coding scheme of trgGroups must match those of refGroups. |
method |
is the strategy used for removing neighbors from the
object: removeWhenCommon removes references that are in the same group as the target, while keepWhenCommon keeps only those references that are in the same group as the target. |
k |
the number of nearest neighbors to keep. |
An object of class yai, that is a copy of the first argument with the following elements replaced:
call |
the call. |
neiDstTrgs |
a matrix of distances between a target (identified by its row name) and the k references. There are k columns. |
neiIdsTrgs |
a matrix of reference identifications that correspond to neiDstTrgs. |
neiDstRefs |
set NULL as if noRefs=TRUE were used in the call to yai. |
neiIdsRefs |
set NULL as if noRefs=TRUE were used in the call to yai. |
noRefs |
set TRUE regardless of original value. |
k |
the value of k. |
Nicholas L. Crookston [email protected]
Acknowledgment: This function was inspired by correspondence with Clara Anton Fernandez.
require(yaImpute)

data(iris)

# build a base case, there are no targets,
# turn off getting references neighbors.
mal <- yai(x=iris[,-5],method="mahalanobis", noRefs=TRUE)

# create new data, just a copy of the old with new row names.
iris2 <- iris
rownames(iris2) <- paste0("new.",rownames(iris))

# do an imputation with k=55
m55 <- newtargets(mal,newdata=iris2,k=55)

# get the 2 closest where the species codes don't match by
# removing neighbors when the ref group membership is
# in common with the target group membership (same species),
# thereby forcing neighbors to be from different species.
# in this case, the groups are species codes.
applyMask(m55,refGroups=iris$Species,trgGroups=iris2$Species,
          method="removeWhenCommon",k=2)

# get the 2 closest where the species codes do match by
# removing neighbors when the ref group membership is
# different than the target group membership (different species),
# thereby forcing neighbors to be from the same species (this
# is generally true anyway using the iris data).
applyMask(m55,iris$Species,trgGroups=iris2$Species,
          method="keepWhenCommon",k=2)
AsciiGridImpute
finds nearest neighbor reference
observations for each point in the input grid maps and outputs maps
of selected Y-variables in a corresponding set of output grid maps.
AsciiGridPredict
applies a predict function to each point in the
input grid maps and outputs maps of the prediction(s) in corresponding
output grid maps (see Details).
One row of each grid map is read and processed at a time thereby avoiding the need to build huge objects in R that would be necessary if all the rows of all the maps were processed together.
AsciiGridImpute(object,xfiles,outfiles,xtypes=NULL,ancillaryData=NULL,
                ann=NULL,lon=NULL,lat=NULL,rows=NULL,cols=NULL,
                nodata=NULL,myPredFunc=NULL,...)

AsciiGridPredict(object,xfiles,outfiles,xtypes=NULL,lon=NULL,lat=NULL,
                 rows=NULL,cols=NULL,nodata=NULL,myPredFunc=NULL,...)
object |
An object of class yai (for AsciiGridImpute), or an object for which a predict function is defined (for AsciiGridPredict); may be NULL when myPredFunc is supplied. |
xfiles |
A list of input map file names, where the name of each list element identifies the X-variable the map corresponds to. |
outfiles |
One of these two forms:
(1) a list of output map file names, where the name of each list element identifies the variable to output (the special names distance and useid may be used to output the neighbor distance and the row index of the reference used), or (2) a single output file name. |
xtypes |
A list of data type names that corresponds exactly to data type of the
maps listed in xfiles. |
ancillaryData |
A data frame of Y-variables that may not have been used in
the original call to yai; there must be one row for each reference observation. |
ann |
if NULL, the value is taken from object; passed to newtargets. |
lon |
if NULL, the value of XLLCORNER is taken from the input map headers; otherwise the longitude of the lower left corner of the output. |
lat |
if NULL, the value of YLLCORNER is taken from the input map headers; otherwise the latitude of the lower left corner of the output. |
rows |
if NULL, all rows from the input grids are used. Otherwise, rows is a 2-element
vector giving the rows desired for the output. If the second element is greater than
the number of rows, the header value NROWS is adjusted accordingly. |
cols |
if NULL, all columns from the input grids are used. Otherwise, cols is a 2-element
vector giving the columns desired for the output. If the first element is greater than
one, the header value XLLCORNER is adjusted accordingly. |
nodata |
the NODATA_VALUE for the output maps; if NULL, the value is taken from the input maps. |
myPredFunc |
called by AsciiGridPredict to predict using object and the newdata built from the input maps; if NULL, the generic predict function is called. |
... |
passed to myPredFunc or the predict function. |
The input maps are assumed to be Asciigrid maps with 6-line headers containing the following tags: NCOLS, NROWS, XLLCORNER, YLLCORNER, CELLSIZE and NODATA_VALUE (case insensitive). The headers should be identical for all input maps; a warning is issued if they are not. It is critical that NODATA_VALUE is the same on all input maps.
The function builds data frames from the input maps one row at a time and builds predictions using those data frames as newdata. Each row of the input maps is processed in sequence so that the entire maps are never stored in memory. The function works by opening all the input files and reading one line (row) at a time from each. The output file(s) are created one line at a time as the input maps are processed.
Use AsciiGridImpute for objects built with yai, otherwise use AsciiGridPredict. When AsciiGridPredict is used, the following rules apply. First, when myPredFunc is not null it is called with the arguments object, newdata, ..., where newdata is the data frame built from the input maps; otherwise the generic predict function is called with these same arguments. When object and myPredFunc are both NULL, a copy of newdata is used as the prediction. This is useful when lat, lon, rows, or cols are used to subset the maps.
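For illustration, here is a minimal sketch of a custom prediction function; the name myFun and the surrounding objects are hypothetical. AsciiGridPredict calls it with the model object and the newdata frame built from one row of the maps:

# a minimal, hypothetical myPredFunc: it must accept the model object and
# a newdata data frame, and return one prediction per newdata row
myFun <- function(object, newdata, ...) {
  predict(object, newdata = newdata, ...)
}

# hypothetical usage, assuming fit is a fitted model and xfiles is a
# list of input maps set up as in the example below:
# AsciiGridPredict(fit, xfiles, list(predict = "out.txt"), myPredFunc = myFun)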
The NODATA_VALUE is output for every NODATA_VALUE found on any grid cell on any one of the input maps (the predict function is not called for these grid cells). NODATA_VALUE is also output for any grid cell where the predict function returns an NA.
If factors are used as X-variables in object, the levels found in the map data are checked against those used in building the object. If new levels are found, the corresponding output map grid point is set to NODATA_VALUE; the predict function is not called for these cells as most predict functions will fail in these circumstances. Checking on factors depends on object containing a meaningful member named xlevels, as done for objects produced by lm.
Asciigrid maps do not contain character data, only numbers. The numbers in the maps are matched to the xlevels by subscript (the first entry in a level corresponds to the numeric value 1 in the Asciigrid maps, the second to the number 2, and so on). Care must be taken by the user to ensure that the coding scheme used in building the maps is identical to that used in building the object. See Value for information on how you can check the matching of these codes.
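As a minimal illustration of this coding convention (plain R, independent of this package), numeric map values decode to factor levels by position:

# factor levels and their numeric codes
f <- factor(c("setosa", "versicolor", "virginica"))
as.numeric(f)       # 1 2 3 -- the numeric codes that belong in a map
levels(f)[c(1, 3)]  # decodes map values 1 and 3 to "setosa" and "virginica"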
An invisible list containing the following named elements:
unexpectedNAs |
A data frame listing the map row numbers and the number
of NA predictions generated for map rows that contained no NODATA_VALUE cells. |
illegalLevels |
A data frame listing levels found in the maps that
were not found in the xlevels of the object. |
outputLegend |
A data frame showing the relationship between levels in
the output maps and those found in the object (or in ancillaryData). |
inputLegend |
A data frame showing the relationship between levels found in
the input maps and those found in the xlevels of the object. |
Nicholas L. Crookston [email protected]
yai
, impute
, and newtargets
## These commands write new files to your working directory

# Use the iris data
data(iris)

# Section 1: Imagine that the iris are planted in a planting bed.
# The following set of commands creates Asciigrid map
# files for four attributes to illustrate the planting layout.

# Change species from a character factor to numeric (the sp classes
# can not handle character data).
sLen <- matrix(iris[,1],10,15)
sWid <- matrix(iris[,2],10,15)
pLen <- matrix(iris[,3],10,15)
pWid <- matrix(iris[,4],10,15)
spcd <- matrix(as.numeric(iris[,5]),10,15)

# Create and change to a temp directory. You can delete these steps
# if you wish to keep the files in your working directory.
curdir <- getwd()
setwd(tempdir())
cat("Using working dir",getwd(),"\n")

# Make maps of each variable.
header = c("NCOLS 15","NROWS 10","XLLCORNER 1","YLLCORNER 1",
           "CELLSIZE 1","NODATA_VALUE -9999")
cat(file="slen.txt",header,sep="\n")
cat(file="swid.txt",header,sep="\n")
cat(file="plen.txt",header,sep="\n")
cat(file="pwid.txt",header,sep="\n")
cat(file="spcd.txt",header,sep="\n")

write.table(sLen,file="slen.txt",append=TRUE,col.names=FALSE,row.names=FALSE)
write.table(sWid,file="swid.txt",append=TRUE,col.names=FALSE,row.names=FALSE)
write.table(pLen,file="plen.txt",append=TRUE,col.names=FALSE,row.names=FALSE)
write.table(pWid,file="pwid.txt",append=TRUE,col.names=FALSE,row.names=FALSE)
write.table(spcd,file="spcd.txt",append=TRUE,col.names=FALSE,row.names=FALSE)

# Section 2: Create functions to predict species

# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)

# sample the data
refs <- sample(rownames(iris),50)
y <- data.frame(Species=iris[refs,5],row.names=rownames(iris[refs,]))

# build a yai imputation for the reference data.
rfNN <- yai(x=iris[refs,1:4],y=y,method="randomForest")

# make lists of input and output map files.
xfiles <- list(Sepal.Length="slen.txt",Sepal.Width="swid.txt",
               Petal.Length="plen.txt",Petal.Width="pwid.txt")
outfiles1 <- list(distance="dist.txt",Species="spOutrfNN.txt",
                  useid="useindx.txt")

# map the imputation-based predictions for the input maps
AsciiGridImpute(rfNN,xfiles,outfiles1,ancillaryData=iris)

# read the asciigrids and get them ready to plot
spOrig <- t(as.matrix(read.table("spcd.txt",skip=6)))
sprfNN <- t(as.matrix(read.table("spOutrfNN.txt",skip=6)))
dist   <- t(as.matrix(read.table("dist.txt",skip=6)))

# demonstrate the use of useid:
spViaUse <- read.table("useindx.txt",skip=6)
for (col in colnames(spViaUse))
  spViaUse[,col]=as.character(y$Species[spViaUse[,col]])

# demonstrate how to use factors:
spViaLevels <- read.table("spOutrfNN.txt",skip=6)
for (col in colnames(spViaLevels))
  spViaLevels[,col]=levels(y$Species)[spViaLevels[,col]]

identical(spViaLevels,spViaUse)

if (require(randomForest)) {
  # build a randomForest predictor
  rf <- randomForest(x=iris[refs,1:4],y=iris[refs,5])
  AsciiGridPredict(rf,xfiles,list(predict="spOutrf.txt"))
  sprf <- t(as.matrix(read.table("spOutrf.txt",skip=6)))
} else sprf <- NULL

# reset the directory to that where the example was started.
setwd(curdir)

par(mfcol=c(2,2),mar=c(1,1,2,1))
image(spOrig,main="Original",col=c("red","green","blue"),
      axes=FALSE,useRaster=TRUE)
image(sprfNN,main="Using Impute",col=c("red","green","blue"),
      axes=FALSE,useRaster=TRUE)
if (!is.null(sprf))
  image(sprf,main="Using Predict",col=c("red","green","blue"),
        axes=FALSE,useRaster=TRUE)
image(dist,main="Neighbor Distances",col=terrain.colors(15),
      axes=FALSE,useRaster=TRUE)
The number of best variables is estimated by finding an apparent inflection point in the relationship between the generalized root mean square distance (see grmsd) and the number of X-variables.
bestVars(obj,nbest=NULL)
obj |
an object created by varSelection. |
nbest |
number of variables designated as the best; if NULL the number is estimated. |
A character vector of variable names in decreasing order of importance.
Nicholas L. Crookston [email protected]
require(yaImpute)

data(iris)
set.seed(12345)

x <- iris[,1:2]  # Sepal.Length Sepal.Width
y <- iris[,3:4]  # Petal.Length Petal.Width

vsel <- varSelection(x=x,y=y,nboot=5,useParallel=FALSE)

bestVars(vsel)
Several objects of class yai
are combined into a new
object forming a consensus among the many. The intention is that the many would
be formed by running yai several times with bootstrap=TRUE
or by varying
other options.
buildConsensus(reps, noTrgs=FALSE, noRefs=FALSE, k=NULL)
reps |
a list of objects of class yai. |
noTrgs |
If TRUE, the neighbor information for target observations is not computed. |
noRefs |
If TRUE, the neighbor information for reference observations is not computed. |
k |
If not specified, the minimum value of k over the objects in reps is used. |
An object of class yai
Nicholas L. Crookston [email protected]
John Coulston [email protected]
require(yaImpute)

data(iris)
set.seed(123)

# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

reps <- replicate(20, yai(x=x,y=y,method="msn",bootstrap=TRUE,k=2),
                  simplify=FALSE)

buildConsensus(reps)
Provides a convenient display of the root mean square differences
(see rmsd.yai
) or correlations (see cor.yai
) between observed and
imputed values for each of several imputations. Each column of the returned
data frame corresponds to an imputation result and each row corresponds to a
variable.
compare.yai(...,ancillaryData=NULL,vars=NULL,method="rmsd",scale=TRUE)
... |
a list of objects created by yai or impute.yai. |
ancillaryData |
a data frame that defines new variables, passed to impute.yai. |
vars |
a list of variable names you want to include; if NULL all available variables are included. |
method |
when rmsd is specified, the comparison is based on root mean
square differences between observed and imputed values; when cor is specified, it is based on correlations. |
scale |
passed to rmsd.yai. |
A data.frame of class c("compare.yai","data.frame"), where the columns are the names of the ...-arguments and the rows are a union of variable names. NA's are returned when the variables are factors. The scale values (if used) are returned as an attribute; if some differ from others, a warning is issued.
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
yai
, plot.compare.yai
,
impute.yai
, rmsd.yai
require(yaImpute)

data(iris)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# compare the y variables
compare.yai(msn,mal)

# compare all the variables in iris
compare.yai(msn,mal,ancillaryData=iris)
# Species is a factor, no comparison is made
Computes the correlation between observed and imputed values for each observation that has both.
cor.yai (object,vars=NULL,...)
object |
an object created by yai or impute.yai. |
vars |
a list of variable names you want to include; if NULL all available variables are included. |
... |
passed to called methods (not currently used). |
The correlations are computed using cor. For data imputation, such correlations are likely not appropriate (Stage and Crookston 2007).
A data frame with the row names as vars and the column as cor
.
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
Stage, A.R.; Crookston, N.L. (2007). Partitioning error components for accuracy-assessment of near neighbor methods of imputation. For. Sci. 53(1):62-72. https://academic.oup.com/forestscience/article/53/1/62/4604364
Change the neighbor selections in a yai
object such that
bias (if any) in the average value of an expression
of one or more variables is reduced to be within a defined
confidence interval.
correctBias(object,trgVal,trgValCI=NULL,nStdev=1.5,excludeRefIds=NULL,trace=FALSE)
object |
an object of class yai. |
trgVal |
an expression, or a character string that is converted to an expression, defining the value of interest computed from one or more variables in the reference observations. |
trgValCI |
The confidence interval that should contain the population mean of trgVal; if NULL, one is computed from the reference observations using nStdev. |
nStdev |
the number of standard deviations in the vector of values used to
compute the confidence interval when one is computed, ignored if trgValCI is specified. |
excludeRefIds |
identities of reference observations to exclude from the population; if coded as "all", then all references are excluded (see details). |
trace |
if TRUE, detailed output is produced. |
Imputation as it is defined in yaImpute can yield biased results. Let's say that you have a collection of reference observations that happen to be selected in a non-biased way from a population. In this discussion, a population is a finite set of all the individual sample units of interest; the reference plus target observations often represent this population (but this need not be true, see below). If the average of a measured attribute is computed from this random sample, it is an unbiased estimate of the true mean.
Using yai
, while setting k=1, values for each of several
attributes are imputed from a single reference observation to a target
observation. Once the imputation is done over all the target observations, an
average of any one measured attribute can be computed over all the observations
in the population. There is no guarantee that this average will be within a
pre-specified confidence interval.
Experience shows that despite any lack of guarantee, the results are accurate
(not biased). This tends to hold true when the reference data contains
samples that cover the variation of the targets, even when they are not a
random sample, and even if some of the reference observations are from sample
units that are outside the target population.
Because there is no guarantee, and because the reference observations might
profitably come from sample units beyond those in the population
(so as to ensure all kinds of targets have a matching reference), it is necessary
to test the imputation results for bias. If bias is found, it would be helpful to
do something to correct it.
The correctBias()
function is designed to check for, and correct
discovered bias by selecting alternative nearby reference observations
to be imputed to targets that contribute to the bias. The idea is that even if
one reference is closest to a target, its attribute(s) of interest might
be greater (or less) than the mean. An alternative neighbor, one that may be
almost as close, might reduce the overall bias if it were used instead. If this is
the case, correctBias()
switches the neighbor selections.
It makes as many switches as it can until the mean among the population
targets falls within the specified confidence interval.
There is no guarantee that the goal will be met.
The details of the method are:
1. An attribute of interest is established by naming one in the call with argument trgVal. Note that this can be a simple variable name enclosed in quotation marks or it can be an expression of one or more variables.
If the former, it is converted into an expression that is executed in the
environment of the reference
observations (both the X- and Y-variables).
A confidence interval is computed for this value under the assumption that the
reference observations are an unbiased sample of the target population. This may
not be the case. Regardless, a confidence interval is necessary and it can
alternatively be supplied using trgValCI
.
2. One of several possible passes through the data is taken to find neighbor switches
that will result in the bias being corrected. A pass includes computing the
attribute of interest by applying the expression
to values imputed to all
the targets, under the assumption that the next neighbor is used
in place of the currently used neighbor. This computation results in a
vector with one element for each target observation that
measures the contribution toward reducing the bias that would be made if
a switch were made. The target observations are then ordered
into increasing order of how much the distance from the currently selected
reference would increase if the switch were to take place. Enough switches are
made in this order to correct the bias.
If the bias is not corrected by the first pass, another pass is done
using the next neighbor(s).
The number of possible passes is equal to k-1 where k is
set in the original call to yai
. Note that switches are made
among targets only, and never among reference observations that may make up the
population. That is, reference observations are always left
to represent themselves with k=1
.
3. Here are details of the argument excludeRefIds
. When computing the mean
of the attribute of interest (using the expression
),
correctBias()
must know which observations represent the
population. Normally, all the target observations would be in this set, but perhaps
not all of the reference observations. When excludeRefIds
is left NULL, the population is made of all reference and all
target observations. Reference observations that should be left out
of the calculations because they are not part of the population can be specified
using the excludeRefIds
argument as a vector of character strings identifying
the rownames to leave out, or a vector of row numbers that identify the row numbers to
leave out. If excludeRefIds="all"
, all reference observations are excluded.
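As a brief sketch (using msn as built in the example below; the combination shown is hypothetical), trgVal may be given as a quoted name or an expression, and excludeRefIds restricts the population:

# check for bias in the mean of a single variable:
# msnCorr <- correctBias(msn, trgVal = "Petal.Width")

# the same check with an expression, excluding all references
# from the population mean:
# msnCorr2 <- correctBias(msn, trgVal = expression(Petal.Width),
#                         excludeRefIds = "all")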
An object of class yai
where k = 1
and the neighbor selections
have been changed as described above. In addition, the call
element is changed
to show both the original call to yai
and the call to this function.
A new list called biasParameters
is added to the yai
object
with these tags:
trgValCI |
the target CI. |
curVal |
the value of the bias that was achieved. |
npasses |
the number of passes through the data taken to achieve the result. |
oldk |
the old value of k. |
Nicholas L. Crookston [email protected]
data(iris)
set.seed(12345)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species

# build an msn run, first build dummy variables for species.
sp1 <- as.integer(iris$Species=="setosa")
sp2 <- as.integer(iris$Species=="versicolor")
y2 <- data.frame(cbind(iris[,4],sp1,sp2),row.names=rownames(iris))
y2 <- y2[refs,]
names(y2) <- c("Petal.Width","Sp1","Sp2")

# find 5 reference neighbors for each target
msn <- yai(x=x,y=y2,method="msn",k=5)

# check for and correct bias in mean "Petal.Width". Neighbor
# selections will be changed as needed to bring the imputed values
# into line with the CI. In this case, no changes are made (npasses
# returns as zero).
msnCorr <- correctBias(msn,trgVal="Petal.Width")

msnCorr$biasParameters
Several objects of class impute.yai
or yai
are combined by computing the mean, median, or mode of separate, individual imputations.
The intention is that the members of the first argument would be formed by running
yai several times with bootstrap=TRUE
or by varying other options.
ensembleImpute(imputes, method="mean",...)
imputes |
a list of objects of class impute.yai or yai. |
method |
when "mean", the continuous variables are averaged using mean; otherwise the median is used. The modal value is used for factors and character data. |
... |
passed to impute.yai, for example ancillaryData. |
An object of class c("impute.yai","data.frame")
, see impute.yai
.
The attributes of the data.frame include the following:
sd - A data.frame of standard deviations for continuous variables if there are any. The columns are not reported if the standard deviation is zero for all observations, which is typically true of "observed" values.
N - the number of replications used to compute the corresponding data;
reported only if the number differs from the total number of replications. This
will be the case when bootstrap
, sampleVar
, or both are used in
yai
.
methods - the method used for each variable.
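A brief sketch of retrieving these attributes from the returned object (ens is a hypothetical name; see the example below for building reps):

# ens <- ensembleImpute(reps)
# attr(ens, "sd")       # per-variable standard deviations, if any
# attr(ens, "methods")  # the method used for each variable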
Nicholas L. Crookston [email protected]
John Coulston [email protected]
require(yaImpute)

data(iris)
set.seed(123)

# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

reps <- replicate(10, yai(x=x,y=y,method="msn",bootstrap=TRUE,k=2),
                  simplify=FALSE)

ensembleImpute(reps,ancillaryData=iris)
Error properties of estimates derived from imputation differ from those of regression-based estimates because the two methods include a different mix of error components. This function computes a partitioning of error statistics as proposed by Stage and Crookston (2007).
errorStats(mahal,...,scale=FALSE,pzero=0.1,plg=0.5,seeMethod="lm")
mahal |
An object of class yai created using method mahalanobis. |
... |
Other objects of class yai created using the same reference data. |
scale |
When TRUE, the statistics are computed using scaled values. |
pzero |
The lower tail p-value used to pick reference observations that are zero
distance from each other (used to compute rmmsd0). |
plg |
The upper tail p-value used to pick reference observations that are
substantially distant from each other (used to compute rmsdlg). |
seeMethod |
Method used to compute see; the default is "lm". |
See https://academic.oup.com/forestscience/article/53/1/62/4604364
A list that contains several data frames. The column names of each are a combination of the name of the object used to compute the statistics and the name of the statistic. The rownames correspond to the Y-variables from the first argument. The data frame names are as follows:
common |
statistics used to compute other statistics. |
name of first argument |
error statistics for the first yai object. |
names of ... arguments |
error statistics for each of the remaining yai objects. |
see |
standard error of estimate for individual regressions fit for corresponding Y-variables. |
rmmsd0 |
root mean square difference for imputations based on reference observations that are near zero distance from each other. |
mlf |
square root of the model lack of fit: mlf^2 = rmmsd0^2 - see^2. |
rmsd |
root mean square error. |
rmsdlg |
root mean square error of the observations with larger distances. |
sei |
standard error of imputation. |
dstc |
distance component of the imputation error. |
Note that unlike Stage and Crookston (2007), all statistics reported here are in the natural units, not squared units.
Nicholas L. Crookston [email protected]
Albert R. Stage
Stage, A.R.; Crookston, N.L. (2007). Partitioning error components for accuracy-assessment of near neighbor methods of imputation. For. Sci. 53(1):62-72. https://academic.oup.com/forestscience/article/53/1/62/4604364
require(yaImpute)

data(TallyLake)

diag(cov(TallyLake[,1:8]))  # see col A in Table 3 in Stage and Crookston

mal <- yai(x=TallyLake[,9:29],y=TallyLake[,1:8],
           noTrgs=TRUE,method="mahalanobis")
msn <- yai(x=TallyLake[,9:29],y=TallyLake[,1:8],
           noTrgs=TRUE,method="msn")

# variable "see" for "mal" matches col B (when squared and scaled)
# other columns don't match exactly as Stage and Crookston used different
# software to compute values
errorStats(mal,msn)
Provides a matrix of all observations, giving for each the identification of the reference observation best used to represent it, followed by the distance.
foruse(object,kth=NULL,method="kth",targetsOnly=FALSE)
object |
an object created by yai or newtargets. |
kth |
when NULL, the closest (first) reference is reported; otherwise the kth closest reference is reported. |
method |
the method used to select references to represent observations,
as follows: kth uses the kth nearest reference, random picks one of the k nearest at random, and randomWeighted picks one of the k nearest at random with probability weighted inversely by distance. |
targetsOnly |
when TRUE, reporting of references is not done. |
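A brief sketch of the selection strategies named above, assuming mal as built in the example below:

# foruse(mal)                            # default: the closest reference
# foruse(mal, method = "random")         # random choice among the k
# foruse(mal, method = "randomWeighted") # distance-weighted random choice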
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
require(yaImpute)

data(iris)

# form some test data
set.seed(1234)
refs <- sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species

# build a yai object using mahalanobis
mal <- yai(x=x,y=y,method="mahalanobis",k=3)

foruse(mal)        # for references, use is equal to the rowname
foruse(mal,kth=1)  # for references, use is the kth reference

# get all the choices:
cbind(foruse(mal),foruse(mal,kth=1),foruse(mal,kth=2),foruse(mal,kth=3))
Computes the root mean square distance between predicted and corresponding observed values in an orthogonal multivariate space. This value is the mean Mahalanobis distance between observed and imputed values in a space defined by the observations and variables where observed and predicted values are defined. The statistic provides a way to compare imputation (or prediction) results. While it is designed to work with imputation, the function can be used with objects that inherit from lm or with matrices and data frames that follow the column naming convention described in the details.
grmsd(...,ancillaryData=NULL,vars=NULL,wts=NULL,rtnVectors=FALSE,imputeMethod="closest")
... |
objects created by any combination of
yai, impute.yai, lm, and matrices or data frames that follow the column naming convention described in the details. |
ancillaryData |
a data frame that defines variables, passed to
impute.yai. |
vars |
a list of variable names you want to include; if NULL all available
variables are included (note that for impute.yai objects the
Y-variables are returned when vars=NULL). |
wts |
A vector of weights used to compute the mean distances, see details below. |
rtnVectors |
The vectors of individual distances are returned (see Value) rather than the mean Mahalanobis distance. |
imputeMethod |
passed as method to impute.yai. |
This function is designed to compute the root mean square distance between observed
and predicted observations over several variables at once. It is the Mahalanobis
distance between observed and predicted but the name emphasizes the similarities
to root mean square difference (or error, see rmsd
).
Here are some notable characteristics.
In the univariate case this function returns the same value as
rmsd
with scale=TRUE
. In that case
the root mean square difference is computed after scale
has been called on the variable.
Like rmsd
, grmsd
is zero if the imputed values are
exactly the same as the observed values over all variables.
Like rmsd
, grmsd
is ~1.0 when the mean of each
variable is imputed in place of a near neighbor (it would be exactly 1.0 if
the maximum likelihood estimate of the covariance were used rather than
the unbiased estimate – it approaches 1 as n gets large.)
This situation corresponds to regression where the slope is zero.
It indicates that the imputation error is, over all, the same as it
would be if the means of the variables were imputed rather than near
neighbors (it does not signal that the means were imputed).
Like rmsd
, values of grmsd > 1.0 indicate that, on average,
the errors in the imputation are greater than they would be if the mean
of the corresponding variables were imputed for each observation.
Note that individual rmsd
values can be computed even when
the variance of the variable is zero. In contrast, grmsd
can
only be computed in the situation where the observed data matrix is full rank.
Rank is determined using qr
and columns are removed from the
analysis to create this condition if necessary (with a warning).
Observed and predicted are matched using the column names. Column names
that have ".o
" are matched to those that do not. Columns that do not
have matching observed and imputed (predicted) values are ignored.
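As a minimal sketch of this naming convention (the variable names v1 and v2 are arbitrary; assumes yaImpute is loaded), a plain data frame can be passed directly when each predicted column has a matching ".o" observed column:

require(yaImpute)
set.seed(1)
obs  <- matrix(rnorm(20), 10, 2)          # "observed" values
pred <- obs + rnorm(20, sd = 0.1)         # "predicted" values
d <- data.frame(pred, obs)
names(d) <- c("v1", "v2", "v1.o", "v2.o") # predicted/observed pairs
grmsd(d)                                  # one mean Mahalanobis distance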
Several objects may be passed as "...". Function impute.yai
is
called for any objects that were created by yai
;
ancillaryData
and vars
are passed to impute.yai
when it is used.
When objects inherit from lm, a suitable matrix is formed by calling the predict and resid functions.
Factors, if found, are removed (with a warning).
When argument wts
is defined there must be one value for each pair of
observed and predicted variables. If the values are named (preferred), then
the names are matched to the names of predicted variables (no .o
suffix).
The matched values effectively scale the axes in which distances are computed.
When this is done, the resulting distances are not Mahalanobis distances.
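A brief sketch of weighted distances (msn as built in the examples below; the weight values are arbitrary):

# named weights rescale the axes before distances are computed;
# the result is then no longer a Mahalanobis distance
# grmsd(msn, wts = c(Petal.Length = 1, Petal.Width = 2))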
When rtnVectors=FALSE
, a sorted named vector of mean distances
is returned; the names are taken from the arguments.
When rtnVectors=TRUE, the function returns vectors of distances, sorted and named as done when this argument is FALSE.
Nicholas L. Crookston [email protected]
yai
, impute.yai
, rmsd.yai
,
notablyDifferent
require(yaImpute)

data(iris)
set.seed(12345)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# compute the average distances between observed and imputed (predicted)
grmsd(msn,mal,lmFit=lm(as.matrix(y) ~ ., data=x[refs,]))

# use all the variables and observations in iris
# Species is a factor and is automatically deleted with a warning
grmsd(msn,mal,ancillaryData=iris)

# here is an example using lm, and another using column
# means as predictions.
impMean <- y
colnames(impMean) <- paste0(colnames(impMean),".o")
impMean <- cbind(impMean,y)

# set the predictions to the means of the variables
impMean[,"Petal.Length"] <- mean(impMean[,"Petal.Length"])
impMean[,"Petal.Width"]  <- mean(impMean[,"Petal.Width"])

grmsd(msn, mal, lmFit=lm(as.matrix(y) ~ ., data=x[refs,]), impMean)

# compare to using function rmsd (values match):
msnimp <- na.omit(impute(msn))
grmsd(msnimp[,c("Petal.Length","Petal.Length.o")])
rmsd(msnimp[,c("Petal.Length","Petal.Length.o")],scale=TRUE)

# these are multivariate cases and they don't match
# because the covariance of the two variables is > 0.
grmsd(msnimp)
colSums(rmsd(msnimp,scale=TRUE))/2

# get the vectors and make a boxplot, identify outliers
stats <- boxplot(grmsd(msn,mal,ancillaryData=iris[,-5],rtnVectors=TRUE),
                 ylab="Mahalanobis distance")
stats$out
#     118      132
#2.231373 1.990961
Imputes the values of variables from a reference observation to a target observation. Also imputes values for a reference from other references. This practice is useful for validation (see yai). Variables not available in the original data may be imputed using argument ancillaryData.
## S3 method for class 'yai'
impute(object,ancillaryData=NULL,method="closest",
       method.factor=method,k=NULL,vars=NULL,
       observed=TRUE,...)
object |
an object of class yai. |
ancillaryData |
a data frame of variables that may not have been used in
the original call to yai; there must be one row for each reference observation. |
method |
the method used to compute the imputed values for continuous variables,
as follows: closest uses the closest neighbor, mean averages the k neighbors, median uses the median of the k neighbors, and dstWeighted computes a weighted average of the k neighbors with weights inversely related to distance. |
method.factor |
the method used to compute the imputed values for factors, as follows: closest uses the closest neighbor, while mean and dstWeighted pick the level with the largest (weighted) proportion among the k neighbors. |
k |
the number of neighbors to use in averages; when NULL all present are used. |
vars |
a character vector of variables to impute; when NULL, the behaviour depends
on the value of ancillaryData. |
observed |
when TRUE, columns are created for observed values (those from the target observations) as well as imputed values (those from the reference observations). |
... |
passed to other methods, currently not used. |
An object of class c("impute.yai","data.frame")
, with rownames
identifying observations and column names identifying variables. When
observed=TRUE additional columns are created with a suffix of
.o.
NA's fill columns of observed values when no
corresponding value is known, as in the case for Y-variables from
target observations.
Scale factors for each variable are
returned as an attribute (see attributes
).
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
Emilie Henderson [email protected]
require(yaImpute)

data(iris)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species

# build a yai object using mahalanobis
mal <- yai(x=x,y=y,method="mahalanobis")

# output a data frame of observed and imputed values
# of all variables and observations.
impute(mal)

malImp <- impute(mal,ancillaryData=iris)
plot(malImp)
Data used to compare the utility of discrete-return light detection and ranging (LiDAR) data and multispectral satellite imagery, and their integration, for modeling and mapping basal area and tree density across two diverse coniferous forest landscapes in north-central Idaho, USA.
data(MoscowMtStJoe)
A data frame with 165 rows and 64 columns:
Ground based measurements of trees:
ABGR_BA - Basal area of ABGR
ABLA_BA - Basal area of ABLA
ACGL_BA - Basal area of ACGL
BEOC_BA - Basal area of BEOC
LAOC_BA - Basal area of LAOC
PICO_BA - Basal area of PICO
PIEN_BA - Basal area of PIEN
PIMO_BA - Basal area of PIMO
PIPO_BA - Basal area of PIPO
POBA_BA - Basal area of POBA
POTR_BA - Basal area of POTR
PSME_BA - Basal area of PSME
SAEX_BA - Basal area of SAEX
THPL_BA - Basal area of THPL
TSHE_BA - Basal area of TSHE
TSME_BA - Basal area of TSME
UNKN_BA - Basal area of unknown species
Total_B - Basal area total over all species
ABGR_TD - Trees per ha of ABGR
ABLA_TD - Trees per ha of ABLA
ACGL_TD - Trees per ha of ACGL
BEOC_TD - Trees per ha of BEOC
LAOC_TD - Trees per ha of LAOC
PICO_TD - Trees per ha of PICO
PIEN_TD - Trees per ha of PIEN
PIMO_TD - Trees per ha of PIMO
PIPO_TD - Trees per ha of PIPO
POBA_TD - Trees per ha of POBA
POTR_TD - Trees per ha of POTR
PSME_TD - Trees per ha of PSME
SAEX_TD - Trees per ha of SAEX
THPL_TD - Trees per ha of THPL
TSHE_TD - Trees per ha of TSHE
TSME_TD - Trees per ha of TSME
UNKN_TD - Trees per ha of unknown species
Total_T - Trees per ha total over all species
Geographic Location, Slope and Aspect:
EASTING - UTM (Zone 11) easting at plot center
NORTHING - UTM (Zone 11) northing at plot center
ELEVATION - Mean elevation (m) above sea level over plot
SLPMEAN - Mean slope (percent) over plot
ASPMEAN - Mean aspect (degrees) over plot
Advanced Land Imager (ALI):
B1MEAN - Mean of 30 m ALI band 1 pixels intersecting plot
B2MEAN - Mean of 30 m ALI band 2 pixels intersecting plot
B3MEAN - Mean of 30 m ALI band 3 pixels intersecting plot
B4MEAN - Mean of 30 m ALI band 4 pixels intersecting plot
B5MEAN - Mean of 30 m ALI band 5 pixels intersecting plot
B6MEAN - Mean of 30 m ALI band 6 pixels intersecting plot
B7MEAN - Mean of 30 m ALI band 7 pixels intersecting plot
B8MEAN - Mean of 30 m ALI band 8 pixels intersecting plot
B9MEAN - Mean of 30 m ALI band 9 pixels intersecting plot
PANMEA - Mean of 10 m PAN band pixels intersecting plot
PANSTD - Standard deviation of 10 m PAN band pixels intersecting plot
LiDAR Intensity:
INTMEAN - Mean of 2 m intensity pixels intersecting plot
INTSTD - Standard deviation of 2 m intensity pixels intersecting plot
INTMIN - Minimum of 2 m intensity pixels intersecting plot
INTMAX - Maximum of 2 m intensity pixels intersecting plot
LiDAR Height:
HTMEAN - Mean of 6 m height pixels intersecting plot
HTSTD - Standard deviation of 6 m height pixels intersecting plot
HTMIN - Minimum of 6 m height pixels intersecting plot
HTMAX - Maximum of 6 m height pixels intersecting plot
LiDAR Canopy Cover:
CCMEAN - Mean of 6 m canopy cover pixels intersecting plot
CCSTD - Standard deviation of 6 m canopy cover pixels intersecting plot
CCMIN - Minimum of 6 m canopy cover pixels intersecting plot
CCMAX - Maximum of 6 m canopy cover pixels intersecting plot
Dr. Andrew T. Hudak
USDA Forest Service
Rocky Mountain Research Station
1221 South Main
Moscow, Idaho, USA 83843
Hudak, A.T.; Crookston, N.L.; Evans, J.S.; Falkowski, M.J.; Smith, A.M.S.; Gessler, P.E.; Morgan, P. (2006). Regression modeling and mapping of coniferous forest basal area and tree density from discrete-return lidar and multispectral satellite data. Can. J. Remote Sensing. 32(2):126-138. https://www.tandfonline.com/doi/abs/10.5589/m06-007
Provides a matrix of reference observations that are used most often as sources of imputation and a column of the counts. The observations are listed in sorted order, most often used first.
mostused(object,n=20,kth=NULL)
object |
(1) a data frame created by foruse, or (2) an object of class yai. |
n |
the number of mostused in sorted order. |
kth |
passed to foruse when object is of class yai. |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
require(yaImpute)

data(iris)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species

# build a yai object using mahalanobis
mal <- yai(x=x,y=y,method="mahalanobis")

mostused(mal,kth=1)
Finds nearest neighbor reference observations for a given set of target
observations using an established yai object. Intended use is to
facilitate breaking up large imputation problems (see AsciiGridImpute).
newtargets(object,newdata,k=NULL,ann=NULL,nVec=NULL)
object |
an object of class |
newdata |
a data frame or matrix of new targets for which neighbors
are found. Must include at least the X-variables used in the original call to
|
k |
if NULL, the value is taken from |
ann |
if NULL, the value is taken from |
nVec |
If defined, specifies the number of axes in the CCA |
An object of class yai
, that is a copy of the first argument with the
following elements replaced:
call |
the call. |
obsDropped |
a list of the row names for observations dropped for various reasons (missing data). |
trgRows |
a list of the row names for target observations as a subset of all observations. |
xall |
the X-variables for all observations. |
neiDstTrgs |
a matrix of distances between a target (identified by its row name) and the k references. There are k columns. |
neiIdsTrgs |
a matrix of reference identifications that correspond to neiDstTrgs. |
neiDstRefs |
set NULL, as if noRefs=TRUE had been specified in the original call to yai. |
neiIdsRefs |
set NULL, as if noRefs=TRUE had been specified in the original call to yai. |
k |
the value of k, replaced if changed. |
ann |
the value of the ann argument. |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
require (yaImpute)
data(iris)
# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)
# form some test data
refs=sample(rownames(iris),50)  # just the reference observations
x <- iris[refs,1:3]  # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species
# build a yai object using mahalanobis
mal <- yai(x=x,y=y,method="mahalanobis")
# get imputations for the target observations (not references)
malNew <- newtargets(mal,iris[!(rownames(iris) %in% rownames(x)),])
# output a data frame of observed and imputed values for
# the observations that are not in the original yai object
impute(malNew,vars=yvars(malNew))
# in this example, Y is not specified (not required for mahalanobis).
mal2 <- yai(x=x,method="mahalanobis")
identical(foruse(mal),foruse(mal2))
if (require(randomForest)) {
  # here, method randomForest's unsupervised classification is used (no Y).
  rf <- yai(x=x,method="randomForest")
  # now get imputations for the targets in the iris data (those that are
  # not references).
  rfNew <- newtargets(rf,iris[!(rownames(iris) %in% rownames(x)),])
}
This routine identifies observations with large errors as measured by scaled
root mean square error (see rmsd.yai
). A threshold
is used to detect observations with large differences.
notablyDifferent(object,vars=NULL,threshold=NULL,p=.05,...)
object |
an object of class |
vars |
a vector of character strings naming the variables to use; if
NULL, the X-variables from the object are used. |
threshold |
a threshold that, when exceeded, causes an observation to be listed as notably different. |
p |
the probability used to compute the threshold when threshold is NULL (see details). |
... |
additional arguments passed to |
The scaled differences are computed as follows:
A matrix of differences between observed and imputed values is computed for each observation (rows) and each variable (columns).
These differences are scaled by dividing by the standard deviation of the observed values among the reference observations.
The scaled differences are squared.
Row means are computed resulting in one value for each observation.
The square root of each of these values is taken.
These values are Euclidean distances between the target observations and their nearest references as measured using specified variables. All the variables that are used must have observed and imputed values. Generally, this will be the X-variables and not the Y-variables.
When threshold is NULL, the function computes one using the quantile function with its default arguments and probs=1-p.
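The steps above amount to only a few lines of R. The following is an illustrative sketch of the described computation, not the package's internal code; obs and imp stand for assumed matrices of observed and imputed values with matching rows and columns:

# illustrative sketch only, not yaImpute's internal code
scaledDiffs <- function(obs, imp, p=0.05) {
  d <- obs - imp                               # differences per observation and variable
  d <- sweep(d, 2, apply(obs, 2, sd), "/")     # scale by std dev of observed values
  rmsdS <- sqrt(rowMeans(d^2))                 # one scaled RMSD per observation
  threshold <- quantile(rmsdS, probs=1-p)      # default threshold
  sort(rmsdS[rmsdS > threshold], decreasing=TRUE)  # the notably different ones
}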
A named list of several items. In all cases vectors are named using the observation
ids which are the row names of the data used to build the yai
object.
call |
The call. |
vars |
The variables used (may be fewer than requested). |
threshold |
The threshold value. |
notablyDifferent.refs |
A sorted named vector of references that exceed the threshold. |
notablyDifferent.trgs |
A sorted named vector of targets that exceed the threshold. |
rmsdS.refs |
A sorted named vector of scaled RMSD references. |
rmsdS.trgs |
A sorted named vector of scaled RMSD targets. |
Nicholas L. Crookston [email protected]
notablyDistant
, plot.notablyDifferent
,
yai
, grmsd
data(iris)
set.seed(12345)
# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species
# build an msn run, first build dummy variables for species.
sp1 <- as.integer(iris$Species=="setosa")
sp2 <- as.integer(iris$Species=="versicolor")
y2 <- data.frame(cbind(iris[,4],sp1,sp2),row.names=rownames(iris))
y2 <- y2[refs,]
names(y2) <- c("Petal.Width","Sp1","Sp2")
msn <- yai(x=x,y=y2,method="msn")
notablyDifferent(msn)
Notably distant targets are those with relatively large distances from the closest reference observation. A suitable threshold is used to detect large distances.
notablyDistant(object,kth=1,threshold=NULL,p=0.01,method="distribution")
object |
an object of class |
kth |
the kth neighbor is used. |
threshold |
the threshold distance that identifies notably large distances between observations. |
p |
the probability used to compute the threshold when threshold is NULL; a fraction p of the distances is expected to exceed it (see details). |
method |
the method used to compute the threshold, see details. |
When threshold is NULL, the function computes one using one of
two methods. When method is "distribution", the assumption is made that
distances follow the lognormal distribution, unless the method used
to find neighbors is randomForest, in which case the distances
are assumed to follow the beta distribution. A specified p value
is used to compute the threshold, which is the point in the distribution
where a fraction, p, of the neighbors are larger than the threshold.
When method is "quantile", the function uses the quantile
function with probs=1-p.
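As a rough illustration, both threshold choices can be mimicked in base R. This is a sketch under the assumption that the lognormal is fit from the log-scale mean and standard deviation; the package's exact fitting procedure may differ:

d <- c(0.21, 0.35, 0.28, 0.90, 0.44, 0.52)  # hypothetical kth-neighbor distances
p <- 0.01
qlnorm(1-p, meanlog=mean(log(d)), sdlog=sd(log(d)))  # method="distribution"
quantile(d, probs=1-p)                               # method="quantile"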
A list containing 1) a data frame of the references that are notably distant from other references, 2) a data frame of the targets that are notably distant from the references, 3) the threshold used, and 4) the method used.
Nicholas L. Crookston [email protected]
data(iris)
set.seed(12345)
# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species
# build an msn run, first build dummy variables for species.
sp1 <- as.integer(iris$Species=="setosa")
sp2 <- as.integer(iris$Species=="versicolor")
y2 <- data.frame(cbind(iris[,4],sp1,sp2),row.names=rownames(iris))
y2 <- y2[refs,]
names(y2) <- c("Petal.Width","Sp1","Sp2")
msn <- yai(x=x,y=y2,method="msn")
notablyDistant(msn)
Provides a matrix of plots for objects created by compare.yai
.
## S3 method for class 'compare.yai'
plot(x,pointColor=1,lineColor=2,...)
x |
a data frame created by |
pointColor |
the color used for the points. |
lineColor |
the color of the 1:1 line. |
... |
passed to plot functions. |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
yai
, compare.yai
,
impute.yai
, rmsd.yai
Provides a descriptive plot of the Imputation Error Profile for object(s)
created by notablyDifferent
.
## S3 method for class 'notablyDifferent'
plot(x,add=FALSE,...)
x |
an object created by notablyDifferent, or a list of such objects (as in the example below). |
add |
set TRUE to add this plot to an existing plot. |
... |
passed to plot functions. |
Nicholas L. Crookston [email protected]
notablyDistant
and yai
require(yaImpute)
data(iris)
set.seed(12345)
# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species
mal <- notablyDifferent(yai(x=x,y=y,method="mahalanobis"),vars=colnames(x))
if (require(randomForest)) {
  rf <- notablyDifferent(yai(x=x,y=y,method="randomForest"),vars=colnames(x))
  plot.notablyDifferent(list(Mahalanobis=mal,randomForest=rf))
}
Provides a descriptive plot of how the mean Mahalanobis distances change
as variables are added or deleted using varSelection
.
## S3 method for class 'varSel'
plot(x,main=NULL,nbest=NULL,arrows=TRUE,...)
x |
an object created by |
main |
becomes the plot title, if NULL one is generated |
nbest |
number of variables designated in the plot as the best;
if null the number is computed by |
arrows |
if true, an arrow is added to the plot designating the best variables. |
... |
passed to boxplot functions |
Nicholas L. Crookston [email protected]
varSelection
and yai
require(yaImpute)
data(iris)
set.seed(12345)
x <- iris[,1:2]  # Sepal.Length Sepal.Width
y <- iris[,3:4]  # Petal.Length Petal.Width
vsel <- varSelection(x=x,y=y,nboot=5,useParallel=FALSE)
plot(vsel)
Provides a matrix of plots of observed versus imputed values for variables in an
object created by impute.yai
, which are of class
c("impute.yai","data.frame")
.
## S3 method for class 'yai'
plot(x,vars=NULL,pointColor=1,lineColor=2,spineColor=NULL,residual=FALSE,...)
x |
an object of class c("impute.yai","data.frame") created by impute.yai. |
vars |
a list of variable names you want to include, if NULL all available Y-variables are included. |
pointColor |
a color vector for the xy plots (continuous variables). |
lineColor |
the color of the 1:1 lines in xy plots. |
spineColor |
a color vector for the spine plots (factors), one value per level. |
residual |
plots in a residual format (observed-imputed over imputed). |
... |
passed to called functions. |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
require(yaImpute)
data(iris)
# form some test data
refs=sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species
mal <- yai(x=x,y=y,method="mahalanobis")
malImp=impute(mal,newdata=iris)
plot(malImp)
Provides a generic interface for getting predicted values for yai
objects.
## S3 method for class 'yai'
predict(object,newdata,...)
object |
an object of class |
newdata |
a data frame that at a minimum contains the X-variables for new observations. |
... |
passed to |
When argument newdata is present, function newtargets is called followed by a
call to impute.yai. If included in the ..., the arguments k and ann
are passed to newtargets.
When argument newdata is absent, impute.yai is called without first calling
newtargets.
All of the ... arguments are passed to impute.yai.
Another form of prediction in imputation is to get the identity of the imputed observations.
Use function foruse
for this purpose.
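A minimal usage sketch, assuming the iris-based setup used elsewhere in this document:

require(yaImpute)
data(iris)
set.seed(12345)
refs=sample(rownames(iris),50)
mal <- yai(x=iris[,1:3],y=iris[refs,4:5],method="mahalanobis")
preds <- predict(mal,newdata=iris)  # calls newtargets() then impute.yai()
head(preds)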
An object of class impute.yai
.
Nicholas L. Crookston [email protected]
Provides a summary of a yai
object, showing the
call and essential data customized for each method used.
## S3 method for class 'yai'
print(x,...)
x |
an object of class |
... |
not used |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
Computes the root mean square difference (RMSD) between observed and imputed values for each observation that has both. RMSD is computationally like RMSE, but they differ in interpretation. The RMSD values can be scaled to afford comparisons among variables.
rmsd.yai (object,vars=NULL,scale=FALSE,...)
object |
an object created by |
vars |
a list of variable names you want to include, if NULL all available variables are included |
scale |
when |
... |
passed to called methods, very useful for passing argument
|
By default, RMSD is computed using the standard formula for its related statistic,
RMSE. When scale=TRUE, or a set of scaling values is supplied, RMSD is divided by the
scaling factor. The scaling factor is the standard deviation of the
reference observations under the assumption that they are representative
of the population.
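For a single variable the computation reduces to two lines; a sketch with hypothetical values:

obs <- c(10.2,11.5,9.8,12.1)      # hypothetical observed values
imp <- c(10.0,11.9,9.5,12.4)      # hypothetical imputed values
rmsd  <- sqrt(mean((obs-imp)^2))  # root mean square difference
rmsdS <- rmsd/sd(obs)             # scaled by the reference standard deviation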
A data frame with the variable names as row names and a single column named rmsd
. When
scale=TRUE
, the column name is rmsdS
. The scaling factors used, if any,
are returned as an attribute.
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
yai
, impute.yai
and doi:10.18637/jss.v023.i10.
Polygon-based reference data used by Stage and Crookston (2007) to demonstrate partitioning of error components and related statistics. Observations are summaries of data collected on forest stands (polygons).
data(TallyLake)
A data frame with 847 rows and 29 columns:
Ground based measurements of trees (Y-variables):
TopHt - Height of tallest trees (ft)
LnVolL - Log of the volume of western larch
LnVolDF - Log of the volume of Douglas-fir
LnVolLP - Log of the volume of lodgepole pine
LnVolES - Log of the volume of Engelmann spruce
LnVolAF - Log of the volume of alpine fir
LnVolPP - Log of the volume of ponderosa pine
CCover - Canopy cover (percent)
Geographic Location, Slope, and Aspect (X-variables):
utmx - UTM easting at plot center
utmy - UTM northing at plot center
elevm - Mean elevation (ft) above sea level over plot
eevsqrd -
slopem - Mean slope (percent) over plot
slpcosaspm - Mean of slope (proportion) times the cosine of aspect (see Stage (1976) for a description of this transformation)
slpsinaspm - Mean of slope (proportion) times the sine of aspect
Additional X-variables:
ctim - Mean of slope curvature over pixels in stand
tmb1m - Mean of Landsat band 1 over pixels in stand
tmb2m - Mean of Landsat band 2 over pixels in stand
tmb3m - Mean of Landsat band 3 over pixels in stand
tmb4m - Mean of Landsat band 4 over pixels in stand
tmb5m - Mean of Landsat band 5 over pixels in stand
tmb6m - Mean of Landsat band 6 over pixels in stand
durm - Mean of light duration over pixels in stand
insom - Mean of solar insolation over pixels in stand
msavim - Mean of AVI for pixels in stand
ndvim - Mean of NDVI for pixels in stand
crvm - Mean of slope curvature for pixels in stand
tancrvm - Mean of tangent curvature for pixels in stand
tancrvsd - Standard deviation of tangent curvature for pixels in stand
USDA Forest Service
Stage, A.R.; Crookston, N.L. 2007. Partitioning error components for accuracy-assessment of near neighbor methods of imputation. For. Sci. 53(1):62-72 https://academic.oup.com/forestscience/article/53/1/62/4604364
Stage, A.R. (1976). An expression for the effect of aspect, slope, and habitat type on tree growth. For. Sci. 22(4):457-460. https://academic.oup.com/forestscience/article-abstract/22/4/457/4675852
Takes any combination of several data frames or matrices and creates a new data frame. The rows are defined by a union of all row names in the arguments, and the columns are defined by a union of all column names in the arguments. The data are loaded into this new frame where column and row names match the individual inputs. Duplicates are tolerated with the last one specified being the one kept. NAs are returned for combinations of rows and columns where no data exist. Factors are processed as necessary.
unionDataJoin(...,warn=TRUE)
... |
a list of data frames, matrices, or any combination. |
warn |
when TRUE, warn when a column name is found in more than one data source. |
A data frame.
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
require(yaImpute)
d1=data.frame(x1=c("a","b","c","d","e","f"))
d2=data.frame(x1=as.character(seq(1,4)),row.names=seq(5,8))
d3=data.frame(x2=seq(1:10))
# note the levels
levels(d1$x1)  # [1] "a" "b" "c" "d" "e" "f"
levels(d2$x1)  # [1] "1" "2" "3" "4"
all=unionDataJoin(d1,d2,d3,warn=FALSE)
all
#      x1  x2
# 1     a   1
# 2     b   2
# 3     c   3
# 4     d   4
# 5     1   5
# 6     2   6
# 7     3   7
# 8     4   8
# 9  <NA>   9
# 10 <NA>  10
levels(all$x1)  # [1] "1" "2" "3" "4" "a" "b" "c" "d"
Provides a character vector, or a list of character vectors
of all the variables in a yai
object,
just the X-variables (xvars
), or just the Y-variables (yvars
).
vars(object)
xvars(object)
yvars(object)
object |
an object created by |
yvars |
A character vector of Y-variables. |
xvars |
A character vector of X-variables. |
vars |
A list of both vectors. |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
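A minimal usage sketch, reusing the iris-based setup from the other examples in this document:

require(yaImpute)
data(iris)
refs=sample(rownames(iris),50)
mal <- yai(x=iris[,1:3],y=iris[refs,4:5],method="mahalanobis")
yvars(mal)  # the Y-variable names
xvars(mal)  # the X-variable names
vars(mal)   # a list holding both vectors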
Computes grmsd
(generalized root mean square distance)
as variables are added to (method="addVars") or removed from
(method="delVars") a k-NN imputation model. When adding variables,
the function keeps those that strengthen the imputation; when deleting
variables, it removes those whose removal weakens the imputation the least.
The measure of model strength is grmsd between
imputed and observed Y-variables among the reference observations.
varSelection(x,y,method="addVars",yaiMethod="msn",imputeMethod="closest",
  wts=NULL,nboot=20,trace=FALSE,
  useParallel=if (.Platform$OS.type == "windows") FALSE else TRUE,...)
x |
a set of X-Variables as used in |
y |
a set of Y-Variables as used in |
method |
if |
yaiMethod |
passed as |
imputeMethod |
passed as |
wts |
passed as argument |
nboot |
the number of bootstrap samples used at each variable selection step (see Details). When nboot is zero, NO bootstrapping is done. |
trace |
if |
useParallel |
function |
... |
passed to |
This function tracks the effect on generalized root mean square distance
(see grmsd
) when variables are added or deleted one at a time.
When adding variables, the function starts with none, and keeps the single
variable that provides the smallest grmsd
. When deleting variables,
the function starts with all X-Variables and deletes them one at a
time such that those that remain provide the smallest
grmsd
. The function uses the following steps:
Function yai
is run for all the Y-variables and candidate
X-variable(s). The result is passed to impute.yai
to get imputed
values of Y-variables. That result is passed to grmsd
to compute a
mean Mahalanobis distance for the case where the candidate variable is included
(or deleted depending on method
). However, these steps are done once
for each bootstrap replication and the resulting values are averaged to provide
an average mean Mahalanobis distance over the bootstraps.
Step one is done for each candidate X-variable forming a vector of
grmsd
values, one corresponding to the case where each candidate
is added or deleted.
When variables are being added (method="addVars"
), the variable that
is related to the smallest grmsd
is kept. When variables are being
deleted (method="delVars"
), the variable that
is related to the largest grmsd
is deleted.
Once a variable has been added or deleted, the function proceeds to consider all remaining variables for the next addition or deletion.
A list of class varSel
with these tags:
call |
the call |
grmsd |
a 2-column matrix of the mean and std dev of the mean Mahalanobis distances associated with adding or removing the variables stored as the rownames. When nboot<2, the std dev are NA |
allgrmsd |
a list of the grmsd values that correspond to each bootstrap replication. The data in grmsd are based on these vectors of information. |
method |
the value of argument |
Nicholas L. Crookston [email protected]
yai
, impute.yai
, bestVars
and
grmsd
data(iris)
set.seed(12345)
x <- iris[,1:2]  # Sepal.Length Sepal.Width
y <- iris[,3:4]  # Petal.Length Petal.Width
vsel <- varSelection(x=x,y=y,nboot=5,useParallel=FALSE)
vsel
bestVars(vsel)
plot(vsel)
For each row, the function identifies the column that has the maximum value. The function returns a data frame with two columns: the first is the column name corresponding to the column of maximum value and the second is the corresponding maximum. The first column is converted to a factor.
If the maximum is zero, the maximum column is identified as "zero".
If there are over nbig
factors in column 1, the maximum values
that are less than the largest are combined and identified as "other".
Intended use is to transform community ecology data for use in yai
where method is randomForest.
whatsMax(x,nbig=30)
x |
a data frame or matrix of numeric values. |
nbig |
see description–the maximum number of factors, the remainder called 'other'. |
A data frame.
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
data(MoscowMtStJoe)
# get the basal area by species columns
yba <- MoscowMtStJoe[,1:17]
# for each row, pick the species that has the max basal area
# create "other" for those not in the top 7.
ybaB <- whatsMax(yba,nbig=7)
levels(ybaB[,1])
Given a set of observations, yai 1) separates the observations into
reference and target observations, 2) applies the specified method to
project X-variables into a Euclidean space (not always, see argument
method), and 3) finds the k-nearest neighbors within the reference
observations and between the reference and target observations.
An alternative method using randomForest classification and regression
trees is provided for steps 2 and 3.
Target observations are those with values for X-variables and
not for Y-variables, while reference observations are those
with no missing values for X- and Y-variables (see Details for the
exception).
yai(x=NULL,y=NULL,data=NULL,k=1,noTrgs=FALSE,noRefs=FALSE,
  nVec=NULL,pVal=.05,method="msn",ann=TRUE,mtry=NULL,ntree=500,
  rfMode="buildClasses",bootstrap=FALSE,ppControl=NULL,sampleVars=NULL,
  rfXsubsets=NULL,oob=FALSE)
x |
1) a matrix or data frame containing the X-variables for all
observations, where the row names identify the observations, or 2) a
one-sided formula defining the X-variables as a linear formula. If
a formula is coded for |
y |
1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula. |
data |
when |
k |
the number of nearest neighbors; default is 1. |
noTrgs |
when TRUE, skip finding neighbors for target observations. |
noRefs |
when TRUE, skip finding neighbors for reference observations. |
nVec |
number of canonical vectors to use (methods |
pVal |
significance level for canonical vectors, used when |
method |
is the strategy used for computing distance and therefore for finding neighbors; the options are quoted key words (see details):
|
ann |
TRUE if |
mtry |
the number of X-variables picked at random when method is |
ntree |
the number of classification and regression trees when method is |
rfMode |
when |
bootstrap |
if |
ppControl |
used to control how canonical correlation analysis via projection pursuit is done, see Details. |
sampleVars |
the X- and/or Y-variables will be sampled (without replacement) if this is not NULL and greater than zero. If specified as a single unnamed value, that value is used to control the sample size of both X and Y variables. If two unnamed values, then the first is taken for X-variables and the second for Y-variables. If zero, no sampling is done. Otherwise, if values are less than 1.0 they are taken as the proportion of the number of variables. Values greater than or equal to 1 are the number of variables to be included in the sample. Specification of a large number will cause the sequence of variables to be randomized. |
rfXsubsets |
a named list of character vectors where there is one vector for each
Y-variable, see details, only applies when |
oob |
Boolean; when TRUE, keep out-of-bag (OOB) information in the randomForest models, default FALSE. This is useful
for calculating out-of-bag evaluation statistics for the models |
See the paper at doi:10.18637/jss.v023.i10 (it includes examples).
The following information is in addition to the content in the papers.
You need not have any Y-variables to run yai for the following methods:
euclidean
, raw
, mahalanobis
, ica
, random
, and
randomForest
(in which case unsupervised classification is
performed). However, normally yai
classifies reference
observations as those with no missing values for X- and Y- variables and
target observations are those with values for X- variables and
missing data for Y-variables. When Y is NULL (there are no Y-variables),
all the observations are considered references. See
newtargets
for an example of how to use yai in this
situation.
When bootstrap=TRUE
the reference observations are sampled with replacement. The
sample size is set to the number of reference observations. Normally, about a third of
the reference observations are left out of the sample; they are often called out-of-bag
samples. The out-of-bag observations are then treated as targets.
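The following sketch illustrates the idea (it is not yai's internal implementation):

refs <- paste0("ref",1:9)                        # hypothetical reference ids
inbag <- sample(refs,length(refs),replace=TRUE)  # bootstrap sample
oob <- setdiff(refs,inbag)  # roughly a third are out-of-bag; treated as targets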
When method="msnPP", projection pursuit from ccaPP is used. The method is
further controlled using argument ppControl to specify a character vector that
has two named components.
method - One of the following
"spearman", "kendall", "quadrant", "M", "pearson"
, default is "spearman"
search - If "data"
or "proj"
, then ccaProj
is used, otherwise the default ccaGrid
is used.
Here are some details on argument rfXsubsets
. When method="randomForest"
one call to randomForest
is generated for each Y-variable. When
argument rfXsubsets
is left NULL
, all the X-variables are used for each of
the Y-variables. However, sometimes better results can be achieved by using specific subsets
of X-variables for each Y-variable. This is done by setting rfXsubsets
equal
to a named list of character vectors. The names correspond to the Y-variable names and the
character vectors hold the list of X-variables for the corresponding Y-variable.
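A hedged sketch of the rfXsubsets form described above, using the iris variable names from examples elsewhere in this document:

require(yaImpute)
if (require(randomForest)) {
  data(iris)
  set.seed(12345)
  refs=sample(rownames(iris),50)
  x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
  y <- iris[refs,4:5]  # Petal.Width Species
  # one X-variable subset per Y-variable name:
  rfXsub <- list(Petal.Width=c("Sepal.Length","Sepal.Width"),
                 Species=c("Sepal.Length","Petal.Length"))
  rf <- yai(x=x,y=y,method="randomForest",rfXsubsets=rfXsub)
}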
An object of class yai
, which is a list with
the following tags:
call |
the call. |
yRefs , xRefs
|
matrices of the X- and Y-variables for just the reference observations (unscaled). The scale factors are attached as attributes. |
obsDropped |
a list of the row names for observations dropped for various reasons (missing data). |
trgRows |
a list of the row names for target observations as a subset of all observations. |
xall |
the X-variables for all observations. |
cancor |
returned from cancor function when method |
ccaVegan |
an object of class cca (from package vegan) when method gnn is used. |
ftest |
a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise). |
yScale , xScale
|
scale data used on yRefs and xRefs as needed. |
k |
the value of k. |
pVal |
as input; only used when method |
projector |
NULL when not used. For methods |
nVec |
number of canonical vectors used (methods |
method |
as input, the method used. |
ranForest |
a list of the forests if method |
ICA |
a list of information from |
ann |
the value of ann, TRUE when |
xlevels |
NULL if no factors are used as predictors; otherwise a list
of predictors that have factors and their levels (see |
neiDstTrgs |
a matrix of distances between a target (identified by its row name) and the k references. There are k columns. |
neiIdsTrgs |
a matrix of reference identifications that correspond to neiDstTrgs. |
neiDstRefs , neiIdsRefs
|
counterparts for references. |
bootstrap |
a vector of reference rownames that constitute the bootstrap sample;
or the value |
Nicholas L. Crookston [email protected]
John Coulston [email protected]
Andrew O. Finley [email protected]
require (yaImpute)
data(iris)
# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)
# form some test data, y's are defined only for reference
# observations.
refs=sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width
# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
# compare these results using the generalized mean distances. mal wins!
grmsd(mal,msn)
# use projection pursuit and specify ppControl (loads package ccaPP)
if (require(ccaPP)) {
  msnPP <- yai(x=x,y=y,method="msnPP",ppControl=c(method="kendall",search="proj"))
  grmsd(mal,msnPP,msn)
}
#############
data(MoscowMtStJoe)
# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01       # slope proportion
polar[,2] <- polar[,2]*(pi/180)  # aspect radians
cartesian <- t(apply(polar,1,function (x)
  {return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]
msn <- yai(x=x, y=y, method="msn", k=1)
mal <- yai(x=x, y=y, method="mahalanobis", k=1)
# the results can be plotted.
plot(mal,vars=yvars(mal)[1:16])
# compare these results using the generalized mean distances.
grmsd(mal,msn)
# try method="gower"
if (require(gower)) {
  gow <- yai(x=x, y=y, method="gower", k=1)
  # compare these results using the generalized mean distances.
  grmsd(mal,msn,gow)
}
# try method="randomForest"
if (require(randomForest)) {
  # reduce the plant community data for randomForest.
  yba <- MoscowMtStJoe[,1:17]
  ybaB <- whatsMax(yba,nbig=7)  # see help on whatsMax
  rf <- yai(x=x, y=ybaB, method="randomForest", k=1)
  # build the imputations for the original y's
  rforig <- impute(rf,ancillaryData=y)
  # compare the results using individual rmsd's
  compare.yai(mal,msn,rforig)
  plot(compare.yai(mal,msn,rforig))
  # build another randomForest case forcing regression
  # to be used for continuous variables. The answers differ
  # but one is not clearly better than the other.
  rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
  rforig2 <- impute(rf2,ancillaryData=y)
  compare.yai(rforig2,rforig)
}
When method randomForest
is used to build a yai
object, the randomForest
package computes
several statistics. This function summarizes some of them, including
the variable importance scores computed by function yaiVarImp
.
yaiRFsummary(object, nTop=0)
object |
an object of class |
nTop |
the |
A list containing:
forestAttributes |
a data frame reporting the error rates and other data from the randomForest(s). |
scaledImportance |
the data frame computed by |
Nicholas L. Crookston [email protected]
Andrew O. Finley [email protected]
When method randomForest
is used to build a yai
object, the randomForest
package computes
variable importance scores. This function computes a composite of the
scores and scales them using scale
. By default the
scores are plotted and scores themselves are invisibly returned. For
classification, the scores are derived from "MeanDecreaseAccuracy"
and for regression they are based on the corresponding regression importance measure, using importance.
yaiVarImp(object, nTop=20, plot=TRUE, ...)
object |
an object of class |
nTop |
the |
plot |
if FALSE, no plotting is done, but the scores are returned. |
... |
passed to the |
A data frame with the rows corresponding to the randomForest
built for each Y-variable and the columns corresponding to the
nTop most important X-variables in sorted order.
Nicholas L. Crookston [email protected]
yai
, yaiRFsummary
, compare.yai
if (require(randomForest)) {
  data(MoscowMtStJoe)
  # get the basal area by species columns
  yba <- MoscowMtStJoe[,1:17]
  ybaB <- whatsMax(yba,nbig=7)  # see help on whatsMax
  ba <- cbind(ybaB,TotalBA=MoscowMtStJoe[,18])
  x <- MoscowMtStJoe[,37:64]
  x <- x[,-(4:5)]
  rf <- yai(x=x,y=ba,method="randomForest")
  yaiVarImp(rf)
  keep=colnames(yaiVarImp(rf,plot=FALSE,nTop=9))
  newx <- x[,keep]
  rf2 <- yai(x=newx,y=ba,method="randomForest")
  yaiVarImp(rf2,col="gray")
  compare.yai(rf,rf2)
}