This thesis presents a method for distributed multivariate regression using wavelet-based
Collective Data Mining (CDM). The method seamlessly blends machine
learning and the theory of communication with the statistical methods employed
in parametric multivariate regression to provide an effective data mining technique
for use in a distributed data and computation environment. The technique is applied
to two benchmark data sets, producing results that are consistent with those
obtained by applying standard parametric regression techniques to centralized data
sets. Evaluation of the method in terms of model accuracy as a function of the
appropriateness of the selected wavelet function, the relative number of nonlinear
cross-terms, and the sample size demonstrates that accurate parametric multivariate regression models
can be generated from distributed, heterogeneous data sets with minimal data
communication overhead compared to that required to aggregate a distributed data
set. Application of this method to Linear Discriminant Analysis, which is related
to parametric multivariate regression, produced classification results on the Iris data
set that are comparable to those obtained with centralized data analysis.