21.11 Remove Numbers

# Strip all digits from every document in the corpus.
docs <- tm_map(docs, removeNumbers)
viewDocs(docs, 16)
## hybrid weighted random forests for
## classifying very high-dimensional data
## baoxun xu , joshua zhexue huang , graham williams and
## yunming ye
## 
## 
## department of computer science, harbin institute of technology shenzhen graduate
## school, shenzhen , china
## 
## shenzhen institutes of advanced technology, chinese academy of sciences, shenzhen
## , china
## email: amusing gmail.com
## random forests are a popular classification method based on an ensemble of a
## single type of decision trees from subspaces of data. in the literature, there
## are many different types of decision tree algorithms, including c., cart, and
## chaid. each type of decision tree algorithm may capture different information
## and structure. this paper proposes a hybrid weighted random forest algorithm,
## simultaneously using a feature weighting method and a hybrid forest method to
## classify very high dimensional data. the hybrid weighted random forest algorithm
## can effectively reduce subspace size and improve classification performance
## without increasing the error bound. we conduct a series of experiments on eight
## high dimensional datasets to compare our method with traditional random forest
## methods and other classification methods. the results show that our method
## consistently outperforms these traditional methods.
## keywords: random forests; hybrid weighted random forest; classification; decision tree;
## 
## .
## 
## introduction
## 
## random forests [, ] are a popular classification
## method which builds an ensemble of a single type
## of decision trees from different random subspaces of
## data. the decision trees are often either built using
## c. [] or cart [], but only one type within
## a single random forest. in recent years, random
## forests have attracted increasing attention due to
## () its competitive performance compared with other
## classification methods, especially for high-dimensional
## data, () algorithmic intuitiveness and simplicity, and
## () its most important capability - "ensemble" using
## bagging [] and stochastic discrimination [].
## several methods have been proposed to grow random
## forests from subspaces of data [, , , , , , ]. in
## these methods, the most popular forest construction
## procedure was proposed by breiman [] to first use
## bagging to generate training data subsets for building
## individual trees.
## a subspace of features is then
## randomly selected at each node to grow branches of
## a decision tree. the trees are then combined as an
## ensemble into a forest. as an ensemble learner, the
## performance of a random forest is highly dependent
## on two factors: the performance of each tree and the
## diversity of the trees in the forests []. breiman
## formulated the overall performance of a set of trees as
## the average strength and proved that the generalization
## 
## error of a random forest is bounded by the ratio of the
## average correlation between trees divided by the square
## of the average strength of the trees.
## for very high dimensional data, such as text data,
## there are usually a large portion of features that are
## uninformative to the classes. during this forest building
## process, informative features would have the large
## chance to be missed, if we randomly select a small
## subspace (breiman suggested selecting log (m ) + 
## features in a subspace, where m is the number of
## independent features in the data) from high dimensional
## data []. as a result, weak trees are created from these
## subspaces, the average strength of those trees is reduced
## and the error bound of the random forest is enlarged.
## therefore, when a large proportion of such "weak"
## trees are generated in a random forest, the forest has a
## large likelihood to make a wrong decision which mainly
## results from those "weak" trees' classification power.
## to address this problem, we aim to optimize decision
## trees of a random forest by two strategies. one
## straightforward strategy is to enhance the classification
## performance of individual trees by a feature weighting
## method for subspace sampling [, , ]. in this
## method, feature weights are computed with respect
## to the correlations of features to the class feature
## and regarded as the probabilities of the feature to
## be selected in subspaces. this method obviously
## increases the classification performance of individual
## 
## trees because the subspaces will be biased to contain
## more informative features. however, the chance of more
## correlated trees is also increased since the features with
## large weights are likely to be repeatedly selected.
## the second strategy is more straightforward: use
## several different types of decision trees for each training
## data subset, to increase the diversity of the trees,
## and then select the optimal tree as the individual
## tree classifier in the random forest model. the work
## presented here extends the algorithm developed in [].
## specifically, we build three different types of tree
## classifiers (c., cart, and chaid [, ]) for each
## training data subset. we then evaluate the performance
## of the three classifiers and select the best tree. in
## this way, we build a hybrid random forest which may
## include different types of decision trees in the ensemble.
## the added diversity of the decision trees can effectively
## improve the accuracy of each tree in the forest, and
## hence the classification performance of the ensemble.
## however, when we use this method to build the best
## random forest model for classifying high dimensional
## data, we can not be sure of what subspace size is best.
## in this paper, we propose a hybrid weighted random
## forest algorithm by simultaneously using a new feature
## weighting method together with the hybrid random
## forest method to classify high dimensional data. in
## this new random forest algorithm, we calculate feature
## weights and use weighted sampling to randomly select
## features for subspaces at each node in building different
## types of trees classifiers (c., cart, and chaid) for
## each training data subset, and select the best tree as
## the individual tree in the final ensemble model.
## experiments were performed on  high dimensional
## text datasets with dimensions ranging from  to
## . we compared the performance of eight random
## forest methods and well-known classification methods:
## c. random forest, cart random forest, chaid
## random forest, hybrid random forest, c. weighted
## random forest, cart weighted random forest, chaid
## weighted random forest, hybrid weighted random
## forest, support vector machines [], naive bayes [],
## and k-nearest neighbors [].
## the experimental
## results show that our hybrid weighted random forest
## achieves improved classification performance over the
## ten competitive methods.
## the remainder of this paper is organized as follows.
## in section , we introduce a framework for building
## a hybrid weighted random forest, and describe a new
## random forest algorithm. section  summarizes four
## measures to evaluate random forest models. we present
## experimental results on  high dimensional text datasets
## in section . section  contains our conclusions.
## ...

Numbers may or may not be relevant to our analyses. The removeNumbers() transform simply strips every digit from the documents: in the output above the citation numbers, years, and section numbers have all disappeared, leaving empty brackets and truncated tokens such as "c." behind.
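As a quick sanity check we can confirm that no digits survive in the document just viewed. This is a minimal sketch, assuming docs is the corpus built earlier in this chapter:

# Check whether any line of document 16 still contains a digit
# after removeNumbers(); this should now return FALSE.
any(grepl("[0-9]", content(docs[[16]])))

If finer control is ever needed, for example removing only standalone numbers while keeping digits that are embedded in identifiers, a custom transform could be used in place of removeNumbers(). The following sketch again assumes the same docs corpus; removeStandaloneNumbers is simply a name introduced here for illustration:

# Replace digit-only tokens with a space, leaving digits that form
# part of a larger word (such as product codes) untouched.
removeStandaloneNumbers <- content_transformer(function(x)
  gsub("\\b[0-9]+\\b", " ", x))
docs <- tm_map(docs, removeStandaloneNumbers)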


