## 21.10 Conversion to Lower Case

`docs <- tm_map(docs, content_transformer(tolower))`
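To see what this transformation does in isolation, the following is a minimal sketch on a small toy corpus. The corpus text and the names `toy`, `toy_lower`, and `lowered` are assumptions made for illustration and are not part of the chapter's `docs` object. Base R's `tolower()` works on character vectors, so it is wrapped in `content_transformer()`, which lets `tm_map()` update each document's content while keeping the document object itself.

```r
library(tm)

# Toy corpus (hypothetical text, not the chapter's docs object).
toy <- VCorpus(VectorSource(c("Hybrid Weighted Random Forests",
                              "C4.5, CART, and CHAID")))

# content_transformer() wraps a plain character function so that
# tm_map() modifies the content of each document and returns documents.
toy_lower <- tm_map(toy, content_transformer(tolower))

# Pull the text of each document back out as a character vector.
lowered <- vapply(toy_lower, as.character, character(1))
print(unname(lowered))
```

Only letters change: digits and punctuation such as `4.5` and the commas pass through untouched, so lower-casing can be applied before or after the punctuation and number transformations.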

`inspect(docs[16])`

```
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## hwrf12.txt
## hybrid weighted random forests for\nclassifying very high-dimensional data\nbaoxun xu1 , joshua zhexue huang2 , graham williams2 and\nyunming ye1\n1\n\ndepartment of computer science, harbin institute of technology shenzhen graduate\nschool, shenzhen 518055, china\n2\nshenzhen institutes of advanced technology, chinese academy of sciences, shenzhen\n518055, china\nemail: amusing002 gmail.com\nrandom forests are a popular classification method based on an ensemble of a\nsingle type of decision trees from subspaces of data. in the literature, there\nare many different types of decision tree algorithms, including c4.5, cart, and\nchaid. each type of decision tree algorithm may capture different information\nand structure. this paper proposes a hybrid weighted random forest algorithm,\nsimultaneously using a feature weighting method and a hybrid forest method to\nclassify very high dimensional data. the hybrid weighted random forest algorithm\ncan effectively reduce subspace size and improve classification performance\nwithout increasing the error bound. we conduct a series of experiments on eight\nhigh dimensional datasets to compare our method with traditional random forest\nmethods and other classification methods. the results show that our method\nconsistently outperforms these traditional methods.\nkeywords: random forests; hybrid weighted random forest; classification; decision tree;\n\n1.\n\nintroduction\n\nrandom forests [1, 2] are a popular classification\nmethod which builds an ensemble of a single type\nof decision trees from different random subspaces of\ndata. the decision trees are often either built using\nc4.5 [3] or cart [4], but only one type within\na single random forest. 
in recent years, random\nforests have attracted increasing attention due to\n(1) its competitive performance compared with other\nclassification methods, especially for high-dimensional\ndata, (2) algorithmic intuitiveness and simplicity, and\n(3) its most important capability - "ensemble" using\nbagging [5] and stochastic discrimination [2].\nseveral methods have been proposed to grow random\nforests from subspaces of data [1, 2, 6, 7, 8, 9, 10]. in\nthese methods, the most popular forest construction\nprocedure was proposed by breiman [1] to first use\nbagging to generate training data subsets for building\nindividual trees.\na subspace of features is then\nrandomly selected at each node to grow branches of\na decision tree. the trees are then combined as an\nensemble into a forest. as an ensemble learner, the\nperformance of a random forest is highly dependent\non two factors: the performance of each tree and the\ndiversity of the trees in the forests [11]. breiman\nformulated the overall performance of a set of trees as\nthe average strength and proved that the generalization\n\nerror of a random forest is bounded by the ratio of the\naverage correlation between trees divided by the square\nof the average strength of the trees.\nfor very high dimensional data, such as text data,\nthere are usually a large portion of features that are\nuninformative to the classes. during this forest building\nprocess, informative features would have the large\nchance to be missed, if we randomly select a small\nsubspace (breiman suggested selecting log2 (m ) + 1\nfeatures in a subspace, where m is the number of\nindependent features in the data) from high dimensional\ndata [12]. 
as a result, weak trees are created from these\nsubspaces, the average strength of those trees is reduced\nand the error bound of the random forest is enlarged.\ntherefore, when a large proportion of such "weak"\ntrees are generated in a random forest, the forest has a\nlarge likelihood to make a wrong decision which mainly\nresults from those "weak" trees' classification power.\nto address this problem, we aim to optimize decision\ntrees of a random forest by two strategies. one\nstraightforward strategy is to enhance the classification\nperformance of individual trees by a feature weighting\nmethod for subspace sampling [12, 13, 14]. in this\nmethod, feature weights are computed with respect\nto the correlations of features to the class feature\nand regarded as the probabilities of the feature to\nbe selected in subspaces. this method obviously\nincreases the classification performance of individual\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\n2\n\nbaoxun xu, joshua zhexue huang, graham williams, yunming ye\n\ntrees because the subspaces will be biased to contain\nmore informative features. however, the chance of more\ncorrelated trees is also increased since the features with\nlarge weights are likely to be repeatedly selected.\nthe second strategy is more straightforward: use\nseveral different types of decision trees for each training\ndata subset, to increase the diversity of the trees,\nand then select the optimal tree as the individual\ntree classifier in the random forest model. the work\npresented here extends the algorithm developed in [15].\nspecifically, we build three different types of tree\nclassifiers (c4.5, cart, and chaid [16, 17]) for each\ntraining data subset. we then evaluate the performance\nof the three classifiers and select the best tree. 
in\nthis way, we build a hybrid random forest which may\ninclude different types of decision trees in the ensemble.\nthe added diversity of the decision trees can effectively\nimprove the accuracy of each tree in the forest, and\nhence the classification performance of the ensemble.\nhowever, when we use this method to build the best\nrandom forest model for classifying high dimensional\ndata, we can not be sure of what subspace size is best.\nin this paper, we propose a hybrid weighted random\nforest algorithm by simultaneously using a new feature\nweighting method together with the hybrid random\nforest method to classify high dimensional data. in\nthis new random forest algorithm, we calculate feature\nweights and use weighted sampling to randomly select\nfeatures for subspaces at each node in building different\ntypes of trees classifiers (c4.5, cart, and chaid) for\neach training data subset, and select the best tree as\nthe individual tree in the final ensemble model.\nexperiments were performed on 8 high dimensional\ntext datasets with dimensions ranging from 2000 to\n13195. we compared the performance of eight random\nforest methods and well-known classification methods:\nc4.5 random forest, cart random forest, chaid\nrandom forest, hybrid random forest, c4.5 weighted\nrandom forest, cart weighted random forest, chaid\nweighted random forest, hybrid weighted random\nforest, support vector machines [18], naive bayes [19],\nand k-nearest neighbors [20].\nthe experimental\nresults show that our hybrid weighted random forest\nachieves improved classification performance over the\nten competitive methods.\nthe remainder of this paper is organized as follows.\nin section 2, we introduce a framework for building\na hybrid weighted random forest, and describe a new\nrandom forest algorithm. section 3 summarizes four\nmeasures to evaluate random forest models. we present\nexperimental results on 8 high dimensional text datasets\nin section 4. 
section 5 contains our conclusions.\n\ntable 1. contingency table of input feature a and class\nfeature y\ny = y1 . . .\ny = yj . . .\ny = yq total\na = a1\n11\n...\n1j\n...\n1q\n1*\n..\n..\n..\n..\n..\n..\n.\n.\n...\n.\n.\n.\n.\na = ai\ni1\n...\nij\n...\niq\ni*\n..\n..\n..\n..\n..\n..\n...\n.\n.\n.\n.\n.\n.\na = ap\np1\n...\npj\n...\npq\np*\ntotal\n*1\n...\n*j\n...\n*q\n\n\ngeneral framework for building hybrid random forests.\nby integrating these two methods, we propose a novel\nhybrid weighted random forest algorithm.\n2.1.\n\nlet y be the class (or target) feature with q distinct\nclass labels yj for j = 1, * * * , q. for the purposes of\nour discussion we consider a single categorical feature\na in dataset d with p distinct category values. we\ndenote the distinct values by ai for i = 1, * * * , p.\nnumeric features can be discretized into p intervals with\na supervised discretization method.\nassume d has val objects. the size of the subset of\nd satisfying the condition that a = ai and y = yj is\ndenoted by ij . considering all combinations of the\ncategorical values of a and the labels of y , we can\nobtain a contingency table [21] of a against y as shown\nin table 1. the far right column contains the marginal\ntotals for feature a:\n\nhybrid\nforests\n\nweighted\n\nrandom\n\nin this section, we first introduce a feature weighting\nmethod for subspace sampling. then we present a\n\nq\n\n\ni. =\n\nij\n\nfor i = 1, * * * , p\n\n(1)\n\nj=1\n\nand the bottom row is the marginal totals for class\nfeature y :\n.j =\n\np\n\n\nij\n\nfor j = 1, * * * , q\n\n(2)\n\ni=1\n\nthe grand total (the total number of samples) is in\nthe bottom right corner:\n=\n\nq \np\n\n\nij\n\n(3)\n\nj=1 i=1\n\ngiven a training dataset d and feature a we first\ncompute the contingency table. 
the feature weights are\nthen computed using the two methods to be discussed\nin the following subsection.\n2.2.\n\n2.\n\nnotation\n\nfeature weighting method\n\nin this subsection, we give the details of the feature\nweighting method for subspace sampling in random\nforests. consider an m-dimensional feature space\n{a1 , a2 , . . . , am }. we present how to compute the\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\nhybrid weighted random forests for classifying very high-dimensional data\nweights {w1 , w2 , . . . , wm } for every feature in the space.\nthese weights are then used in the improved algorithm\nto grow each decision tree in the random forest.\n2.2.1. feature weight computation\nthe weight of feature a represents the correlation\nbetween the values of feature a and the values of the\nclass feature y . a larger weight will indicate that the\nclass labels of objects in the training dataset are more\ncorrelated with the values of feature a, indicating that\na is more informative to the class of objects. thus it\nis suggested that a has a stronger power in predicting\nthe classes of new objects.\nin the following, we propose to use the chi-square\nstatistic to compute feature weights because this\nmethod can quantify the correspondence between two\ncategorical variables.\ngiven the contingency table of an input feature a and\nthe class feature y of dataset d, the chi-square statistic\nof the two features is computed as:\ncorr(a, y ) =\n\nq\np \n\n(ij - tij )2\ntij\ni=1 j=1\n\n(4)\n\nwhere ij is the observed frequency from the\ncontingency table and tij is the expected frequency\ncomputed as\ni* x *j\ntij =\n\n\n(5)\n\nthe larger the measure corr(a, y ), the more\ninformative the feature a is in predicting class y .\n2.2.2. normalized feature weight\nin practice, feature weights are normalized for feature\nsubspace sampling. we use corr(a, y ) to measure the\ninformativeness of these features and consider them\nas feature weights. 
however, to treat the weights as\nprobabilities of features, we normalize the measures to\nensure the sum of the normalized feature weights is\nequal to 1. let corr(ai , y ) (1 i m ) be the set\nof m feature measures. we compute the normalized\nweights as\n\ncorr(ai , y )\nwi = n \n(6)\ni=1 corr(ai , y )\nhere, we use the square root to smooth the values of\nthe measures. wi can be considered as the probability\nthat feature ai is randomly sampled in a subspace. the\nmore informative a feature is, the larger the weight and\nthe higher the probability of the feature being selected.\n\ndiversity is commonly obtained by using bagging and\nrandom subspace sampling. we introduce a further\nelement of diversity by using different types of trees.\nconsidering an analogy with forestry, the different data subsets from bagging represent the "soil structures." different decision tree algorithms represent "different tree species". our approach has two key aspects:\none is to use three types of decision tree algorithms to\ngenerate three different tree classifiers for each training data subset; the other is to evaluate the accuracy\nof each tree as the measure of tree importance. in this\npaper, we use the out-of-bag accuracy to assess the importance of a tree.\nfollowing breiman [1], we use bagging to generate\na series of training data subsets from which we build\ntrees. for each tree, the data subset used to grow\nthe tree is called the "in-of-bag" (iob) data and the\nremaining data subset is called the "out-of-bag" (oob)\ndata. since oob data is not used for building trees\nwe can use this data to objectively evaluate each tree's\naccuracy and importance. 
the oob accuracy gives an\nunbiased estimate of the true accuracy of a model.\ngiven n instances in a training dataset d and a tree\nclassifier hk (iobk ) built from the k'th training data\nsubset iobk , we define the oob accuracy of the tree\nhk (iobk ), for di d, as:\nn\noobacck =\n\nframework for building a hybrid random\nforest\n\nas an ensemble learner, the performance of a random\nforest is highly dependent on two factors: the diversity\namong the trees and the accuracy of each tree [11].\n\ni=1\n\ni(hk (di ) = yi ; di oobk )\nn\ni=1 i(di oobk )\n\n(7)\n\nwhere i(.) is an indicator function. the larger the\noobacck , the better the classification quality of a tree.\nwe use the out-of-bag data subset oobi to calculate\nthe out-of-bag accuracies of the three types of trees\n(c4.5, cart and chaid) with evaluation values e1 ,\ne2 and e3 respectively.\nfig. 1 illustrates the procedure for building a hybrid\nrandom forest model. firstly, a series of iob oob\ndatasets are generated from the entire training dataset\nby bagging. then, three types of tree classifiers (c4.5,\ncart and chaid) are built using each iob dataset.\nthe corresponding oob dataset is used to calculate the\noob accuracies of the three tree classifiers. finally,\nwe select the tree with the highest oob accuracy as\nthe final tree classifier, which is included in the hybrid\nrandom forest.\nbuilding a hybrid random forest model in this\nway will increase the diversity among the trees.\nthe classification performance of each individual tree\nclassifier is also maximized.\n2.4.\n\n2.3.\n\n3\n\ndecision tree algorithms\n\nthe core of our approach is the diversity of decision\ntree algorithms in our random forest. different decision\ntree algorithms grow structurally different trees from\nthe same training data. selecting a good decision tree\nalgorithm to grow trees for a random forest is critical\n\nthe computer journal, vol. ??,\n\nno. 
??,\n\n????\n\n4\n\nbaoxun xu, joshua zhexue huang, graham williams, yunming ye\nthe difference lies in the way to split a node, such\nas the split functions and binary branches or multibranches. in this work we use these different decision\ntree algorithms to build a hybrid random forest.\n\n2.5.\n\nfigure 1. the hybrid random forests framework.\n\nfor the performance of the random forest. few studies\nhave considered how different decision tree algorithms\naffect a random forest. we do so in this paper.\nthe common decision tree algorithms are as follows:\nclassification trees 4.5 (c4.5) is a supervised\nlearning classification algorithm used to construct\ndecision trees. given a set of pre-classified objects, each\ndescribed by a vector of attribute values, we construct\na mapping from attribute values to classes. c4.5 uses\na divide-and-conquer approach to grow decision trees.\nbeginning with the entire dataset, a tree is constructed\nby considering each predictor variable for dividing the\ndataset. the best predictor is chosen at each node\nusing a impurity or diversity measure. the goal is\nto produce subsets of the data which are homogeneous\nwith respect to the target variable. c4.5 selects the test\nthat maximizes the information gain ratio (igr) [3].\nclassification and regression tree (cart) is\na recursive partitioning method that can be used for\nboth regression and classification. the main difference\nbetween c4.5 and cart is the test selection and\nevaluation process.\nchi-squared automatic interaction detector\n(chaid) method is based on the chi-square test of\nassociation. a chaid decision tree is constructed\nby repeatedly splitting subsets of the space into two\nor more nodes. 
to determine the best split at any\nnode, any allowable pair of categories of the predictor\nvariables is merged until there is no statistically\nsignificant difference within the pair with respect to the\ntarget variable [16, 17].\nfrom these decision tree algorithms, we can see that\n\nhybrid weighted random forest algorithm\n\nin this subsection we present a hybrid weighted\nrandom forest algorithm by simultaneously using the\nfeature weights and a hybrid method to classify high\ndimensional data. the benefits of our algorithm has\ntwo aspects: firstly, compared with hybrid forest\nmethod [15], we can use a small subspace size to\ncreate accurate random forest models.\nsecondly,\ncompared with building a random forest using feature\nweighting [14], we can use several different types of\ndecision trees for each training data subset to increase\nthe diversities of trees. the added diversity of the\ndecision trees can effectively improve the classification\nperformance of the ensemble model. the detailed steps\nare introduced in algorithm 1.\ninput parameters to algorithm 1 include a training\ndataset d, the set of features a, the class feature y ,\nthe number of trees in the random forest k and the\nsize of subspaces m. the output is a random forest\nmodel m . lines 9-16 form the loop for building k\ndecision trees. in the loop, line 10 samples the training\ndata d by sampling with replacement to generate an\nin-of-bag data subset iobi for building a decision tree.\nline 11-14 build three types of tree classifiers (c4.5,\ncart, and chaid). in this procedure, line 12 calls\nthe function createt reej () to build a tree classifier.\nline 13 calculates the out-of-bag accuracy of the tree\nclassifier. after this procedure, line 15 selects the tree\nclassifier with the maximum out-of-bag accuracy. k\ndecision tree trees are thus generated to form a hybrid\nweighted random forest model m .\ngenerically, function createt reej () first creates a\nnew node. 
then, it tests the stopping criteria to decide\nwhether to return to the upper node or to split this\nnode. if we choose to split this node, then the feature\nweighting method is used to randomly select m features\nas the subspace for node splitting. these features\nare used as candidates to generate the best split to\npartition the node. for each subset of the partition,\ncreatet reej () is called again to create a new node under\nthe current node. if a leaf node is created, it returns to\nthe parent node. this recursive process continues until\na full tree is generated.\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\nhybrid weighted random forests for classifying very high-dimensional data\nalgorithm 1 new random forest algorithm\n1: input:\n2: - d : the training dataset,\n3: - a : the features space {a1 , a2 , ..., am },\n4: - y : the class features space {y1 , y2 , ..., yq },\n5: - k : the number of trees,\n6: - m : the size of subspaces.\n7: output: a random forest m ;\n8: method:\n9: for i = 1 to k do\n10:\ndraw a bootstrap sample in-of-bag data subset\niobi and out-of-bag data subset oobi from\ntraining dataset d;\n11:\nfor j = 1 to 3 do\n12:\nhi,j (iobi ) = createt reej ();\nuse out-of-bag data subset oobi to calculate\n13:\nthe out-of-bag accuracy oobacci, j of the tree\nclassifier hi,j (iobi ) by equation(1);\n14:\nend for\n15:\nselect hi (iobi ) with the highest out-of-bag\naccuracy oobacci as optimal tree i;\n16: end for\n17: combine\nthe\nk\ntree\nclassifiers\nh1 (iob1 ), h2 (iob2 ), ..., hk (iobk ) into a random\nforest m ;\n18:\n19:\n20:\n21:\n22:\n23:\n24:\n25:\n26:\n27:\n28:\n29:\n30:\n31:\n32:\n\n3.\n\nfunction createtree()\ncreate a new node n ;\nif stopping criteria is met then\nreturn n as a leaf node;\nelse\nfor j = 1 to m do\ncompute\nthe\ninformativeness\nmeasure\ncorr(aj , y ) by equation (4);\nend for\ncompute feature weights {w1 , w2 , ..., wm } by\nequation (6);\nuse the feature weighting method to randomly\nselect m 
features;\nuse these m features as candidates to generate\nthe best split for the node to be partitioned;\ncall createtree() for each split;\nend if\nreturn n ;\nevaluation measures\n\nin this paper, we use five measures, i.e., strength,\ncorrelation, error bound c s2 , test accuracy, and f1\nmetric, to evaluate our random forest models. strength\nmeasures the collective performance of individual trees\nin a random forest and the correlation measures the\ndiversity of the trees. the ratio of the correlation\nover the square of the strength c s2 indicates the\ngeneralization error bound of the random forest model.\nthese three measures were introduced in [1]. the\naccuracy measures the performance of a random forest\nmodel on unseen test data. the f1 metric is a\n\n5\n\ncommonly used measure of classification performance.\n3.1.\n\nstrength and correlation measures\n\nwe follow breiman's method described in [1] to\ncalculate the strength, correlation and the ratio c s2 .\nfollowing breiman's notation, we denote strength as\ns and correlation as . let hk (iobk ) be the kth\ntree classifier grown from the kth training data iobk\nsampled from d with replacement.\nassume the\nrandom forest model contains k trees. 
the out-of-bag\nproportion of votes for di d on class j is\nk\ni(hk (di ) = j; di \n iobk )\nq(di , j) = k=1k\n(8)\n iobk )\nk=1 i(di \nthis is the number of trees in the random forest\nwhich are trained without di and classify di into class\nj, divided by the number of training datasets not\ncontaining di .\nthe strength s is computed as:\n1\n(q(di , yi ) - maxj=yi q(di , j))\nn i=1\nn\n\ns=\n\n(9)\n\nwhere n is the number of objects in d and yi indicates\nthe true class of di .\nthe correlation is computed as:\nn\n1\n2\n2\ni=1 (q(di , yi ) - maxj=yi q(di , j)) - s\nn\n(10)\n =\n\n\nk\n1\n(k\nk + (pk - pk )2 )2\nk=1 pk + p\nwhere\n\nn\npk =\n\ni=1\n\ni(hk (di ) = yi ; di \n iobk )\nn\n iobk )\ni=1 i(di \n\n(11)\n\nand\nn\npk =\n\ni=1\n\ni(hk (di ) = j(di , y ); di \n iobk )\nn\ni(d\n\n \niob\n)\ni\nk\ni=1\n\n(12)\n\nwhere\nj(di , y ) = argmaxj=yi q(d, j)\n\n(13)\n\nis the class that obtains the maximal number of votes\namong all classes but the true class.\n3.2.\n\ngeneral error bound measure c s2\n\ngiven the strength and correlation, the out-of-bag\nestimate of the c s2 measure can be computed.\nan important theoretical result in breiman's method\nis the upper bound of the generalization error of the\nrandom forest ensemble that is derived as\np e (1 - s2 ) s2\n\n(14)\n\nwhere is the mean value of correlations between all\npairs of individual classifiers and s is the strength of\nthe set of individual classifiers that is estimated as the\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\n6\n\nbaoxun xu, joshua zhexue huang, graham williams, yunming ye\n\naverage accuracy of individual classifiers on d with\nout-of-bag evaluation. this inequality shows that the\ngeneralization error of a random forest is affected by\nthe strength of individual classifiers and their mutual\ncorrelations. therefore, breiman defined the c s2 ratio\nto measure a random forest as\nc s2 = s2\n\n(15)\n\nthe smaller the ratio, the better the performance of\nthe random forest. 
as such, c s2 gives guidance for\nreducing the generalization error of random forests.\n3.3.\n\ntest accuracy\n\nthe test accuracy measures the classification performance of a random forest on the test data set. let\ndt be a test data and yt be the class labels. given\ndi dt , the number of votes for di on class j is\nn (di , j) =\n\nk\n\n\ni(hk (di ) = j)\n\n(16)\n\ntable 2.\nsummary statistic of 8 high-dimensional\ndatasets\nname\nfeatures\ninstances\nclasses % minority\nfbis\n2000\n2463\n17\n1.54\nre0\n2886\n1504\n13\n0.73\nre1\n3758\n1657\n25\n0.6\ntr41\n7454\n878\n10\n1.03\nwap\n8460\n1560\n20\n0.32\ntr31\n10,128\n927\n7\n0.22\nla2s\n12,432\n3075\n6\n8.07\nla1s\n13,195\n3204\n6\n8.52\n\nit emphasizes the performance of a classifier on rare\ncategories. define and as follows:\n\ni =\n\nt pi\nt pi\n, i =\n(t pi + f pi )\n(t pi + f ni )\n\n(20)\n\nf 1 for each category i and the macro-averaged f1\nare computed as:\n\nk=1\n\nthe test accuracy is calculated as\nf 1i =\n1\ni(n (di , yi ) - maxj=yi n (di , j) > 0) (17)\nn i=1\n\n2i i\n, m acrof 1 =\ni + i\n\nq\ni=1\n\nq\n\nf 1i\n\n(21)\n\nn\n\nacc =\n\nwhere n is the number of objects in dt and yi indicates\nthe true class of di .\n3.4.\n\nf1 metric\n\nto evaluate the performance of classification methods\nin dealing with an unbalanced class distribution, we use\nthe f1 metric introduced by yang and liu [22]. this\nmeasure is equal to the harmonic mean of recall ()\nand precision (). the overall f1 score of the entire\nclassification problem can be computed by a microaverage and a macro-average.\nmicro-averaged f1 is computed globally over all\nclasses, and emphasizes the performance of a classifier\non common classes. define and as follows:\nq\n\nq\nt pi\ni=1 t pi\n = q i=1\n, = q\n(18)\ni=1 (t pi + f pi )\ni=1 (t pi + f ni )\nwhere q is the number of classes. 
t pi (true positives)\nis the number of objects correctly predicted as class i,\nf pi (false positives) is the number of objects that are\npredicted to belong to class i but do not. the microaveraged f1 is computed as:\nm icrof 1 =\n\n2\n+\n\n(19)\n\nmacro-averaged f1 is first computed locally over\neach class, and then the average over all classes is taken.\n\nthe larger the microf1 and macrof1 values are, the\nhigher the classification performance of the classifier.\n4.\n\nexperiments\n\nin this section, we present two experiments that\ndemonstrate the effectiveness of the new random\nforest algorithm for classifying high dimensional data.\nhigh dimensional datasets with various sizes and\ncharacteristics were used in the experiments. the\nfirst experiment is designed to show how our proposed\nmethod can reduce the generalization error bound\nc s2 , and improve test accuracy when the size of\nthe selected subspace is not too large. the second\nexperiment is used to demonstrate the classification\nperformance of our proposed method in comparison to\nother classification methods, i.e. svm, nb and knn.\n4.1.\n\ndatasets\n\nin the experiments, we used eight real-world high\ndimensional datasets. these datasets were selected\ndue to their diversities in the number of features, the\nnumber of instances, and the number of classes. their\ndimensionalities vary from 2000 to 13,195. instances\nvary from 878 to 3204 and the minority class rate varies\nfrom 0.22% to 8.52%. in each dataset, we randomly\nselect 70% of instances as the training dataset, and\nthe remaining data as the test dataset. detailed\ninformation of the eight datasets is listed in table 2.\nthe fbis, re0, re1, tr41, wap, tr31, la2s\nand la1s datasets are classical text classification\nbenchmark datasets which were carefully selected and\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\nhybrid weighted random forests for classifying very high-dimensional data\npreprocessed by han and karypis [23]. 
dataset fbis\nwas compiled from the foreign broadcast information\nservice trec-5 [24]. the datasets re0 and re1 were\nselected from the reuters-21578 text categorization test\ncollection distribution 1.0 [25]. the datasets tr41 and\ntr31 were derived from trec-5 [24], trec-6 [24],\nand trec-7 [24]. dataset wap is from the webace\nproject (wap) [26]. the datasets la2s and la1s were\nselected from the los angeles times for trec-5 [24].\nthe classes of these datasets were generated from the\nrelevance judgment provided in these collections.\n4.2.\n\nperformance comparisons between random forest methods\n\nthe purpose of this experiment was to evaluate\nthe effect of the hybrid weighted random forest\nmethod (h w rf) on strength, correlation, c s2 ,\nand test accuracy.\nthe eight high dimensional\ndatasets were analyzed and results were compared\nwith seven other random forest methods, i.e., c4.5\nrandom forest (c4.5 rf), cart random forest\n(cart rf), chaid random forest (chaid rf),\nhybrid random forest (h rf), c4.5 weighted random\nforest (c4.5 w rf), cart weighted random forest\n(cart w rf), chaid weighted random forest\n(chaid w rf). for each dataset, we ran each\nrandom forest algorithm against different sizes of the\nfeature subspaces. since the number of features in these\ndatasets was very large, we started with a subspace\nof 10 features and increased the subspace by 5 more\nfeatures each time. for a given subspace size, we built\n100 trees for each random forest model. in order to\nobtain a stable result, we built 80 random forest models\nfor each subspace size, each dataset and each algorithm,\nand computed the average values of the four measures\nof strength, correlation, c s2 , and test accuracy as the\nfinal results for comparison. the performance of the\neight random forest algorithms on the four measures\nfor each of the 8 datasets is shown in figs. 2, 3, 4, and\n5.\nfig. 
2 plots the strength for the eight methods against\ndifferent subspace sizes on each of the 8 datasets.\nin the same subspace, the higher the strength, the\nbetter the result. from the curves, we can see that\nthe new algorithm (h w rf) consistently performs\nbetter than the seven other random forest algorithms.\nthe advantages are more obvious for small subspaces.\nthe new algorithm quickly achieved higher strength\nas the subspace size increases.\nthe seven other\nrandom forest algorithms require larger subspaces to\nachieve a higher strength. these results indicate that\nthe hybrid weighted random forest algorithm enables\nrandom forest models to achieve a higher strength\nfor small subspace sizes compared to the seven other\nrandom forest algorithms.\nfig. 3 plots the curves for the correlations for the\neight random forest methods on the 8 datasets. for\n\n7\n\nsmall subspace sizes, h rf, c4.5 rf, cart rf,\nand chaid rf produce higher correlations between\nthe trees on all datasets. the correlation decreases\nas the subspace size increases. for the random forest\nmodels the lower the correlation between the trees\nthen the better the final model.\nwith our new\nrandom forest algorithm (h w rf) a low correlation\nlevel was achieved with very small subspaces in all\n8 datasets. we also note that as the subspace size\nincreased the correlation level increased as well. this is\nunderstandable because as the subspace size increases,\nthe same informative features are more likely to be\nselected repeatedly in the subspaces, increasing the\nsimilarity of the decision trees. therefore, the feature\nweighting method for subspace selection works well for\nsmall subspaces, at least from the point of view of the\ncorrelation measure.\nfig. 4 shows the error bound indicator c s2 for the\neight methods on the 8 datasets. from these figures\nwe can observe that as the subspace size increases, c s2\nconsistently reduces. 
the behaviour indicates that a\nsubspace size larger than log2 (m )+1 benefits all eight\nalgorithms. however, the new algorithm (h w rf)\nachieved a lower level of c s2 at subspace size of\nlog2 (m ) + 1 than the seven other algorithms.\nfig. 5 plots the curves showing the accuracy of the\neight random forest models on the test datasets from\nthe 8 datasets. we can clearly see that the new random\nforest algorithm (h w rf) outperforms the seven\nother random forest algorithms in all eight data sets.\nit can be seen that the new method is more stable\nin classification performance than other methods. in\nall of these figures, it is observed that the highest test\naccuracy is often obtained with the default subspace size\nof log2 (m ) + 1. this implies that in practice, large\nsize subspaces are not necessary to grow high-quality\ntrees for random forests.\n4.3.\n\nperformance comparisons\nclassification methods\n\nwith\n\nother\n\nwe conducted a further experimental comparison\nagainst three other widely used text classification\nmethods: support vector machines (svm), naive\nbayes (nb), and k-nearest neighbor (knn). the\nsupport vector machine used a linear kernel with a\nregularization parameter of 0.03125, which was often\nused in text categorization. for naive bayes, we\nadopted the multi-variate bernoulli event model that\nis frequently used in text classification [27]. for knearest neighbor (knn), we set the number k of\nneighbors to 13. in the experiments, we used weka's\nimplementation for these three text classification\nmethods [28]. we used a single subspace size of\nfeatures in all eight datasets to run the random forest\nalgorithms. for h rf, c4.5 rf, cart rf, and\nchaid rf, we used a subspace size of 90 features in\nthe first 6 datasets (i.e., fbis, re0, re1, tr41, wap, and\n\nthe computer journal, vol. ??,\n\nno. 
??,\n\n????\n\n8\n\nbaoxun xu, joshua zhexue huang, graham williams, yunming ye\nfbis\n\nre0\n\n0.52\n\n0.52\n\n0.48\n\n0.48\n\n0.44\n\nstrength\n\nstrength\n\n0.44\n\n0.40\n\n0.40\n\nh_w_rf\n\n0.36\n\nc4.5_w_rf\ncart_w_rf\n\n0.32\n\nh_w_rf\nc4.5_w_rf\n\n0.36\n\ncart_w_rf\n\nchaid_w_rf\n\nchaid_w_rf\n\n0.32\n\nh_rf\n\n0.28\n\nh_rf\n\nc4.5_rf\n\nc4.5_rf\n\n0.28\n\ncart_rf\n\n0.24\n\ncart_rf\n\nchaid_rf\n\nchaid_rf\n\n0.24\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of features\n\nre1\n\n0.60\n\n40\n\ntr41\n\n0.8\n\n0.55\n\n0.7\n\n0.50\n0.6\n\n0.40\n\nh_w_rf\nc4.5_w_rf\n\n0.35\n\ncart_w_rf\n\nstrength\n\nstrength\n\n0.45\n0.5\nh_w_rf\nc4.5_w_rf\n\n0.4\n\ncart_w_rf\n\nchaid_w_rf\n\n0.30\n\nchaid_w_rf\n\nh_rf\n\nh_rf\n\n0.3\n\nc4.5_rf\n\n0.25\n\nc4.5_rf\n\ncart_rf\n\ncart_rf\n\n0.2\n\nchaid_rf\n\nchaid_rf\n\n0.20\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of features\n\nwap\n\ntr31\n\n0.9\n\n0.44\n0.8\n\n0.40\n\n0.36\n\nstrength\n\nh_w_rf\n\n0.28\n\nc4.5_w_rf\ncart_w_rf\n\n0.24\n\nstrength\n\n0.7\n\n0.32\n\n0.6\nh_w_rf\nc4.5_w_rf\n\n0.5\n\ncart_w_rf\nchaid_w_rf\n\nchaid_w_rf\nh_rf\n\n0.20\n\nh_rf\n\n0.4\n\nc4.5_rf\n\nc4.5_rf\n\ncart_rf\n\ncart_rf\n\n0.16\n\n0.3\n\nchaid_rf\n\nchaid_rf\n\n0.12\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nla2s\n\n60\n\n70\n\n80\n\n90\n\n100\n\nla1s\n\n0.60\n\n0.55\n\n0.55\n\n0.50\n\n0.50\n\n0.45\n\n0.45\nh_w_rf\n\n0.40\n\nc4.5_w_rf\ncart_w_rf\nchaid_w_rf\n\n0.35\n\nstrength\n\nstrength\n\n50\n\nnumber of features\n\nnumber of 
features\n\n0.60\n\n40\n\nh_w_rf\n\n0.40\n\nc4.5_w_rf\ncart_w_rf\n\n0.35\n\nchaid_w_rf\n\nh_rf\nc4.5_rf\n\n0.30\n\nh_rf\n\n0.30\n\nc4.5_rf\n\ncart_rf\nchaid_rf\n\ncart_rf\n\n0.25\n\nchaid_rf\n\n0.25\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\n10\n\n20\n\nnumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nnumber of features\n\nfigure 2. strength changes against the number of features in the subspace on the 8 high dimensional datasets\n\ntr31) to run the random forest algorithms, and used\na subspace size of 120 features in the last 2 datasets\n(la2s and la1s) to run these random forest algorithms.\nfor h w rf, c4.5 w rf, cart w rf, and\nchaid w rf, we used breiman's subspace size of\n\nlog2 (m ) + 1 to run these random forest algorithms.\nthis number of features provided a consistent result as\nshown in fig. 5. in order to obtain stable results, we\nbuilt 20 random forest models for each random forest\nalgorithm and each dataset and present the average\n\nthe computer journal, vol. ??,\n\nno. 
??,\n\n????\n\nhybrid weighted random forests for classifying very high-dimensional data\nfbis\n\n9\n\nre0\n\n0.216\n\n0.285\n\n0.208\n\n0.270\n\ncorrelation\n\ncorrelation\n\n0.255\n\n0.200\n\n0.240\n\n0.192\nh_w_rf\nc4.5_w_rf\n\n0.184\n\ncart_w_rf\nchaid_w_rf\n\n0.176\n\nh_w_rf\n\n0.225\n\nc4.5_w_rf\ncart_w_rf\n\n0.210\n\nchaid_w_rf\n\nh_rf\nc4.5_rf\n\n0.168\n\nh_rf\n\n0.195\n\nc4.5_rf\n\ncart_rf\nchaid_rf\n\ncart_rf\n\n0.180\n\nchaid_rf\n\n0.160\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nnumber of features\n\n40\n\n50\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nre1\n\n0.27\n\n60\n\ntr41\n0.18\n\n0.26\n\n0.16\n\n0.25\n\n0.23\nh_w_rf\nc4.5_w_rf\n\n0.22\n\ncart_w_rf\nchaid_w_rf\n\n0.21\n\ncorrelation\n\ncorrelation\n\n0.24\n\n0.14\n\nh_w_rf\n\n0.12\n\nc4.5_w_rf\ncart_w_rf\n\n0.10\n\nchaid_w_rf\nh_rf\n\nh_rf\nc4.5_rf\n\n0.20\n\nc4.5_rf\n\n0.08\n\ncart_rf\n\ncart_rf\n\n0.19\n\nchaid_rf\n\nchaid_rf\n\n0.06\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of features\n\nwap\n\ntr31\n\n0.27\n\n0.14\n\n0.26\n0.12\n\ncorrelation\n\n0.24\n\n0.23\n\nh_w_rf\nc4.5_w_rf\n\n0.22\n\ncart_w_rf\nchaid_w_rf\n\n0.21\n\ncorrelation\n\n0.25\n\n0.10\n\nh_w_rf\nc4.5_w_rf\n\n0.08\n\ncart_w_rf\nchaid_w_rf\n\n0.06\n\nh_rf\n\nh_rf\n\nc4.5_rf\n\n0.20\n\nc4.5_rf\n\ncart_rf\n\ncart_rf\n\n0.04\n\nchaid_rf\n\n0.19\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nchaid_rf\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of 
features\n\nla2s\n\nla1s\n\n0.162\n\n0.165\n\n0.156\n\n0.160\n\n0.155\n\n0.150\n\nh_w_rf\n\n0.138\n\nc4.5_w_rf\ncart_w_rf\n\n0.132\n\nchaid_w_rf\n\n0.126\n\ncorrelation\n\ncorrelation\n\n0.150\n\n0.144\n\n0.145\n\nh_w_rf\nc4.5_w_rf\n\n0.140\n\ncart_w_rf\nchaid_w_rf\n\n0.135\n\nh_rf\n\nh_rf\nc4.5_rf\n\n0.120\n\nc4.5_rf\n\n0.130\n\ncart_rf\n\ncart_rf\nchaid_rf\n\nchaid_rf\n\n0.125\n\n0.114\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\n10\n\n20\n\nnumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nnumber of features\n\nfigure 3. correlation changes against the number of features in the subspace on the 8 high dimensional datasets\n\nresults, noting that the range of values are less than\n0.005 and the hybrid trees are always more accurate.\nthe comparison results of classification performance\nof eleven methods are shown in table 3.\nthe\nperformance is estimated using test accuracy (acc),\n\nmicro f1 (mic), and macro f1 (mac). boldface\ndenotes best results between eleven classification\nmethods.\nwhile the improvement is often quite\nsmall, there is always an improvement demonstrated.\nwe observe that our proposed method (h w rf)\n\nthe computer journal, vol. ??,\n\nno. 
??,\n\n????\n\n10\n\nbaoxun xu, joshua zhexue huang, graham williams, yunming ye\nfbis\n\n3.50\n\nre0\n4.0\n\n3.15\n\nlog (m)+1\n\n3.5\n\n2\n\n2.80\n\n3.0\n\n2.45\n\nc4.5_w_rf\n\n1.75\n\n2.5\n\n2\n\nh_w_rf\n\nc s\n\nc s\n\n2\n\n2.10\n\ncart_w_rf\n\nh_w_rf\nc4.5_w_rf\n\n2.0\n\ncart_w_rf\n\nchaid_w_rf\n\n1.40\n\nchaid_w_rf\n\n1.5\n\nh_rf\n\nh_rf\n\nlog (m)+1\n2\n\nc4.5_rf\n\n1.05\n\nc4.5_rf\n\n1.0\n\ncart_rf\n\ncart_rf\n\nchaid_rf\n\n0.70\n\nchaid_rf\n\n0.5\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nnumber of features\n\n70\n\n80\n\n90\n\n100\n\n3.5\n\n4.9\n\nlog (m)+1\n\n3.0\n\n4.2\n\n2\n\nc s\n\nh_w_rf\n\n2\n\n2.5\n\n3.5\n\n2\n\n60\n\ntr41\n\n4.0\n\n5.6\n\nc s\n\n50\n\nnumber of features\n\nre1\n\n6.3\n\n40\n\nc4.5_w_rf\n\n2.8\n\n2.0\n\nh_w_rf\nc4.5_w_rf\n\n1.5\n\ncart_w_rf\n\ncart_w_rf\n\nchaid_w_rf\n\n2.1\n\nchaid_w_rf\n\n1.0\n\nh_rf\n\nh_rf\n\nc4.5_rf\n\nlog (m)+1\n\n1.4\n\n2\n\nc4.5_rf\n\n0.5\n\ncart_rf\n\ncart_rf\n\nchaid_rf\n\n0.7\n\nchaid_rf\n\n0.0\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of features\n\nwap\n\ntr31\n\n14\n\n1.50\n\n12\n\nlog (m)+1\n\nlog (m)+1\n2\n\n1.25\n\n2\n\n10\n\nc4.5_w_rf\n\nc s\n\n2\n\nc s\n\nh_w_rf\n\n6\n\n2\n\n1.00\n\n8\n\nh_w_rf\n\n0.75\n\nc4.5_w_rf\ncart_w_rf\n\ncart_w_rf\n0.50\n\nchaid_w_rf\n\n4\n\nchaid_w_rf\nh_rf\n\nh_rf\nc4.5_rf\n\n2\n\nc4.5_rf\n\n0.25\n\ncart_rf\n\ncart_rf\nchaid_rf\n\nchaid_rf\n\n0\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n0.00\n\n100\n\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of features\n\nla1s\n\nla2s\n2.2\n\n1.8\n\n2.0\n1.6\n1.8\n1.4\n1.6\n1.2\n\nc4.5_w_rf\ncart_w_rf\n\n0.8\n\n2\n\nh_rf\n\n0.6\n\nc4.5_rf\ncart_w_rf\n\n0.4\n\nchaid_rf\n\n20\n\n30\n\n40\n\n2\n\nh_w_rf\n\n1.2\n\nc4.5_w_rf\ncart_w_rf\n\n1.0\n\nchaid_w_rf\n\nlog (m)+1\n\n10\n\nc s\n\nc 
s\n\n2\n\n1.4\nh_w_rf\n\n1.0\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\nchaid_w_rf\n\nlog (m)+1\n2\n\n0.8\n\nh_rf\nc4.5_rf\n\n0.6\n\ncart_rf\nchaid_rf\n\n0.4\n\n130\n\n10\n\n20\n\nnumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nnumber of features\n\nfigure 4. c s2 changes against the number of features in the subspace on the 8 high dimensional datasets\n\noutperformed the other classification methods in all\ndatasets.\n\n5.\n\nconclusions\n\nin this paper, we presented a hybrid weighted random\nforest algorithm by simultaneously using a feature\nweighting method and a hybrid forest method to classify\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\nhybrid weighted random forests for classifying very high-dimensional data\nfbis\n\n0.86\n\n11\n\nre0\n\n0.88\n\n0.84\n0.84\n0.80\n\n0.76\n\n0.80\nh_w_rf\nc4.5_w_rf\n\n0.78\n\ncart_w_rf\n\naccuracy\n\naccuracy\n\n0.82\n\nchaid_w_rf\nh_rf\n\n0.76\n\n0.72\nh_w_rf\n\n0.68\n\nc4.5_w_rf\ncart_w_rf\n\n0.64\n\nchaid_w_rf\nh_rf\n\n0.60\n\nc4.5_rf\n\nc4.5_rf\n\nlog (m)+1\n\ncart_rf\n\n2\n\n0.74\n\ncart_rf\n\nlog (m)+1\n\n0.56\n\n2\n\nchaid_rf\n\nchaid_rf\n\n0.52\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\nnumber of features\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nre1\n\ntr41\n\n1.00\n\n0.86\n0.95\n\n0.84\n\nlog (m)+1\n\n0.82\n\n0.90\n\n2\n\n0.78\nh_w_rf\nc4.5_w_rf\n\n0.76\n\ncart_w_rf\n\n0.74\n\naccuracy\n\naccuracy\n\n0.80\n\nchaid_w_rf\n\n0.85\n\nh_w_rf\n\n0.80\n\nc4.5_w_rf\ncart_w_rf\n\n0.75\n\nchaid_w_rf\nh_rf\n\nh_rf\n\n0.72\n\nlog (m)+1\n\n0.70\n\nc4.5_rf\n\nc4.5_rf\n\n2\n\ncart_rf\n\ncart_rf\n\n0.70\n\n0.65\n\nchaid_rf\n\nchaid_rf\n\n0.68\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of 
features\n\nwap\n\n0.84\n\n40\n\ntr31\n\n1.000\n\n0.81\n\n0.975\n\n0.78\n\n0.950\n\n0.75\n\nh_w_rf\nc4.5_w_rf\n\n0.69\n\ncart_w_rf\n\naccuracy\n\naccuracy\n\n0.925\n\n0.72\n\n0.900\n\nh_w_rf\nc4.5_w_rf\n\n0.875\n\ncart_w_rf\nchaid_w_rf\n\nchaid_w_rf\n\n0.66\n\nh_rf\n\nlog (m)+1\n2\n\n0.850\n\nh_rf\n\nc4.5_rf\n\n0.63\n\ncart_rf\n\n0.60\n\nc4.5_rf\n\nlog (m)+1\n2\n\n0.825\n\ncart_rf\nchaid_rf\n\nchaid_rf\n\n0.800\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n10\n\n20\n\n30\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\nnumber of features\n\nnumber of features\n\nla2s\n\n0.900\n\n40\n\nla1s\n\n0.88\n\n0.885\n0.86\n0.870\n\naccuracy\n\n0.840\nh_w_rf\nc4.5_w_rf\n\n0.825\n\ncart_w_rf\nchaid_w_rf\n\n0.810\n\naccuracy\n\n0.84\n\n0.855\n\n0.82\n\nh_w_rf\nc4.5_w_rf\ncart_w_rf\n\n0.80\n\nchaid_w_rf\nh_rf\n\nh_rf\n\nlog (m)+1\n\n0.795\n\nc4.5_rf\n\n2\n\nc4.5_rf\n\nlog (m)+1\n\n0.78\n\n2\n\ncart_rf\n\ncart_rf\nchaid_rf\n\nchaid_rf\n\n0.780\n\n0.76\n10\n\n20\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\n10\n\n20\n\nnumber of features\n\n30\n\n40\n\n50\n\n60\n\n70\n\n80\n\n90\n\n100\n\n110\n\n120\n\n130\n\nnumber of features\n\nfigure 5. test accuracy changes against the number of features in the subspace on the 8 high dimensional datasets\n\nhigh dimensional data. our algorithm not only retains\na small subspace size (breiman's formula log2 (m ) + 1\nfor determining the subspace size) to create accurate\nrandom forest models, but also effectively reduces\nthe upper bound of the generalization error and\n\nimproves classification performance. from the results of\nexperiments on various high dimensional datasets, the\nrandom forest generated by our new method is superior\nto other classification methods. we can use the default\nlog2 (m ) + 1 subspace size and generally guarantee\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\n12\n\nbaoxun xu, joshua zhexue huang, graham williams, yunming ye\n\ntable 3. 
the comparison of results\ndatasets\ndataset\nfbis\nmeasures\nacc\nmic\nsvm\n0.834 0.799\nknn\n0.78\n0.752\nnb\n0.776 0.74\nh rf\n0.853 0.816\nc4.5 rf\n0.836 0.806\ncart rf\n0.829 0.797\nchaid rf\n0.842 0.805\nh w rf\n0.856 0.825\nc4.5 w rf\n0.841 0.809\ncart w rf\n0.835 0.805\nchaid w rf\n0.839 0.815\ndataset\nwap\nmeasures\nacc\nmic\nsvm\n0.81\n0.772\nknn\n0.752 0.622\nnb\n0.797 0.742\nh rf\n0.815 0.805\nc4.5 rf\n0.797 0.795\ncart rf\n0.793 0.793\nchaid rf\n0.805 0.805\nh w rf\n0.815 0.805\nc4.5 w rf\n0.805 0.795\ncart w rf\n0.8\n0.792\nchaid w rf\n0.811 0.795\n\n(best accuracy, micro f1, and macro f1 results) of the eleven methods on the 8\nre0\nmic\n0.795\n0.752\n0.741\n0.82\n0.802\n0.798\n0.8\n0.825\n0.815\n0.81\n0.812\ntr31\nmac\nacc\nmic\n0.663 0.955 0.907\n0.622 0.905 0.82\n0.559 0.925 0.832\n0.735 0.965 0.925\n0.732 0.962 0.902\n0.73\n0.958 0.892\n0.732 0.96\n0.9\n0.735 0.965 0.925\n0.732 0.962 0.911\n0.73\n0.96\n0.902\n0.73\n0.96\n0.905\n\nmac\n0.76\n0.722\n0.706\n0.816\n0.806\n0.787\n0.805\n0.82\n0.815\n0.81\n0.812\n\nacc\n0.804\n0.779\n0.784\n0.845\n0.836\n0.826\n0.832\n0.855\n0.845\n0.839\n0.842\n\nto always produce the best models, on a variety of\nmeasures, by using the hybrid weighted random forest\nalgorithm.\nacknowledgements\nthis research is supported in part by nsfc under\ngrant no.61073195, and shenzhen new industry development fund under grant no.cxb201005250021a\nreferences\n[1] breiman, l. (2001) random forests. machine learning,\n45, 5-32.\n[2] ho, t. (1998) random subspace method for constructing decision forests. ieee transactions on pattern\nanalysis and machine intelligence, 20, 832-844.\n[3] quinlan, j. (1993) c4.5: programs for machine\nlearning. morgan kaufmann.\n[4] breiman, l. (1984) classification and regression trees.\nchapman & hall crc.\n[5] breiman, l. (1996) bagging predictors.\nmachine\nlearning, 24, 123-140.\n[6] ho, t. (1995) random decision forests. 
proceedings\nof the third international conference on document\nanalysis and recognition, pp. 278-282. ieee.\n[7] dietterich, t. (2000) an experimental comparison of\nthree methods for constructing ensembles of decision\ntrees: bagging, boosting, and randomization. machine\nlearning, 40, 139-157.\n\nmac\n0.756\n0.752\n0.619\n0.82\n0.802\n0.798\n0.8\n0.822\n0.812\n0.805\n0.815\nmac\n0.87\n0.762\n0.81\n0.88\n0.87\n0.86\n0.852\n0.88\n0.87\n0.865\n0.855\n\nre1\nmic\n0.826\n0.668\n0.732\n0.832\n0.811\n0.808\n0.815\n0.836\n0.826\n0.818\n0.83\nla2s\nacc\nmic\n0.89\n0.832\n0.841 0.805\n0.896 0.815\n0.89\n0.84\n0.878 0.83\n0.882 0.832\n0.88\n0.83\n0.896 0.848\n0.886 0.835\n0.887 0.835\n0.887 0.833\nacc\n0.829\n0.788\n0.816\n0.841\n0.825\n0.825\n0.838\n0.848\n0.838\n0.835\n0.84\n\ntr41\nmic\n0.915\n0.813\n0.856\n0.926\n0.92\n0.891\n0.903\n0.926\n0.922\n0.91\n0.915\nla1s\nmac\nacc\nmic\n0.807 0.875 0.82\n0.786 0.827 0.798\n0.79\n0.87\n0.802\n0.82\n0.862 0.825\n0.81\n0.855 0.82\n0.81\n0.84\n0.815\n0.803 0.845 0.816\n0.825 0.875 0.836\n0.816 0.866 0.825\n0.812 0.87\n0.825\n0.81\n0.865 0.825\nmac\n0.706\n0.638\n0.58\n0.8\n0.781\n0.783\n0.795\n0.81\n0.795\n0.79\n0.8\n\nacc\n0.95\n0.915\n0.935\n0.953\n0.948\n0.917\n0.926\n0.953\n0.95\n0.935\n0.942\n\nmac\n0.87\n0.765\n0.782\n0.895\n0.89\n0.88\n0.88\n0.895\n0.892\n0.88\n0.88\nmac\n0.803\n0.761\n0.775\n0.805\n0.798\n0.792\n0.795\n0.82\n0.81\n0.81\n0.805\n\n[8] banfield, r., hall, l., bowyer, k., and kegelmeyer, w.\n(2007) a comparison of decision tree ensemble creation\ntechniques. ieee transactions on pattern analysis\nand machine intelligence, 29, 173-180.\n\n[9] robnik-sikonja,\nm. (2004) improving random forests.\nproceedings of the 15th european conference on\nmachine learning, pp. 359-370. springer.\n[10] ho, t. (1998) c4.5 decision forests. proceedings of\nthe fourteenth international conference on pattern\nrecognition, pp. 545-549. ieee.\n[11] dietterrich, t. (1997) machine learning research: four\ncurrent direction. 
artificial intelligence magzine, 18,\n97-136.\n[12] amaratunga, d., cabrera, j., and lee, y. (2008)\nenriched random forests. bioinformatics, 24, 2010-\n2014.\n[13] ye, y., li, h., deng, x., and huang, j. (2008)\nfeature weighting random forest for detection of hidden\nweb search interfaces. the journal of computational\nlinguistics and chinese language processing, 13, 387-\n404.\n[14] xu, b., huang, j., williams, g., wang, q., and\nye, y. (2012) classifying very high-dimensional data\nwith random forests built from small subspaces.\ninternational journal of data warehousing and\nmining, 8, 45-62.\n[15] xu, b., huang, j., williams, g., li, j., and ye, y.\n(2012) hybrid random forests: advantages of mixed\ntrees in classifying text data. proceedings of the 16th\npacific-asia conference on knowledge discovery and\ndata mining. springer.\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\nhybrid weighted random forests for classifying very high-dimensional data\n[16] biggs, d., de ville, b., and suen, e. (1991) a method\nof choosing multiway partitions for classification and\ndecision trees. journal of applied statistics, 18, 49-62.\n[17] ture, m., kurt, i., turhan kurum, a., and ozdamar,\nk. (2005) comparing classification techniques for\npredicting essential hypertension. expert systems with\napplications, 29, 583-588.\n[18] begum, n., m.a., f., and ren, f. (2009) automatic text summarization using support vector machine.\ninternational journal of innovative computing, information and control, 5, 1987-1996.\n[19] chen, j., huang, h., tian, s., and qu, y. (2009)\nfeature selection for text classification with naive\nbayes. expert systems with applications, 36, 5432-\n5435.\n[20] tan, s. (2005) neighbor-weighted k-nearest neighbor\nfor unbalanced text corpus.\nexpert systems with\napplications, 28, 667-671.\n[21] pearson, k. (1904) on the theory of contingency and\nits relation to association and normal correlation.\ncambridge university press.\n[22] yang, y. 
and liu, x. (1999) a re-examination of\ntext categorization methods. proceedings of the 22th\ninternational conference on research and development\nin information retrieval, pp. 42-49. acm.\n[23] han, e. and karypis, g. (2000) centroid-based\ndocument classification: analysis and experimental\nresults. proceedings of the 4th european conference on\nprinciples of data mining and knowledge discovery,\npp. 424-431. springer.\n[24] trec.\n(2011)\ntext\nretrieval\nconference,\nhttp: trec.nist.gov.\n[25] lewis,\nd.\n(1999)\nreuters-21578\ntext\ncategorization\ntest\ncollection\ndistribution\n1.0,\nhttp: www.research.att.com lewis.\n[26] han, e., boley, d., gini, m., gross, r., hastings,\nk., karypis, g., kumar, v., mobasher, b., and\nmoore, j. (1998) webace: a web agent for document\ncategorization and exploration. proceedings of the 2nd\ninternational conference on autonomous agents, pp.\n408-415. acm.\n[27] mccallum, a. and nigam, k. (1998) a comparison of\nevent models for naive bayes text classification. aaai98 workshop on learning for text categorization, pp. 41-\n48.\n[28] witten, i., frank, e., and hall, m. (2011) data mining:\npractical machine learning tools and techniques.\nmorgan kaufmann.\n\nthe computer journal, vol. ??,\n\nno. ??,\n\n????\n\n13\n
```

General character processing functions in R can be used to transform our corpus. A common requirement is to map the documents to lower case using base::tolower(). As above, such functions must be wrapped with tm::content_transformer() so that they act on the content of each document and return a document rather than a bare character vector.
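As a minimal sketch of the wrapping step, the following applies the same transformation to a tiny stand-in corpus. The object name `docs_demo` and the two sample strings are illustrative assumptions, not part of the corpus used in this chapter.

```r
library(tm)

# A tiny illustrative corpus; these strings are made up for the example.
docs_demo <- VCorpus(VectorSource(c("Hybrid Weighted Random Forests",
                                    "C4.5, CART, and CHAID")))

# content_transformer() wraps tolower() so that it rewrites the content of
# each document and returns a document, keeping the corpus structure intact.
docs_demo <- tm_map(docs_demo, content_transformer(tolower))

# Pull the transformed text back out of each document as a plain string.
lowered <- vapply(seq_along(docs_demo),
                  function(i) content(docs_demo[[i]]),
                  character(1))
print(lowered)
```

The same pattern should carry over to other base string functions, such as base::toupper(), when they need to be applied across a corpus.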


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.