## 21.16 Specific Transformations

We may also want to perform transformations that are specific to our corpus. The examples here may or may not be useful, depending on how we want to analyse the documents. Here we replace the full names of three institutions with their acronyms. This is purely for illustration, using the part of the document we are looking at, rather than a suggestion that this specific transform adds value.

```
library(tm)  # Provides content_transformer() and tm_map().

# Wrap gsub() so that tm_map() can apply it to the content of each document.
toString <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, toString, "harbin institute technology", "HIT")
docs <- tm_map(docs, toString, "shenzhen institutes advanced technology", "SIAT")
docs <- tm_map(docs, toString, "chinese academy sciences", "CAS")
```
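To see what each of these calls does to a single document, the substitution can be sketched in base R without a corpus. This is only a minimal illustration: the sample string `x`, the lookup vector `abbrev`, and the helper `replace_phrases()` are hypothetical names introduced here, not part of tm. The sketch also assumes, as the patterns above suggest, that the text has already been lower-cased and had its stop words removed, which is why the phrase to match is "harbin institute technology" with no "of".

```r
# Hypothetical sample text, already lower-cased with stop words removed.
x <- "computer science harbin institute technology shenzhen graduate school"

# Named vector mapping each phrase (the name) to its acronym (the value).
abbrev <- c("harbin institute technology"             = "HIT",
            "shenzhen institutes advanced technology" = "SIAT",
            "chinese academy sciences"                = "CAS")

# Hypothetical helper: apply each replacement in turn.
# fixed = TRUE treats the phrase as a literal string rather than a
# regular expression, which is all we need for these phrases.
replace_phrases <- function(x, map) {
  for (from in names(map)) {
    x <- gsub(from, map[[from]], x, fixed = TRUE)
  }
  x
}

result <- replace_phrases(x, abbrev)
print(result)
## [1] "computer science HIT shenzhen graduate school"
```

Keeping the phrases in a named vector makes it easy to extend the list, at the cost of hiding the individual `tm_map()` calls. Because the match is on exact text, the order of the earlier transformations matters: run before stop word removal, none of these phrases would be found.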

`inspect(docs[16])`

```
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## hwrf12.txt
## hybrid weighted random forests classifying highdimensional data baoxun xu joshua zhexue huang graham williams yunming ye computer science HIT shenzhen graduate school shenzhen china SIAT CAS shenzhen china amusing gmailcom random forests popular classification method based ensemble single type decision trees subspaces data literature many different types decision tree algorithms including c cart chaid type decision tree algorithm may capture different information structure paper proposes hybrid weighted random forest algorithm simultaneously using feature weighting method hybrid forest method classify high dimensional data hybrid weighted random forest algorithm can effectively reduce subspace size improve classification performance without increasing error bound conduct series experiments eight high dimensional datasets compare method traditional random forest methods classification methods results show method consistently outperforms traditional methods keywords random forests hybrid weighted random forest classification decision tree introduction random forests popular classification method builds ensemble single type decision trees different random subspaces data decision trees often either built using c cart one type within single random forest recent years random forests attracted increasing attention due competitive performance compared classification methods especially highdimensional data algorithmic intuitiveness simplicity important capability ensemble using bagging stochastic discrimination several methods proposed grow random forests subspaces data methods popular forest construction procedure proposed breiman first use bagging generate training data subsets building individual trees subspace features randomly selected node grow branches decision tree trees combined ensemble forest ensemble learner performance random forest highly dependent two factors performance tree diversity trees forests breiman formulated overall performance set trees average 
strength proved generalization error random forest bounded ratio average correlation trees divided square average strength trees high dimensional data text data usually large portion features uninformative classes forest building process informative features large chance missed randomly select small subspace breiman suggested selecting log m features subspace m number independent features data high dimensional data result weak trees created subspaces average strength trees reduced error bound random forest enlarged therefore large proportion weak trees generated random forest forest large likelihood make wrong decision mainly results weak trees classification power address problem aim optimize decision trees random forest two strategies one straightforward strategy enhance classification performance individual trees feature weighting method subspace sampling method feature weights computed respect correlations features class feature regarded probabilities feature selected subspaces method obviously increases classification performance individual computer journal vol baoxun xu joshua zhexue huang graham williams yunming ye trees subspaces will biased contain informative features however chance correlated trees also increased since features large weights likely repeatedly selected second strategy straightforward use several different types decision trees training data subset increase diversity trees select optimal tree individual tree classifier random forest model work presented extends algorithm developed specifically build three different types tree classifiers c cart chaid training data subset evaluate performance three classifiers select best tree way build hybrid random forest may include different types decision trees ensemble added diversity decision trees can effectively improve accuracy tree forest hence classification performance ensemble however use method build best random forest model classifying high dimensional data can sure subspace size best paper 
propose hybrid weighted random forest algorithm simultaneously using new feature weighting method together hybrid random forest method classify high dimensional data new random forest algorithm calculate feature weights use weighted sampling randomly select features subspaces node building different types trees classifiers c cart chaid training data subset select best tree individual tree final ensemble model experiments performed high dimensional text datasets dimensions ranging compared performance eight random forest methods wellknown classification methods c random forest cart random forest chaid random forest hybrid random forest c weighted random forest cart weighted random forest chaid weighted random forest hybrid weighted random forest support vector machines naive bayes knearest neighbors experimental results show hybrid weighted random forest achieves improved classification performance ten competitive methods remainder paper organized follows section introduce framework building hybrid weighted random forest describe new random forest algorithm section summarizes four measures evaluate random forest models present experimental results high dimensional text datasets section section contains conclusions table contingency table input feature class feature y y y y yj y yq total j q ai ij iq ap p pj pq p total j q general framework building hybrid random forests integrating two methods propose novel hybrid weighted random forest algorithm let y class target feature q distinct class labels yj j q purposes discussion consider single categorical feature dataset d p distinct category values denote distinct values ai p numeric features can discretized p intervals supervised discretization method assume d val objects size subset d satisfying condition ai y yj denoted ij considering combinations categorical values labels y can obtain contingency table y shown table far right column contains marginal totals feature hybrid forests weighted random section first 
introduce feature weighting method subspace sampling present q ij p j bottom row marginal totals class feature y j p ij j q grand total total number samples bottom right corner q p ij j given training dataset d feature first compute contingency table feature weights computed using two methods discussed following subsection notation feature weighting method subsection give details feature weighting method subspace sampling random forests consider mdimensional feature space present compute computer journal vol hybrid weighted random forests classifying highdimensional data weights w w wm every feature space weights used improved algorithm grow decision tree random forest feature weight computation weight feature represents correlation values feature values class feature y larger weight will indicate class labels objects training dataset correlated values feature indicating informative class objects thus suggested stronger power predicting classes new objects following propose use chisquare statistic compute feature weights method can quantify correspondence two categorical variables given contingency table input feature class feature y dataset d chisquare statistic two features computed corra y q p ij tij tij j ij observed frequency contingency table tij expected frequency computed x j tij larger measure corra y informative feature predicting class y normalized feature weight practice feature weights normalized feature subspace sampling use corra y measure informativeness features consider feature weights however treat weights probabilities features normalize measures ensure sum normalized feature weights equal let corrai y m set m feature measures compute normalized weights corrai y wi n corrai y use square root smooth values measures wi can considered probability feature ai randomly sampled subspace informative feature larger weight higher probability feature selected diversity commonly obtained using bagging random subspace sampling introduce element diversity 
using different types trees considering analogy forestry different data subsets bagging represent soil structures different decision tree algorithms represent different tree species approach two key aspects one use three types decision tree algorithms generate three different tree classifiers training data subset evaluate accuracy tree measure tree importance paper use outofbag accuracy assess importance tree following breiman use bagging generate series training data subsets build trees tree data subset used grow tree called inofbag iob data remaining data subset called outofbag oob data since oob data used building trees can use data objectively evaluate trees accuracy importance oob accuracy gives unbiased estimate true accuracy model given n instances training dataset d tree classifier hk iobk built kth training data subset iobk define oob accuracy tree hk iobk di d n oobacck framework building hybrid random forest ensemble learner performance random forest highly dependent two factors diversity among trees accuracy tree ihk di yi di oobk n idi oobk indicator function larger oobacck better classification quality tree use outofbag data subset oobi calculate outofbag accuracies three types trees c cart chaid evaluation values e e e respectively fig illustrates procedure building hybrid random forest model firstly series iob oob datasets generated entire training dataset bagging three types tree classifiers c cart chaid built using iob dataset corresponding oob dataset used calculate oob accuracies three tree classifiers finally select tree highest oob accuracy final tree classifier included hybrid random forest building hybrid random forest model way will increase diversity among trees classification performance individual tree classifier also maximized decision tree algorithms core approach diversity decision tree algorithms random forest different decision tree algorithms grow structurally different trees training data selecting good decision tree algorithm 
grow trees random forest critical computer journal vol baoxun xu joshua zhexue huang graham williams yunming ye difference lies way split node split functions binary branches multibranches work use different decision tree algorithms build hybrid random forest figure hybrid random forests framework performance random forest studies considered different decision tree algorithms affect random forest paper common decision tree algorithms follows classification trees c supervised learning classification algorithm used construct decision trees given set preclassified objects described vector attribute values construct mapping attribute values classes c uses divideandconquer approach grow decision trees beginning entire dataset tree constructed considering predictor variable dividing dataset best predictor chosen node using impurity diversity measure goal produce subsets data homogeneous respect target variable c selects test maximizes information gain ratio igr classification regression tree cart recursive partitioning method can used regression classification main difference c cart test selection evaluation process chisquared automatic interaction detector chaid method based chisquare test association chaid decision tree constructed repeatedly splitting subsets space two nodes determine best split node allowable pair categories predictor variables merged statistically significant difference within pair respect target variable decision tree algorithms can see hybrid weighted random forest algorithm subsection present hybrid weighted random forest algorithm simultaneously using feature weights hybrid method classify high dimensional data benefits algorithm two aspects firstly compared hybrid forest method can use small subspace size create accurate random forest models secondly compared building random forest using feature weighting can use several different types decision trees training data subset increase diversities trees added diversity decision trees can effectively 
improve classification performance ensemble model detailed steps introduced algorithm input parameters algorithm include training dataset d set features class feature y number trees random forest k size subspaces m output random forest model m lines form loop building k decision trees loop line samples training data d sampling replacement generate inofbag data subset iobi building decision tree line build three types tree classifiers c cart chaid procedure line calls function createt reej build tree classifier line calculates outofbag accuracy tree classifier procedure line selects tree classifier maximum outofbag accuracy k decision tree trees thus generated form hybrid weighted random forest model m generically function createt reej first creates new node tests stopping criteria decide whether return upper node split node choose split node feature weighting method used randomly select m features subspace node splitting features used candidates generate best split partition node subset partition createt reej called create new node current node leaf node created returns parent node recursive process continues full tree generated computer journal vol hybrid weighted random forests classifying highdimensional data algorithm new random forest algorithm input d training dataset features space y class features space y y yq k number trees m size subspaces output random forest m method k draw bootstrap sample inofbag data subset iobi outofbag data subset oobi training dataset d j hij iobi createt reej use outofbag data subset oobi calculate outofbag accuracy oobacci j tree classifier hij iobi equation end select hi iobi highest outofbag accuracy oobacci optimal tree end combine k tree classifiers h iob h iob hk iobk random forest m function createtree create new node n stopping criteria met return n leaf node else j m compute informativeness measure corraj y equation end compute feature weights w w wm equation use feature weighting method randomly select m features use m 
features candidates generate best split node partitioned call createtree split end return n evaluation measures paper use five measures ie strength correlation error bound c s test accuracy f metric evaluate random forest models strength measures collective performance individual trees random forest correlation measures diversity trees ratio correlation square strength c s indicates generalization error bound random forest model three measures introduced accuracy measures performance random forest model unseen test data f metric commonly used measure classification performance strength correlation measures follow breimans method described calculate strength correlation ratio c s following breimans notation denote strength s correlation let hk iobk kth tree classifier grown kth training data iobk sampled d replacement assume random forest model contains k trees outofbag proportion votes di d class j k ihk di j di iobk qdi j kk iobk k idi number trees random forest trained without di classify di class j divided number training datasets containing di strength s computed qdi yi maxjyi qdi j n n s n number objects d yi indicates true class di correlation computed n qdi yi maxjyi qdi j s n k k k pk pk k pk p n pk ihk di yi di iobk n iobk idi n pk ihk di jdi y di iobk n id iob k jdi y argmaxjyi qd j class obtains maximal number votes among classes true class general error bound measure c s given strength correlation outofbag estimate c s measure can computed important theoretical result breimans method upper bound generalization error random forest ensemble derived p e s s mean value correlations pairs individual classifiers s strength set individual classifiers estimated computer journal vol baoxun xu joshua zhexue huang graham williams yunming ye average accuracy individual classifiers d outofbag evaluation inequality shows generalization error random forest affected strength individual classifiers mutual correlations therefore breiman defined c s ratio measure random 
forest c s s smaller ratio better performance random forest c s gives guidance reducing generalization error random forests test accuracy test accuracy measures classification performance random forest test data set let dt test data yt class labels given di dt number votes di class j n di j k ihk di j table summary statistic highdimensional datasets name features instances classes minority fbis re re tr wap tr las las emphasizes performance classifier rare categories define follows t pi t pi t pi f pi t pi f ni f category macroaveraged f computed k test accuracy calculated f di yi maxjyi n di j n m acrof q q f n acc n number objects dt yi indicates true class di f metric evaluate performance classification methods dealing unbalanced class distribution use f metric introduced yang liu measure equal harmonic mean recall precision overall f score entire classification problem can computed microaverage macroaverage microaveraged f computed globally classes emphasizes performance classifier common classes define follows q q t pi t pi q q t pi f pi t pi f ni q number classes t pi true positives number objects correctly predicted class f pi false positives number objects predicted belong class microaveraged f computed m icrof macroaveraged f first computed locally class average classes taken larger microf macrof values higher classification performance classifier experiments section present two experiments demonstrate effectiveness new random forest algorithm classifying high dimensional data high dimensional datasets various sizes characteristics used experiments first experiment designed show proposed method can reduce generalization error bound c s improve test accuracy size selected subspace large second experiment used demonstrate classification performance proposed method comparison classification methods ie svm nb knn datasets experiments used eight realworld high dimensional datasets datasets selected due diversities number features number instances number classes 
dimensionalities vary instances vary minority class rate varies dataset randomly select instances training dataset remaining data test dataset detailed information eight datasets listed table fbis re re tr wap tr las las datasets classical text classification benchmark datasets carefully selected computer journal vol hybrid weighted random forests classifying highdimensional data preprocessed han karypis dataset fbis compiled foreign broadcast information service trec datasets re re selected reuters text categorization test collection distribution datasets tr tr derived trec trec trec dataset wap webace project wap datasets las las selected los angeles times trec classes datasets generated relevance judgment provided collections performance comparisons random forest methods purpose experiment evaluate effect hybrid weighted random forest method h w rf strength correlation c s test accuracy eight high dimensional datasets analyzed results compared seven random forest methods ie c random forest c rf cart random forest cart rf chaid random forest chaid rf hybrid random forest h rf c weighted random forest c w rf cart weighted random forest cart w rf chaid weighted random forest chaid w rf dataset ran random forest algorithm different sizes feature subspaces since number features datasets large started subspace features increased subspace features time given subspace size built trees random forest model order obtain stable result built random forest models subspace size dataset algorithm computed average values four measures strength correlation c s test accuracy final results comparison performance eight random forest algorithms four measures datasets shown figs fig plots strength eight methods different subspace sizes datasets subspace higher strength better result curves can see new algorithm h w rf consistently performs better seven random forest algorithms advantages obvious small subspaces new algorithm quickly achieved higher strength subspace size increases 
seven random forest algorithms require larger subspaces achieve higher strength results indicate hybrid weighted random forest algorithm enables random forest models achieve higher strength small subspace sizes compared seven random forest algorithms fig plots curves correlations eight random forest methods datasets small subspace sizes h rf c rf cart rf chaid rf produce higher correlations trees datasets correlation decreases subspace size increases random forest models lower correlation trees better final model new random forest algorithm h w rf low correlation level achieved small subspaces datasets also note subspace size increased correlation level increased well understandable subspace size increases informative features likely selected repeatedly subspaces increasing similarity decision trees therefore feature weighting method subspace selection works well small subspaces least point view correlation measure fig shows error bound indicator c s eight methods datasets figures can observe subspace size increases c s consistently reduces behaviour indicates subspace size larger log m benefits eight algorithms however new algorithm h w rf achieved lower level c s subspace size log m seven algorithms fig plots curves showing accuracy eight random forest models test datasets datasets can clearly see new random forest algorithm h w rf outperforms seven random forest algorithms eight data sets can seen new method stable classification performance methods figures observed highest test accuracy often obtained default subspace size log m implies practice large size subspaces necessary grow highquality trees random forests performance comparisons classification methods conducted experimental comparison three widely used text classification methods support vector machines svm naive bayes nb knearest neighbor knn support vector machine used linear kernel regularization parameter often used text categorization naive bayes adopted multivariate bernoulli event model 
frequently used text classification knearest neighbor knn set number k neighbors experiments used wekas implementation three text classification methods used single subspace size features eight datasets run random forest algorithms h rf c rf cart rf chaid rf used subspace size features first datasets ie fbis re re tr wap computer journal vol baoxun xu joshua zhexue huang graham williams yunming ye fbis re strength strength hwrf cwrf cartwrf hwrf cwrf cartwrf chaidwrf chaidwrf hrf hrf crf crf cartrf cartrf chaidrf chaidrf number features number features re tr hwrf cwrf cartwrf strength strength hwrf cwrf cartwrf chaidwrf chaidwrf hrf hrf crf crf cartrf cartrf chaidrf chaidrf number features number features wap tr strength hwrf cwrf cartwrf strength hwrf cwrf cartwrf chaidwrf chaidwrf hrf hrf crf crf cartrf cartrf chaidrf chaidrf las las hwrf cwrf cartwrf chaidwrf strength strength number features number features hwrf cwrf cartwrf chaidwrf hrf crf hrf crf cartrf chaidrf cartrf chaidrf number features number features figure strength changes number features subspace high dimensional datasets tr run random forest algorithms used subspace size features last datasets las las run random forest algorithms h w rf c w rf cart w rf chaid w rf used breimans subspace size log m run random forest algorithms number features provided consistent result shown fig order obtain stable results built random forest models random forest algorithm dataset present average computer journal vol hybrid weighted random forests classifying highdimensional data fbis re correlation correlation hwrf cwrf cartwrf chaidwrf hwrf cwrf cartwrf chaidwrf hrf crf hrf crf cartrf chaidrf cartrf chaidrf number features number features re tr hwrf cwrf cartwrf chaidwrf correlation correlation hwrf cwrf cartwrf chaidwrf hrf hrf crf crf cartrf cartrf chaidrf chaidrf number features number features wap tr correlation hwrf cwrf cartwrf chaidwrf correlation hwrf cwrf cartwrf chaidwrf hrf hrf crf crf cartrf cartrf 
chaidrf chaidrf number features number features las las hwrf cwrf cartwrf chaidwrf correlation correlation hwrf cwrf cartwrf chaidwrf hrf hrf crf crf cartrf cartrf chaidrf chaidrf number features number features figure correlation changes number features subspace high dimensional datasets results noting range values less hybrid trees always accurate comparison results classification performance eleven methods shown table performance estimated using test accuracy acc micro f mic macro f mac boldface denotes best results eleven classification methods improvement often quite small always improvement demonstrated observe proposed method h w rf computer journal vol baoxun xu joshua zhexue huang graham williams yunming ye fbis re log m cwrf hwrf c s c s cartwrf hwrf cwrf cartwrf chaidwrf chaidwrf hrf hrf log m crf crf cartrf cartrf chaidrf chaidrf number features log m c s hwrf tr c s number features re cwrf hwrf cwrf cartwrf cartwrf chaidwrf chaidwrf hrf hrf crf log m crf cartrf cartrf chaidrf chaidrf number features number features wap tr log m log m cwrf c s c s hwrf hwrf cwrf cartwrf cartwrf chaidwrf chaidwrf hrf hrf crf crf cartrf cartrf chaidrf chaidrf number features number features las las cwrf cartwrf hrf crf cartwrf chaidrf hwrf cwrf cartwrf chaidwrf log m c s c s hwrf chaidwrf log m hrf crf cartrf chaidrf number features number features figure c s changes number features subspace high dimensional datasets outperformed classification methods datasets conclusions paper presented hybrid weighted random forest algorithm simultaneously using feature weighting method hybrid forest method classify computer journal vol hybrid weighted random forests classifying highdimensional data fbis re hwrf cwrf cartwrf accuracy accuracy chaidwrf hrf hwrf cwrf cartwrf chaidwrf hrf crf crf log m cartrf cartrf log m chaidrf chaidrf number features number features re tr log m hwrf cwrf cartwrf accuracy accuracy chaidwrf hwrf cwrf cartwrf chaidwrf hrf hrf log m crf crf cartrf cartrf 
chaidrf chaidrf number features number features wap tr hwrf cwrf cartwrf accuracy accuracy hwrf cwrf cartwrf chaidwrf chaidwrf hrf log m hrf crf cartrf crf log m cartrf chaidrf chaidrf number features number features las las accuracy hwrf cwrf cartwrf chaidwrf accuracy hwrf cwrf cartwrf chaidwrf hrf hrf log m crf crf log m cartrf cartrf chaidrf chaidrf number features number features figure test accuracy changes number features subspace high dimensional datasets high dimensional data algorithm retains small subspace size breimans formula log m determining subspace size create accurate random forest models also effectively reduces upper bound generalization error improves classification performance results experiments various high dimensional datasets random forest generated new method superior classification methods can use default log m subspace size generally guarantee computer journal vol baoxun xu joshua zhexue huang graham williams yunming ye table comparison results datasets dataset fbis measures acc mic svm knn nb h rf c rf cart rf chaid rf h w rf c w rf cart w rf chaid w rf dataset wap measures acc mic svm knn nb h rf c rf cart rf chaid rf h w rf c w rf cart w rf chaid w rf best accuracy micro f macro f results eleven methods re mic tr mac acc mic mac acc always produce best models variety measures using hybrid weighted random forest algorithm acknowledgements research supported part nsfc grant shenzhen new industry development fund grant nocxba references breiman l random forests machine learning ho t random subspace method constructing decision forests ieee transactions pattern analysis machine intelligence quinlan j c programs machine learning morgan kaufmann breiman l classification regression trees chapman hall crc breiman l bagging predictors machine learning ho t random decision forests proceedings third international conference document analysis recognition pp ieee dietterich t experimental comparison three methods constructing ensembles decision 
trees bagging boosting randomization machine learning mac mac re mic las acc mic acc tr mic las mac acc mic mac acc mac mac banfield r hall l bowyer k kegelmeyer w comparison decision tree ensemble creation techniques ieee transactions pattern analysis machine intelligence robniksikonja m improving random forests proceedings th european conference machine learning pp springer ho t c decision forests proceedings fourteenth international conference pattern recognition pp ieee dietterrich t machine learning research four current direction artificial intelligence magzine amaratunga d cabrera j lee y enriched random forests bioinformatics ye y li h deng x huang j feature weighting random forest detection hidden web search interfaces journal computational linguistics chinese language processing xu b huang j williams g wang q ye y classifying highdimensional data random forests built small subspaces international journal data warehousing mining xu b huang j williams g li j ye y hybrid random forests advantages mixed trees classifying text data proceedings th pacificasia conference knowledge discovery data mining springer computer journal vol hybrid weighted random forests classifying highdimensional data biggs d de ville b suen e method choosing multiway partitions classification decision trees journal applied statistics ture m kurt turhan kurum ozdamar k comparing classification techniques predicting essential hypertension expert systems applications begum n ma f ren f automatic text summarization using support vector machine international journal innovative computing information control chen j huang h tian s qu y feature selection text classification naive bayes expert systems applications tan s neighborweighted knearest neighbor unbalanced text corpus expert systems applications pearson k theory contingency relation association normal correlation cambridge university press yang y liu x reexamination text categorization methods proceedings th international conference 
research development information retrieval pp acm han e karypis g centroidbased document classification analysis experimental results proceedings th european conference principles data mining knowledge discovery pp springer trec text retrieval conference http trecnistgov lewis d reuters text categorization test collection distribution http wwwresearchattcom lewis han e boley d gini m gross r hastings k karypis g kumar v mobasher b moore j webace web agent document categorization exploration proceedings nd international conference autonomous agents pp acm mccallum nigam k comparison event models naive bayes text classification aaai workshop learning text categorization pp witten frank e hall m data mining practical machine learning tools techniques morgan kaufmann computer journal vol
```


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.