21.17 Stemming

# Stem words in the corpus to their root form (requires the SnowballC package).
docs <- tm_map(docs, stemDocument)
viewDocs(docs, 16)
## hybrid weight random forest classifi highdimension data baoxun xu
## joshua zhexu huang graham william yunm ye comput scienc HIT shenzhen
## graduat school shenzhen china SIAT CAS shenzhen china amus gmailcom
## random forest popular classif method base ensembl singl type decis
## tree subspac data literatur mani differ type decis tree algorithm
## includ c cart chaid type decis tree algorithm may captur differ
## inform structur paper propos hybrid weight random forest algorithm
## simultan use featur weight method hybrid forest method classifi high
## dimension data hybrid weight random forest algorithm can effect
## reduc subspac size improv classif perform without increas error
## bound conduct seri experi eight high dimension dataset compar method
## tradit random forest method classif method result show method
## consist outperform tradit method keyword random forest hybrid weight
## random forest classif decis tree

Stemming uses an algorithm that removes common word endings from English words, such as "es", "ed", and "'s".
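The tm package delegates the actual stemming to SnowballC, which implements Porter's stemming algorithm. As a minimal sketch (assuming the SnowballC package is installed), we can apply the stemmer to individual words with wordStem() to see the kind of transformation applied to each word of the corpus above:

library(SnowballC)
# A small illustration: "running" and "runs" both reduce to "run",
# "easily" becomes "easili", and "classification" becomes "classif" --
# stems need not themselves be English words.
wordStem(c("running", "runs", "easily", "classification"), language="english")

Stems like classif are not valid English words; the aim is not readability but ensuring that variants such as classify, classifier, and classification all map to the same term.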


