20.13 Remove English Stop Words

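# Remove common English stop words (such as "the" and "and") from every document.
docs <- tm_map(docs, removeWords, stopwords("english"))
# Inspect document 16 to check the result.
inspect(docs[16])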
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
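
Since the effect of the transformation is hard to see in a long document, the following is a minimal sketch on a toy corpus. The object names `toy` and `cleaned` and the two sentences are illustrative assumptions, not part of the original data, and the sketch uses `VCorpus` rather than the `SimpleCorpus` shown above so that documents can be read back with `as.character()`.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only.
toy <- VCorpus(VectorSource(c(
  "this is the first document of the corpus",
  "and here is another one about text mining"
)))

# Remove English stop words, as in the step above.
toy <- tm_map(toy, removeWords, stopwords("english"))

# removeWords leaves gaps where words were deleted, so collapse them.
toy <- tm_map(toy, stripWhitespace)

# Pull the text back out as a plain character vector and trim the ends.
cleaned <- trimws(unname(sapply(toy, as.character)))
print(cleaned)
```

Note that `removeWords` is case sensitive and the list returned by `stopwords("english")` is lower case, so this step is usually most effective after the text has been converted to lower case.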
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               hwrf12.txt 
## hybrid weighted random forests \nclassifying  highdimensional data\nbaoxun xu  joshua zhexue huang  graham williams \nyunming ye\n\n\ndepartment  computer science harbin institute  technology shenzhen graduate\nschool shenzhen  china\n\nshenzhen institutes  advanced technology chinese academy  sciences shenzhen\n china\nemail amusing gmailcom\nrandom forests   popular classification method based   ensemble  \nsingle type  decision trees  subspaces  data   literature \n many different types  decision tree algorithms including c cart \nchaid  type  decision tree algorithm may capture different information\n structure  paper proposes  hybrid weighted random forest algorithm\nsimultaneously using  feature weighting method   hybrid forest method \nclassify  high dimensional data  hybrid weighted random forest algorithm\ncan effectively reduce subspace size  improve classification performance\nwithout increasing  error bound  conduct  series  experiments  eight\nhigh dimensional datasets  compare  method  traditional random forest\nmethods   classification methods  results show   method\nconsistently outperforms  traditional methods\nkeywords random forests hybrid weighted random forest classification decision tree\n\n\n\nintroduction\n\nrandom forests     popular classification\nmethod  builds  ensemble   single type\n decision trees  different random subspaces \ndata  decision trees  often either built using\nc   cart    one type within\n single random forest  recent years random\nforests  attracted increasing attention due \n  competitive performance compared  \nclassification methods especially  highdimensional\ndata  algorithmic intuitiveness  simplicity \n   important capability  ensemble using\nbagging   stochastic discrimination \nseveral methods   proposed  grow random\nforests  subspaces  data        \n methods   popular forest construction\nprocedure  proposed  breiman   first use\nbagging  generate training data subsets  building\nindividual trees\n 
subspace  features  \nrandomly selected   node  grow branches \n decision tree  trees   combined  \nensemble   forest   ensemble learner \nperformance   random forest  highly dependent\n two factors  performance   tree  \ndiversity   trees   forests  breiman\nformulated  overall performance   set  trees \n average strength  proved   generalization\n\nerror   random forest  bounded   ratio  \naverage correlation  trees divided   square\n  average strength   trees\n  high dimensional data   text data\n  usually  large portion  features  \nuninformative   classes   forest building\nprocess informative features    large\nchance   missed   randomly select  small\nsubspace breiman suggested selecting log m   \nfeatures   subspace  m   number \nindependent features   data  high dimensional\ndata    result weak trees  created  \nsubspaces  average strength   trees  reduced\n  error bound   random forest  enlarged\ntherefore   large proportion   weak\ntrees  generated   random forest  forest  \nlarge likelihood  make  wrong decision  mainly\nresults   weak trees classification power\n address  problem  aim  optimize decision\ntrees   random forest  two strategies one\nstraightforward strategy   enhance  classification\nperformance  individual trees   feature weighting\nmethod  subspace sampling     \nmethod feature weights  computed  respect\n  correlations  features   class feature\n regarded   probabilities   feature \n selected  subspaces  method obviously\nincreases  classification performance  individual\n\n computer journal vol \n\n \n\n\n\n\n\nbaoxun xu joshua zhexue huang graham williams yunming ye\n\ntrees   subspaces will  biased  contain\n informative features however  chance  \ncorrelated trees  also increased since  features \nlarge weights  likely   repeatedly selected\n second strategy   straightforward use\nseveral different types  decision trees   training\ndata subset  increase  diversity   trees\n  select  optimal tree   individual\ntree classifier   
random forest model  work\npresented  extends  algorithm developed  \nspecifically  build three different types  tree\nclassifiers c cart  chaid    \ntraining data subset   evaluate  performance\n  three classifiers  select  best tree \n way  build  hybrid random forest  may\ninclude different types  decision trees   ensemble\n added diversity   decision trees can effectively\nimprove  accuracy   tree   forest \nhence  classification performance   ensemble\nhowever   use  method  build  best\nrandom forest model  classifying high dimensional\ndata  can   sure   subspace size  best\n  paper  propose  hybrid weighted random\nforest algorithm  simultaneously using  new feature\nweighting method together   hybrid random\nforest method  classify high dimensional data \n new random forest algorithm  calculate feature\nweights  use weighted sampling  randomly select\nfeatures  subspaces   node  building different\ntypes  trees classifiers c cart  chaid \n training data subset  select  best tree \n individual tree   final ensemble model\nexperiments  performed   high dimensional\ntext datasets  dimensions ranging   \n  compared  performance  eight random\nforest methods  wellknown classification methods\nc random forest cart random forest chaid\nrandom forest hybrid random forest c weighted\nrandom forest cart weighted random forest chaid\nweighted random forest hybrid weighted random\nforest support vector machines  naive bayes \n knearest neighbors \n experimental\nresults show   hybrid weighted random forest\nachieves improved classification performance  \nten competitive methods\n remainder   paper  organized  follows\n section   introduce  framework  building\n hybrid weighted random forest  describe  new\nrandom forest algorithm section  summarizes four\nmeasures  evaluate random forest models  present\nexperimental results   high dimensional text datasets\n section  section  contains  conclusions\n\ntable  contingency table  input feature   class\nfeature y\ny  y   
\ny  yj   \ny  yq total\n  \n\n\nj\n\nq\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  ai\n\n\nij\n\niq\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  ap\np\n\npj\n\npq\np\ntotal\n\n\nj\n\nq\n\n\ngeneral framework  building hybrid random forests\n integrating  two methods  propose  novel\nhybrid weighted random forest algorithm\n\n\nlet y   class  target feature  q distinct\nclass labels yj  j       q   purposes \n discussion  consider  single categorical feature\n  dataset d  p distinct category values \ndenote  distinct values  ai         p\nnumeric features can  discretized  p intervals \n supervised discretization method\nassume d  val objects  size   subset \nd satisfying  condition    ai  y  yj \ndenoted  ij  considering  combinations  \ncategorical values     labels  y   can\nobtain  contingency table     y  shown\n table   far right column contains  marginal\ntotals  feature \n\nhybrid\nforests\n\nweighted\n\nrandom\n\n  section  first introduce  feature weighting\nmethod  subspace sampling   present \n\nq\n\n\n \n\nij\n\n        p\n\n\n\nj\n\n  bottom row   marginal totals  class\nfeature y \nj \n\np\n\n\nij\n\n j       q\n\n\n\n\n\n grand total  total number  samples  \n bottom right corner\n\n\nq \np\n\n\nij\n\n\n\nj \n\ngiven  training dataset d  feature   first\ncompute  contingency table  feature weights \n computed using  two methods   discussed\n  following subsection\n\n\n\n\nnotation\n\nfeature weighting method\n\n  subsection  give  details   feature\nweighting method  subspace sampling  random\nforests consider  mdimensional feature space\n           present   compute \n\n computer journal vol \n\n \n\n\n\nhybrid weighted random forests  classifying  highdimensional data\nweights w  w      wm   every feature   space\n weights   used   improved algorithm\n grow  decision tree   random forest\n feature weight computation\n weight  feature  represents  correlation\n  values  feature    values  \nclass feature y   larger weight will indicate  \nclass labels  objects   training 
dataset  \ncorrelated   values  feature  indicating \n   informative   class  objects thus \n suggested     stronger power  predicting\n classes  new objects\n  following  propose  use  chisquare\nstatistic  compute feature weights  \nmethod can quantify  correspondence  two\ncategorical variables\ngiven  contingency table   input feature  \n class feature y  dataset d  chisquare statistic\n  two features  computed \ncorra y  \n\nq\np \n\nij  tij \ntij\n j\n\n\n\n ij   observed frequency  \ncontingency table  tij   expected frequency\ncomputed \n x j\ntij \n\n\n\n\n larger  measure corra y   \ninformative  feature    predicting class y \n normalized feature weight\n practice feature weights  normalized  feature\nsubspace sampling  use corra y   measure \ninformativeness   features  consider \n feature weights however  treat  weights \nprobabilities  features  normalize  measures \nensure  sum   normalized feature weights \nequal   let corrai  y      m    set\n m feature measures  compute  normalized\nweights \n\ncorrai  y \nwi  n \n\n corrai  y \n  use  square root  smooth  values \n measures wi can  considered   probability\n feature ai  randomly sampled   subspace \n informative  feature   larger  weight \n higher  probability   feature  selected\n\ndiversity  commonly obtained  using bagging \nrandom subspace sampling  introduce  \nelement  diversity  using different types  trees\nconsidering  analogy  forestry  different data subsets  bagging represent  soil structures different decision tree algorithms represent different tree species  approach  two key aspects\none   use three types  decision tree algorithms \ngenerate three different tree classifiers   training data subset     evaluate  accuracy\n  tree   measure  tree importance  \npaper  use  outofbag accuracy  assess  importance   tree\nfollowing breiman   use bagging  generate\n series  training data subsets    build\ntrees   tree  data subset used  grow\n tree  called  inofbag iob data  \nremaining data 
subset  called  outofbag oob\ndata since oob data   used  building trees\n can use  data  objectively evaluate  trees\naccuracy  importance  oob accuracy gives \nunbiased estimate   true accuracy   model\ngiven n instances   training dataset d   tree\nclassifier hk iobk  built   kth training data\nsubset iobk   define  oob accuracy   tree\nhk iobk   di  d \nn\noobacck \n\nframework  building  hybrid random\nforest\n\n  ensemble learner  performance   random\nforest  highly dependent  two factors  diversity\namong  trees   accuracy   tree \n\n\n\nihk di   yi  di  oobk \nn\n idi  oobk \n\n\n\n    indicator function  larger \noobacck   better  classification quality   tree\n use  outofbag data subset oobi  calculate\n outofbag accuracies   three types  trees\nc cart  chaid  evaluation values e \ne  e respectively\nfig  illustrates  procedure  building  hybrid\nrandom forest model firstly  series  iob oob\ndatasets  generated   entire training dataset\n bagging  three types  tree classifiers c\ncart  chaid  built using  iob dataset\n corresponding oob dataset  used  calculate \noob accuracies   three tree classifiers finally\n select  tree   highest oob accuracy \n final tree classifier   included   hybrid\nrandom forest\nbuilding  hybrid random forest model  \nway will increase  diversity among  trees\n classification performance   individual tree\nclassifier  also maximized\n\n\n\n\n\n\ndecision tree algorithms\n\n core   approach   diversity  decision\ntree algorithms   random forest different decision\ntree algorithms grow structurally different trees \n  training data selecting  good decision tree\nalgorithm  grow trees   random forest  critical\n\n computer journal vol \n\n \n\n\n\n\n\nbaoxun xu joshua zhexue huang graham williams yunming ye\n difference lies   way  split  node \n  split functions  binary branches  multibranches   work  use  different decision\ntree algorithms  build  hybrid random forest\n\n\n\nfigure   hybrid random forests framework\n\n  
performance   random forest  studies\n considered  different decision tree algorithms\naffect  random forest      paper\n common decision tree algorithms   follows\nclassification trees  c   supervised\nlearning classification algorithm used  construct\ndecision trees given  set  preclassified objects \ndescribed   vector  attribute values  construct\n mapping  attribute values  classes c uses\n divideandconquer approach  grow decision trees\nbeginning   entire dataset  tree  constructed\n considering  predictor variable  dividing \ndataset  best predictor  chosen   node\nusing  impurity  diversity measure  goal \n produce subsets   data   homogeneous\n respect   target variable c selects  test\n maximizes  information gain ratio igr \nclassification  regression tree cart \n recursive partitioning method  can  used \n regression  classification  main difference\n c  cart   test selection \nevaluation process\nchisquared automatic interaction detector\nchaid method  based   chisquare test \nassociation  chaid decision tree  constructed\n repeatedly splitting subsets   space  two\n  nodes  determine  best split  \nnode  allowable pair  categories   predictor\nvariables  merged     statistically\nsignificant difference within  pair  respect  \ntarget variable  \n  decision tree algorithms  can see \n\nhybrid weighted random forest algorithm\n\n  subsection  present  hybrid weighted\nrandom forest algorithm  simultaneously using \nfeature weights   hybrid method  classify high\ndimensional data  benefits   algorithm \ntwo aspects firstly compared  hybrid forest\nmethod   can use  small subspace size \ncreate accurate random forest models\nsecondly\ncompared  building  random forest using feature\nweighting   can use several different types \ndecision trees   training data subset  increase\n diversities  trees  added diversity  \ndecision trees can effectively improve  classification\nperformance   ensemble model  detailed steps\n introduced  algorithm \ninput 
parameters  algorithm  include  training\ndataset d  set  features   class feature y \n number  trees   random forest k  \nsize  subspaces m  output   random forest\nmodel m  lines  form  loop  building k\ndecision trees   loop line  samples  training\ndata d  sampling  replacement  generate \ninofbag data subset iobi  building  decision tree\nline  build three types  tree classifiers c\ncart  chaid   procedure line  calls\n function createt reej   build  tree classifier\nline  calculates  outofbag accuracy   tree\nclassifier   procedure line  selects  tree\nclassifier   maximum outofbag accuracy k\ndecision tree trees  thus generated  form  hybrid\nweighted random forest model m \ngenerically function createt reej  first creates \nnew node   tests  stopping criteria  decide\nwhether  return   upper node   split \nnode   choose  split  node   feature\nweighting method  used  randomly select m features\n  subspace  node splitting  features\n used  candidates  generate  best split \npartition  node   subset   partition\ncreatet reej   called   create  new node \n current node   leaf node  created  returns \n parent node  recursive process continues \n full tree  generated\n\n computer journal vol \n\n \n\n\n\nhybrid weighted random forests  classifying  highdimensional data\nalgorithm  new random forest algorithm\n input\n  d   training dataset\n     features space       \n  y   class features space y  y   yq \n  k   number  trees\n  m   size  subspaces\n output  random forest m \n method\n      k \n\ndraw  bootstrap sample inofbag data subset\niobi  outofbag data subset oobi \ntraining dataset d\n\n j     \n\nhij iobi   createt reej \nuse outofbag data subset oobi  calculate\n\n outofbag accuracy oobacci j   tree\nclassifier hij iobi   equation\n\nend \n\nselect hi iobi    highest outofbag\naccuracy oobacci  optimal tree \n end \n combine\n\nk\ntree\nclassifiers\nh iob  h iob   hk iobk    random\nforest m \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfunction 
createtree\ncreate  new node n \n stopping criteria  met \nreturn n   leaf node\nelse\n j    m \ncompute\n\ninformativeness\nmeasure\ncorraj  y   equation \nend \ncompute feature weights w  w   wm  \nequation \nuse  feature weighting method  randomly\nselect m features\nuse  m features  candidates  generate\n best split   node   partitioned\ncall createtree   split\nend \nreturn n \nevaluation measures\n\n  paper  use five measures ie strength\ncorrelation error bound c s  test accuracy  f\nmetric  evaluate  random forest models strength\nmeasures  collective performance  individual trees\n  random forest   correlation measures \ndiversity   trees  ratio   correlation\n  square   strength c s indicates \ngeneralization error bound   random forest model\n three measures  introduced   \naccuracy measures  performance   random forest\nmodel  unseen test data  f metric  \n\n\n\ncommonly used measure  classification performance\n\n\nstrength  correlation measures\n\n follow breimans method described   \ncalculate  strength correlation   ratio c s \nfollowing breimans notation  denote strength \ns  correlation   let hk iobk    kth\ntree classifier grown   kth training data iobk\nsampled  d  replacement\nassume \nrandom forest model contains k trees  outofbag\nproportion  votes  di  d  class j \nk\nihk di   j di \n  iobk \nqdi  j  kk\n\n  iobk \nk idi \n   number  trees   random forest\n  trained without di  classify di  class\nj divided   number  training datasets \ncontaining di \n strength s  computed \n\nqdi  yi   maxjyi qdi  j\nn \nn\n\ns\n\n\n\n n   number  objects  d  yi indicates\n true class  di \n correlation   computed \nn\n\n\n\n qdi  yi   maxjyi qdi  j  s\nn\n\n \n\n\nk\n\nk\nk  pk  pk  \nk pk  p\n\n\nn\npk \n\n\n\nihk di   yi  di \n  iobk \nn\n  iobk \n idi \n\n\n\n\nn\npk \n\n\n\nihk di   jdi  y  di \n  iobk \nn\nid\n\n \niob\n\n\nk\n\n\n\n\n\njdi  y   argmaxjyi qd j\n\n\n\n  class  obtains  maximal number  votes\namong  classes   true class\n\n\ngeneral 
error bound measure c s\n\ngiven  strength  correlation  outofbag\nestimate   c s measure can  computed\n important theoretical result  breimans method\n  upper bound   generalization error  \nrandom forest ensemble   derived \np e    s  s\n\n\n\n    mean value  correlations  \npairs  individual classifiers  s   strength \n set  individual classifiers   estimated  \n\n computer journal vol \n\n \n\n\n\n\n\nbaoxun xu joshua zhexue huang graham williams yunming ye\n\naverage accuracy  individual classifiers  d \noutofbag evaluation  inequality shows  \ngeneralization error   random forest  affected \n strength  individual classifiers   mutual\ncorrelations therefore breiman defined  c s ratio\n measure  random forest \nc s   s\n\n\n\n smaller  ratio  better  performance \n random forest   c s gives guidance \nreducing  generalization error  random forests\n\n\ntest accuracy\n\n test accuracy measures  classification performance   random forest   test data set let\ndt   test data  yt   class labels given\ndi  dt   number  votes  di  class j \nn di  j \n\nk\n\n\nihk di   j\n\n\n\ntable \nsummary statistic   highdimensional\ndatasets\nname\nfeatures\ninstances\nclasses  minority\nfbis\n\n\n\n\nre\n\n\n\n\nre\n\n\n\n\ntr\n\n\n\n\nwap\n\n\n\n\ntr\n\n\n\n\nlas\n\n\n\n\nlas\n\n\n\n\n\n emphasizes  performance   classifier  rare\ncategories define     follows\n\n \n\nt pi\nt pi\n  \nt pi  f pi \nt pi  f ni \n\n\n\nf    category    macroaveraged f\n computed \n\nk\n\n test accuracy  calculated \nf  \n\n di  yi   maxjyi n di  j   \nn \n\n \n m acrof  \n  \n\nq\n\n\nq\n\nf \n\n\n\nn\n\nacc \n\n n   number  objects  dt  yi indicates\n true class  di \n\n\nf metric\n\n evaluate  performance  classification methods\n dealing   unbalanced class distribution  use\n f metric introduced  yang  liu  \nmeasure  equal   harmonic mean  recall \n precision   overall f score   entire\nclassification problem can  computed   microaverage   macroaverage\nmicroaveraged f  computed globally  
\nclasses  emphasizes  performance   classifier\n common classes define     follows\nq\n\nq\nt pi\n t pi\n  q \n   q\n\n t pi  f pi \n t pi  f ni \n q   number  classes t pi true positives\n  number  objects correctly predicted  class \nf pi false positives   number  objects  \npredicted  belong  class      microaveraged f  computed \nm icrof  \n\n\n\n\n\n\nmacroaveraged f  first computed locally \n class    average   classes  taken\n\n larger  microf  macrof values  \nhigher  classification performance   classifier\n\n\nexperiments\n\n  section  present two experiments \ndemonstrate  effectiveness   new random\nforest algorithm  classifying high dimensional data\nhigh dimensional datasets  various sizes \ncharacteristics  used   experiments \nfirst experiment  designed  show   proposed\nmethod can reduce  generalization error bound\nc s   improve test accuracy   size \n selected subspace    large  second\nexperiment  used  demonstrate  classification\nperformance   proposed method  comparison \n classification methods ie svm nb  knn\n\n\ndatasets\n\n  experiments  used eight realworld high\ndimensional datasets  datasets  selected\ndue   diversities   number  features \nnumber  instances   number  classes \ndimensionalities vary     instances\nvary       minority class rate varies\n      dataset  randomly\nselect   instances   training dataset \n remaining data   test dataset detailed\ninformation   eight datasets  listed  table \n fbis re re tr wap tr las\n las datasets  classical text classification\nbenchmark datasets   carefully selected \n\n computer journal vol \n\n \n\n\n\nhybrid weighted random forests  classifying  highdimensional data\npreprocessed  han  karypis  dataset fbis\n compiled   foreign broadcast information\nservice trec   datasets re  re \nselected   reuters text categorization test\ncollection distribution    datasets tr \ntr  derived  trec  trec \n trec  dataset wap    webace\nproject wap   datasets las  las \nselected   los angeles times  
trec \n classes   datasets  generated  \nrelevance judgment provided   collections\n\n\nperformance comparisons  random forest methods\n\n purpose   experiment   evaluate\n effect   hybrid weighted random forest\nmethod h w rf  strength correlation c s \n test accuracy\n eight high dimensional\ndatasets  analyzed  results  compared\n seven  random forest methods ie c\nrandom forest c rf cart random forest\ncart rf chaid random forest chaid rf\nhybrid random forest h rf c weighted random\nforest c w rf cart weighted random forest\ncart w rf chaid weighted random forest\nchaid w rf   dataset  ran \nrandom forest algorithm  different sizes  \nfeature subspaces since  number  features  \ndatasets   large  started   subspace\n  features  increased  subspace   \nfeatures  time   given subspace size  built\n trees   random forest model  order \nobtain  stable result  built  random forest models\n  subspace size  dataset   algorithm\n computed  average values   four measures\n strength correlation c s   test accuracy  \nfinal results  comparison  performance  \neight random forest algorithms   four measures\n     datasets  shown  figs    \n\nfig  plots  strength   eight methods \ndifferent subspace sizes      datasets\n   subspace  higher  strength \nbetter  result   curves  can see \n new algorithm h w rf consistently performs\nbetter   seven  random forest algorithms\n advantages   obvious  small subspaces\n new algorithm quickly achieved higher strength\n  subspace size increases\n seven \nrandom forest algorithms require larger subspaces \nachieve  higher strength  results indicate \n hybrid weighted random forest algorithm enables\nrandom forest models  achieve  higher strength\n small subspace sizes compared   seven \nrandom forest algorithms\nfig  plots  curves   correlations  \neight random forest methods    datasets \n\n\n\nsmall subspace sizes h rf c rf cart rf\n chaid rf produce higher correlations \n trees   datasets  correlation decreases\n  subspace size 
increases   random forest\nmodels  lower  correlation   trees\n  better  final model\n  new\nrandom forest algorithm h w rf  low correlation\nlevel  achieved   small subspaces  \n datasets  also note    subspace size\nincreased  correlation level increased  well  \nunderstandable    subspace size increases\n  informative features   likely  \nselected repeatedly   subspaces increasing \nsimilarity   decision trees therefore  feature\nweighting method  subspace selection works well \nsmall subspaces  least   point  view  \ncorrelation measure\nfig  shows  error bound indicator c s  \neight methods    datasets   figures\n can observe    subspace size increases c s\nconsistently reduces  behaviour indicates  \nsubspace size larger  log m  benefits  eight\nalgorithms however  new algorithm h w rf\nachieved  lower level  c s  subspace size \nlog m      seven  algorithms\nfig  plots  curves showing  accuracy  \neight random forest models   test datasets \n  datasets  can clearly see   new random\nforest algorithm h w rf outperforms  seven\n random forest algorithms   eight data sets\n can  seen   new method   stable\n classification performance   methods \n   figures   observed   highest test\naccuracy  often obtained   default subspace size\n log m     implies   practice large\nsize subspaces   necessary  grow highquality\ntrees  random forests\n\n\nperformance comparisons\nclassification methods\n\n\n\n\n\n conducted   experimental comparison\n three  widely used text classification\nmethods support vector machines svm naive\nbayes nb  knearest neighbor knn \nsupport vector machine used  linear kernel  \nregularization parameter     often\nused  text categorization  naive bayes \nadopted  multivariate bernoulli event model \n frequently used  text classification   knearest neighbor knn  set  number k \nneighbors     experiments  used wekas\nimplementation   three text classification\nmethods   used  single subspace size \nfeatures   eight datasets  run  random 
forest\nalgorithms  h rf c rf cart rf \nchaid rf  used  subspace size   features \n first  datasets ie fbis re re tr wap \n\n computer journal vol \n\n \n\n\n\n\n\nbaoxun xu joshua zhexue huang graham williams yunming ye\nfbis\n\nre\n\n\n\n\n\n\n\n\n\n\n\nstrength\n\nstrength\n\n\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\nchaidwrf\n\nchaidwrf\n\n\n\nhrf\n\n\n\nhrf\n\ncrf\n\ncrf\n\n\n\ncartrf\n\n\n\ncartrf\n\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nre\n\n\n\n\n\ntr\n\n\n\n\n\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\nstrength\n\nstrength\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\nchaidwrf\n\n\n\nchaidwrf\n\nhrf\n\nhrf\n\n\n\ncrf\n\n\n\ncrf\n\ncartrf\n\ncartrf\n\n\n\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nwap\n\ntr\n\n\n\n\n\n\n\n\n\n\nstrength\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nstrength\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\nchaidwrf\nhrf\n\n\n\nhrf\n\n\n\ncrf\n\ncrf\n\ncartrf\n\ncartrf\n\n\n\n\n\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nlas\n\n\n\n\n\n\n\n\n\n\n\nlas\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\nchaidwrf\n\n\n\nstrength\n\nstrength\n\n\n\nnumber  features\n\nnumber  features\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\n\nhrf\ncrf\n\n\n\nhrf\n\n\n\ncrf\n\ncartrf\nchaidrf\n\ncartrf\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nfigure  strength changes   number  features   subspace    high dimensional datasets\n\ntr  run  random forest algorithms  used\n subspace size   features   last  datasets\nlas  las  run  random forest algorithms\n h w rf c w rf cart w rf \nchaid w rf  used breimans subspace size \n\nlog m     run  random forest 
algorithms\n number  features provided  consistent result \nshown  fig   order  obtain stable results \nbuilt  random forest models   random forest\nalgorithm   dataset  present  average\n\n computer journal vol \n\n \n\n\n\nhybrid weighted random forests  classifying  highdimensional data\nfbis\n\n\n\nre\n\n\n\n\n\n\n\n\n\ncorrelation\n\ncorrelation\n\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\n\nhrf\ncrf\n\n\n\nhrf\n\n\n\ncrf\n\ncartrf\nchaidrf\n\ncartrf\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nre\n\n\n\n\n\ntr\n\n\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\n\n\ncorrelation\n\ncorrelation\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\nhrf\n\nhrf\ncrf\n\n\n\ncrf\n\n\n\ncartrf\n\ncartrf\n\n\n\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nwap\n\ntr\n\n\n\n\n\n\n\n\ncorrelation\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\n\n\ncorrelation\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\n\n\nhrf\n\nhrf\n\ncrf\n\n\n\ncrf\n\ncartrf\n\ncartrf\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nlas\n\nlas\n\n\n\n\n\n\n\n\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\n\n\n\ncorrelation\n\ncorrelation\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\n\n\nhrf\n\nhrf\ncrf\n\n\n\ncrf\n\n\n\ncartrf\n\ncartrf\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nfigure  correlation changes   number  features   subspace    high dimensional datasets\n\nresults noting   range  values  less \n   hybrid trees  always  accurate\n comparison results  classification performance\n eleven 
methods  shown  table \n\nperformance  estimated using test accuracy acc\n\nmicro f mic  macro f mac boldface\ndenotes best results  eleven classification\nmethods\n  improvement  often quite\nsmall   always  improvement demonstrated\n observe   proposed method h w rf\n\n computer journal vol \n\n \n\n\n\n\n\nbaoxun xu joshua zhexue huang graham williams yunming ye\nfbis\n\n\n\nre\n\n\n\n\nlog m\n\n\n\n\n\n\n\n\n\n\n\ncwrf\n\n\n\n\n\n\n\nhwrf\n\nc s\n\nc s\n\n\n\n\n\ncartwrf\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\nchaidwrf\n\n\n\nchaidwrf\n\n\n\nhrf\n\nhrf\n\nlog m\n\n\ncrf\n\n\n\ncrf\n\n\n\ncartrf\n\ncartrf\n\nchaidrf\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\nlog m\n\n\n\n\n\n\n\nc s\n\nhwrf\n\n\n\n\n\n\n\n\n\n\n\ntr\n\n\n\n\n\nc s\n\n\n\nnumber  features\n\nre\n\n\n\n\n\ncwrf\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\ncartwrf\n\nchaidwrf\n\n\n\nchaidwrf\n\n\n\nhrf\n\nhrf\n\ncrf\n\nlog m\n\n\n\n\n\ncrf\n\n\n\ncartrf\n\ncartrf\n\nchaidrf\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nwap\n\ntr\n\n\n\n\n\n\n\nlog m\n\nlog m\n\n\n\n\n\n\n\n\ncwrf\n\nc s\n\n\n\nc s\n\nhwrf\n\n\n\n\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\ncartwrf\n\n\nchaidwrf\n\n\n\nchaidwrf\nhrf\n\nhrf\ncrf\n\n\n\ncrf\n\n\n\ncartrf\n\ncartrf\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nlas\n\nlas\n\n\n\n\n\n\n\n\n\n\n\ncwrf\ncartwrf\n\n\n\n\n\nhrf\n\n\n\ncrf\ncartwrf\n\n\n\nchaidrf\n\n\n\n\n\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\n\nlog m\n\n\n\nc s\n\nc s\n\n\n\n\nhwrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nchaidwrf\n\nlog m\n\n\n\n\nhrf\ncrf\n\n\n\ncartrf\nchaidrf\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nfigure  c s changes   number  features   subspace    high 
dimensional datasets\n\noutperformed   classification methods  \ndatasets\n\n\n\nconclusions\n\n  paper  presented  hybrid weighted random\nforest algorithm  simultaneously using  feature\nweighting method   hybrid forest method  classify\n computer journal vol \n\n \n\n\n\nhybrid weighted random forests  classifying  highdimensional data\nfbis\n\n\n\n\n\nre\n\n\n\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\naccuracy\n\naccuracy\n\n\n\nchaidwrf\nhrf\n\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\nhrf\n\n\n\ncrf\n\ncrf\n\nlog m\n\ncartrf\n\n\n\n\n\ncartrf\n\nlog m\n\n\n\n\n\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nre\n\ntr\n\n\n\n\n\n\n\n\nlog m\n\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\n\n\naccuracy\n\naccuracy\n\n\n\nchaidwrf\n\n\n\nhwrf\n\n\n\ncwrf\ncartwrf\n\n\n\nchaidwrf\nhrf\n\nhrf\n\n\n\nlog m\n\n\n\ncrf\n\ncrf\n\n\n\ncartrf\n\ncartrf\n\n\n\n\n\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nwap\n\n\n\n\n\ntr\n\n\n\n\n\n\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\n\naccuracy\n\naccuracy\n\n\n\n\n\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\nchaidwrf\n\n\n\nhrf\n\nlog m\n\n\n\n\nhrf\n\ncrf\n\n\n\ncartrf\n\n\n\ncrf\n\nlog m\n\n\n\n\ncartrf\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nnumber  features\n\nlas\n\n\n\n\n\nlas\n\n\n\n\n\n\n\naccuracy\n\n\nhwrf\ncwrf\n\n\n\ncartwrf\nchaidwrf\n\n\n\naccuracy\n\n\n\n\n\n\n\nhwrf\ncwrf\ncartwrf\n\n\n\nchaidwrf\nhrf\n\nhrf\n\nlog m\n\n\n\ncrf\n\n\n\ncrf\n\nlog m\n\n\n\n\n\ncartrf\n\ncartrf\nchaidrf\n\nchaidrf\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nnumber  features\n\nfigure  test accuracy changes   number  features   subspace    high dimensional 
datasets\n\nhigh dimensional data  algorithm   retains\n small subspace size breimans formula log m   \n determining  subspace size  create accurate\nrandom forest models  also effectively reduces\n upper bound   generalization error \n\nimproves classification performance   results \nexperiments  various high dimensional datasets \nrandom forest generated   new method  superior\n  classification methods  can use  default\nlog m    subspace size  generally guarantee\n\n computer journal vol \n\n \n\n\n\n\n\nbaoxun xu joshua zhexue huang graham williams yunming ye\n\ntable   comparison  results\ndatasets\ndataset\nfbis\nmeasures\nacc\nmic\nsvm\n \nknn\n\n\nnb\n \nh rf\n \nc rf\n \ncart rf\n \nchaid rf\n \nh w rf\n \nc w rf\n \ncart w rf\n \nchaid w rf\n \ndataset\nwap\nmeasures\nacc\nmic\nsvm\n\n\nknn\n \nnb\n \nh rf\n \nc rf\n \ncart rf\n \nchaid rf\n \nh w rf\n \nc w rf\n \ncart w rf\n\n\nchaid w rf\n \n\nbest accuracy micro f  macro f results   eleven methods   \nre\nmic\n\n\n\n\n\n\n\n\n\n\n\ntr\nmac\nacc\nmic\n  \n  \n  \n  \n  \n\n \n \n\n  \n  \n\n\n\n\n\n\n\nmac\n\n\n\n\n\n\n\n\n\n\n\n\nacc\n\n\n\n\n\n\n\n\n\n\n\n\n always produce  best models   variety \nmeasures  using  hybrid weighted random forest\nalgorithm\nacknowledgements\n research  supported  part  nsfc \ngrant   shenzhen new industry development fund  grant nocxba\nreferences\n breiman l  random forests machine learning\n \n ho t  random subspace method  constructing decision forests ieee transactions  pattern\nanalysis  machine intelligence  \n quinlan j  c programs  machine\nlearning morgan kaufmann\n breiman l  classification  regression trees\nchapman  hall crc\n breiman l  bagging predictors\nmachine\nlearning  \n ho t  random decision forests proceedings\n  third international conference  document\nanalysis  recognition pp  ieee\n dietterich t   experimental comparison \nthree methods  constructing ensembles  decision\ntrees bagging boosting  randomization machine\nlearning  
\n\nmac\n\n\n\n\n\n\n\n\n\n\n\nmac\n\n\n\n\n\n\n\n\n\n\n\n\nre\nmic\n\n\n\n\n\n\n\n\n\n\n\nlas\nacc\nmic\n\n\n \n \n\n\n \n \n\n\n \n \n \n \nacc\n\n\n\n\n\n\n\n\n\n\n\n\ntr\nmic\n\n\n\n\n\n\n\n\n\n\n\nlas\nmac\nacc\nmic\n  \n  \n\n\n\n\n \n\n \n\n\n\n  \n  \n  \n \n\n\n \nmac\n\n\n\n\n\n\n\n\n\n\n\n\nacc\n\n\n\n\n\n\n\n\n\n\n\n\nmac\n\n\n\n\n\n\n\n\n\n\n\nmac\n\n\n\n\n\n\n\n\n\n\n\n\n banfield r hall l bowyer k  kegelmeyer w\n  comparison  decision tree ensemble creation\ntechniques ieee transactions  pattern analysis\n machine intelligence  \n\n robniksikonja\nm  improving random forests\nproceedings   th european conference \nmachine learning pp  springer\n ho t  c decision forests proceedings \n fourteenth international conference  pattern\nrecognition pp  ieee\n dietterrich t  machine learning research four\ncurrent direction artificial intelligence magzine \n\n amaratunga d cabrera j  lee y \nenriched random forests bioinformatics  \n\n ye y li h deng x  huang j \nfeature weighting random forest  detection  hidden\nweb search interfaces  journal  computational\nlinguistics  chinese language processing  \n\n xu b huang j williams g wang q \nye y  classifying  highdimensional data\n random forests built  small subspaces\ninternational journal  data warehousing \nmining  \n xu b huang j williams g li j  ye y\n hybrid random forests advantages  mixed\ntrees  classifying text data proceedings   th\npacificasia conference  knowledge discovery \ndata mining springer\n\n computer journal vol \n\n \n\n\n\nhybrid weighted random forests  classifying  highdimensional data\n biggs d de ville b  suen e   method\n choosing multiway partitions  classification \ndecision trees journal  applied statistics  \n ture m kurt  turhan kurum   ozdamar\nk  comparing classification techniques \npredicting essential hypertension expert systems \napplications  \n begum n ma f  ren f  automatic text summarization using support vector machine\ninternational journal  innovative computing 
information  control  \n chen j huang h tian s  qu y \nfeature selection  text classification  naive\nbayes expert systems  applications  \n\n tan s  neighborweighted knearest neighbor\n unbalanced text corpus\nexpert systems \napplications  \n pearson k    theory  contingency \n relation  association  normal correlation\ncambridge university press\n yang y  liu x   reexamination \ntext categorization methods proceedings   th\ninternational conference  research  development\n information retrieval pp  acm\n han e  karypis g  centroidbased\ndocument classification analysis  experimental\nresults proceedings   th european conference \nprinciples  data mining  knowledge discovery\npp  springer\n trec\n\ntext\nretrieval\nconference\nhttp  trecnistgov\n lewis\nd\n\nreuters\ntext\ncategorization\ntest\ncollection\ndistribution\n\nhttp  wwwresearchattcom  lewis\n han e boley d gini m gross r hastings\nk karypis g kumar v mobasher b \nmoore j  webace  web agent  document\ncategorization  exploration proceedings   nd\ninternational conference  autonomous agents pp\n acm\n mccallum   nigam k   comparison \nevent models  naive bayes text classification aaai workshop  learning  text categorization pp \n\n witten  frank e  hall m  data mining\npractical machine learning tools  techniques\nmorgan kaufmann\n\n computer journal vol \n\n \n\n\n\n\n

Stop words are common words in a language that carry little meaning on their own. Words such as "for", "very", "and", "of", and "are" are typical English stop words. Notice that they have been removed from the text above.
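
To see the effect on something shorter than a whole document, the following minimal sketch applies removeWords() directly to a character string. The sentence in txt is a hypothetical example and is not taken from the corpus. Removal leaves the surrounding spaces behind, so the sketch collapses them with base R functions afterwards.

```r
library(tm)

# A hypothetical sentence used only for illustration.
txt <- "this is a test of the stop word removal for a very small example"

# removeWords() accepts a character vector as well as a document.
cleaned <- removeWords(txt, stopwords("english"))

# Each removed word leaves its neighbouring spaces in place, so
# collapse runs of whitespace and trim the ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

Only the words that are not in the stop word list remain in cleaned.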

We can list the stop words:

length(stopwords("english"))
## [1] 174
stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
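
The list explains why some common words survive in the processed text: words such as "can" and "also" are not among the 174 entries. As a sketch only, the code below checks membership for a few words and then extends the list with extra words before removal. The extra words and the sample sentence are assumptions chosen for illustration, not part of the original processing.

```r
library(tm)

sw <- stopwords("english")

# The example words mentioned earlier are all in the list.
in_list <- c("for", "very", "and", "of", "are") %in% sw

# These words are not in the list, so removeWords() leaves them alone.
not_in_list <- c("can", "also") %in% sw

# Hypothetical extension: append our own words to the standard list.
my_stopwords <- c(sw, "can", "also")

# A hypothetical sentence to show the extended list in use.
res <- removeWords("we can also see the result", my_stopwords)
res <- trimws(gsub("\\s+", " ", res))

print(in_list)
print(not_in_list)
print(res)
```

The same extended vector could be passed to tm_map() in place of stopwords("english") to apply it across the corpus.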


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.