Hands-On Data Science with R

Dr Graham Williams, PhD (ANU, Machine Learning Ensembles), BSc (Maths, Hons)
Data Scientist, Togaware and Australian Taxation Office
Adjunct Professor, Australian National University and University of Canberra
International Visiting Professor, Chinese Academy of Sciences

Preface

Welcome to the Data Science resources from Togaware. This page collects draft chapters for an upcoming book on R Programming and Data Science. The material is in various stages of completeness, and I ask that if you make use of it you please provide feedback. Our aim is to freely provide extensive material to support the Data Scientist.

Togaware also provides a unique offering of on-site, hands-on training using R for Data Science. We offer traditional out-of-office training courses, but we find that more effective learning occurs hands-on and on-site. We offer one of the world's leading Data Scientists to work alongside and mentor your staff over one or two weeks. We work confidentially on actual projects, with training "on-the-job" provided by a professional with 30 years' experience in the industry who is the author of the best-selling book on Data Mining with Rattle and R. Contact Togaware Training through training@togaware.com for details.

Our material here begins with an overview of how an organization should go about setting up its Analytics capability and then introduces the Data Scientist to the most fully featured yet cost-effective toolkit available: R.

A word of warning though - be aware that IT departments will often mistakenly see Analytics as they see standard off-the-shelf software. Some are getting it but many still have a journey to travel. Often the IT department will want to survey the available commercial products (and today there are hundreds of products available), seek advice and understanding from the vendors, who all have their own vested interests to take care of (rather than seeking the advice of the practitioners), decide on one provider (the one true solution), purchase that product and the required infrastructure, and believe that they have delivered an Analytics capability to support the organization for the next 10 years. This might work well for traditional and mature products like accounting software, data warehouses and transaction processing, but I've seen millions wasted on software by IT departments simply not getting it.

Analytics is about the skill of a Data Scientist using a variety of tools and platforms that are changing quickly. Today's multi-million investment will quickly go out the window, so please don't do it. Save money for your organisation by investing in the skills of your people rather than in expensive closed-source software, when even more powerful open source alternatives are available.

Instead of expensive deployments intended to remain stable for the next 10 years, modernise your culture to be prepared for agile and inexpensive deployments that will grow and change rapidly. We may quickly move on from some of these investments, and we need to be prepared to do so - it's not easy to move on from very expensive software packages.

See Analyst First for further views along these lines.

Our on-line resources, including Hands-On Data Science, weave together a collection of freely available and open source tools for the Data Scientist. The tools are all part of the R Statistical Software Suite. Each chapter provides a great place to start the journey as a data scientist. From a training point of view (part of Togaware's business), each also provides a chance to decide whether to engage our hands-on training experts.

The material here aims to be a hands-on guide first and a reference guide thereafter. Each section aims to be a bite-sized chunk for hands-on learning, building on what has gone before. Many chapters also have a lecture pack and a laboratory session where a number of tasks can be completed. The R code behind each chapter is also provided and can easily be run standalone to replicate the material presented in the chapter.

The material is always under development! Chapters will change (and hopefully improve) regularly. Links preceded by a * are better developed. All of the material here is provided under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, allowing access to everyone for any purpose (except commercial), and is provided at no cost. You are welcome, but certainly not required, to help cover the costs of providing this material through a $40 contribution using PayPal. Your support encourages further development of this resource, as do feedback, suggestions, and ideas, which are always welcome.

Refer to the Data Mining Survival Guide or my book on Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery (Use R) for related material.

Many of the initial chapters were developed and tested whilst visiting the Shenzhen Institutes of Advanced Technology as an International Visiting Professor of the Chinese Academy of Sciences.

The data used across the chapters is available for download as data.zip.

Enjoy!

Part 1: Data Science
  1. Data Science and Analytics: *Lecture - *Chapter - *R - *Data Mining Lecture
  2. Rattle to R: *Chapter - *R
  3. Literate Data Science with KnitR: *Lecture - *Chapter - *R
  4. A Template for Preparing Data: *Chapter - *R
  5. A Template for Building Models: *Chapter - *R
  6. Case Studies: *Chapter - *R
Part 2: R Programming
  1. Doing R with Style: *Chapter - *R
  2. The Basics of R: Chapter - R
Part 3: Dealing With Data
  1. Reading Data into R: *Chapter - *R
  2. Exploring and Summarising Data: *Chapter - *R
  3. Visualising Data with GGPlot2: *Chapter - *R
  4. Transforming Data: *Chapter - *R
Part 4: Descriptive Analytics
  1. Cluster Analysis: *Lecture - Chapter - R
  2. Association Analysis: *Lecture - Chapter - R
Part 5: Predictive Analytics
  1. Decision Trees: *Lecture - *Chapter - *R - *Rattle
  2. Ensembles of Decision Trees: *Lecture - *Chapter - *R
  3. Support Vector Machines
  4. Neural Networks
  5. Naive Bayes: Chapter - R
  6. Multivariate Adaptive Regression Splines: Chapter - R
  7. Evaluating Models: *Chapter - *R
  8. Scoring (R)
  9. PMML (R) Exporting Models for Deployment
Part 6: Advanced Analytics
  1. Text Mining: *Chapter - *R - Corpus as tar.gz or zip
  2. Social Network Analysis: Chapter - R
  3. Genetic Programming: Chapter - R
Part 7: Advanced R
  1. Strings: Chapter - R
  2. Dates and Time: *Chapter - *R
  3. Spatial Data: *Chapter - *R
  4. Big Data: *Chapter - *R
  5. Exploring Different Plots: Chapter - R
  6. Writing Functions: Chapter - R
  7. Parallel Processing: Chapter - R
  8. Environments: *Chapter - R
Part 8: Expert R
  1. Packaging (R) Pulling it Together into a Package
