• Data Science Desktop Survival Guide
  • Graham Williams
  • Preface
    • About this Book
    • Organisation of the Book
    • Acknowledgements
    • Waiver and Copyright
  • 1 Data Science
    • 1.1 Data Wrangling
    • 1.2 Data Analyst
    • 1.3 Data Scientist
    • 1.4 Data Science Through Doing
  • 2 Introducing R and Rattle
    • 2.1 Installing R
    • 2.2 Installing R Libraries
    • 2.3 Installing R with CRAN on Debian/Ubuntu
    • 2.4 Installing Rattle
    • 2.5 Installing Rattle on Linux
    • 2.6 Installing Rattle on Linux Debian/Ubuntu
    • 2.7 Installing Rattle on Linux Homebrew
    • 2.8 Installing Rattle on Linux Snap
    • 2.9 Installing Rattle on Linux Zip Local User
    • 2.10 Installing Rattle on Linux Zip All Users
    • 2.11 Installing Rattle on macOS
    • 2.12 Installing Rattle on macOS Homebrew
    • 2.13 Installing Rattle on Windows
    • 2.14 Installing Rattle on Windows Inno
    • 2.15 Installing Rattle on Windows Zip
    • 2.16 Installing Rattle on Windows Troubleshooting
    • 2.17 Installing Rattle V5 Deprecated
    • 2.18 Installing RStudio
    • 2.19 Introducing RStudio
    • 2.20 Editing Code
    • 2.21 Executing Code
    • 2.22 RStudio Review
    • 2.23 Packages and Libraries
    • 2.24 A Glimpse of the Dataset
    • 2.25 Attaching a Package
    • 2.26 Simplifying Commands
    • 2.27 Working with the Library
    • 2.28 Getting Help
  • 3 R Constructs
    • 3.1 A Data Frame as a Dataset
    • 3.2 A Tibble as a Dataset
    • 3.3 Assignment
    • 3.4 Case Statement
    • 3.5 Commands
    • 3.6 Functions
    • 3.7 Operators
    • 3.8 Pipe Operator
    • 3.9 Pipeline
    • 3.10 Pipeline Construction
    • 3.11 Pipeline Identity Operator
    • 3.12 Pipeline Syntactic Sugar
    • 3.13 Pipes: Assignment Pipe
    • 3.14 Pipes: Exposition Pipe
    • 3.15 Pipes: Tee Pipe
    • 3.16 Pipes: Tee Pipe Intermediate Results
    • 3.17 Pipes: Tee Pipe Load CSV Files
    • 3.18 Pipes and Plots
    • 3.19 Variables
  • 4 R Tasks
    • 4.1 Create Directory
    • 4.2 File Exists
    • 4.3 Format Date
  • 5 Dates
    • 5.1 Dates Setup
    • 5.2 Convert String to Date
    • 5.3 Extract Date from Date Time Object
    • 5.4 Extract Time from Date Time Object
    • 5.5 Extract Year, Month and Day
    • 5.6 Format Dates
    • 5.7 Format Strings
  • 6 Strings
    • 6.1 Strings Setup
    • 6.2 Case Conversion
    • 6.3 Concatenate Strings
    • 6.4 Concatenate NULL
    • 6.5 Glue Strings Together
    • 6.6 Glue Pipelines
    • 6.7 Last Character of a String
    • 6.8 Length of String
    • 6.9 Random Strings
    • 6.10 Regexp Pattern Matching
    • 6.11 Regexp Quantifiers
    • 6.12 Regexp Character Classes
    • 6.13 Sub Strings in Base R
    • 6.14 Sub Strings in the Tidyverse
    • 6.15 Substitute Strings
    • 6.16 Trim and Pad
    • 6.17 Wrapping and Words
  • 7 R Read, Write, and Create
    • 7.1 Data Creation Setup
    • 7.2 Clipboard Data
    • 7.3 CSV Data Reading
    • 7.4 CSV Data Writing
    • 7.5 Excel Data Reading
    • 7.6 Excel Data Writing
    • 7.7 MATLAB Data
    • 7.8 Random Dataset
    • 7.9 Read Strings from a File
    • 7.10 TSV Data
  • 8 Data Template
    • 8.1 Dataset Setup
    • 8.2 Data Glimpse
    • 8.3 Normalise Variable Names
    • 8.4 Variables and Model Target
    • 8.5 Identify Numeric Variables
    • 8.6 Normalise Factor Levels
    • 8.7 Modelling Roles
    • 8.8 Variable Types
    • 8.9 Formula to Describe the Goal
    • 8.10 Missing Values
    • 8.11 Random Seed
    • 8.12 Train, Tune, and Test Datasets
    • 8.13 Review the Dataset
    • 8.14 Data Template
  • 9 Data Exploration
    • 9.1 Exploration Setup
    • 9.2 Counting Groups
    • 9.3 Random Sample
    • 9.4 Constant Variables
    • 9.5 Correlated Numeric Variables
    • 9.6 Missing Values in Rattle
    • 9.7 Selecting Columns
    • 9.8 Selecting Rows
    • 9.9 Shuffle Rows
  • 10 Data Wrangling
    • 10.1 Wrangling Setup
    • 10.2 Wrangling Data Review
    • 10.3 Add Columns
    • 10.4 Add Columns Using Variables
    • 10.5 Add Counts
    • 10.6 Binning
    • 10.7 Cleanup
    • 10.8 Combine Rows
    • 10.9 Counting Groups
    • 10.10 Dollar to Numeric Conversion
    • 10.11 Drop Columns
    • 10.12 Drop Obs with Missing Values
    • 10.13 Extract Column as Vector
    • 10.14 Filter Rows Having Missing Values
    • 10.15 Imputation
    • 10.16 Impute Constant
    • 10.17 Impute Mean/Median/Mode
    • 10.18 Impute Zero/Missing
    • 10.19 Indicator Variables
    • 10.20 Join Categorics
    • 10.21 Lag Variable Calculations
    • 10.22 Missing Value Imputation
    • 10.23 Modify Columns
    • 10.24 Normalise Variables
    • 10.25 Pivot Pairwise Binary Table
    • 10.26 Rename Variables
    • 10.27 Replace Missing Values
    • 10.28 Rescale Data in Rattle
    • 10.29 Rescale Data using Rank
    • 10.30 Rescale Data using Recenter in Rattle
    • 10.31 Subset of Rows Within Groups
    • 10.32 Transforming Data in Rattle
    • 10.33 Data Source
    • 10.34 Data Ingestion
    • 10.35 The Shape of the Dataset
    • 10.36 A Glimpse of the Dataset
    • 10.37 Introducing Template Variables
    • 10.38 Locating Datasets in Memory
    • 10.39 Changing Datasets in Memory
    • 10.40 Reviewing Variable Names
    • 10.41 Effect on Data Storage
    • 10.42 Special Case Variable Name Transformations
    • 10.43 Data Review
    • 10.44 Dataset Head and Tail
    • 10.45 Random Observations
    • 10.46 Characters
    • 10.47 Factors
    • 10.48 Location
    • 10.49 Evaporation and Sunshine
    • 10.50 Wind Directions
    • 10.51 Ordered Factor
    • 10.52 Rain
    • 10.53 Numeric
    • 10.54 Logical
    • 10.55 Variable Roles
    • 10.56 Risk Variable
    • 10.57 ID Variables
    • 10.58 Ignore IDs and Outputs
    • 10.59 Ignore Missing
    • 10.60 Ignore Excessive Level Variables
    • 10.61 Dealing with Correlations
    • 10.62 Removing Ignored Variables
    • 10.63 Feature Selection
    • 10.64 Missing Targets
    • 10.65 Omitting Observations
    • 10.66 Normalise Factors
    • 10.67 Target as a Factor
    • 10.68 Identify Variable Types
    • 10.69 Identify Numeric and Categoric Variables
    • 10.70 Save the Dataset
    • 10.71 A Template for Data Preparation
  • 11 Data Visualisation
    • 11.1 Visualisation Setup
    • 11.2 Visualisation Data
    • 11.3 Visualisation Data Review
    • 11.4 Arranging Plots Quickly with Blanket
    • 11.5 Bar Chart
    • 11.6 Bar Chart Colour No Legend
    • 11.7 Bar Chart Dodge
    • 11.8 Bar Chart Dodge Labelled Colour Brewer
    • 11.9 Bar Chart Faceted Background
    • 11.10 Bar Chart Flipped Colour Mean no Legend
    • 11.11 Bar Chart Flipped Colour Mean Confidence Intervals
    • 11.12 Bar Chart Flipped Sorted Axes
    • 11.13 Bar Chart Flipped Text Annotations
    • 11.14 Bar Chart Flipped Text Annotations Commas
    • 11.15 Bar Chart Labels
    • 11.16 Bar Chart Narrow Bars Economist Theme
    • 11.17 Bar Chart Ordered X Axis
    • 11.18 Bar Chart Stacked
    • 11.19 Bar Chart Supplied Values
    • 11.20 Bar Chart Texts
    • 11.21 Bar Chart Wide Bars
    • 11.22 Bar Chart Wide and Borders
    • 11.23 Bar Chart Showcase Solar
    • 11.24 Benford Plot
    • 11.25 Box Plot
    • 11.26 Box Plot in Rattle
    • 11.27 Colour Names
    • 11.28 Colour Ranges
    • 11.29 Cumulative Plot
    • 11.30 Density Plot
    • 11.31 Density Plot in Rattle
    • 11.32 Dot Plot
    • 11.33 Faceted Location Scatter Plot
    • 11.34 Faceted Location Thin Lines
    • 11.35 Faceted Wind Direction
    • 11.36 Filter Data Within Plot
    • 11.37 Labels and Titles
    • 11.38 Labels with Comma
    • 11.39 Labels with Dates Economist Theme
    • 11.40 Labels with Dollars
    • 11.41 Labels Removed
    • 11.42 Labels Rotated
    • 11.43 Legend Position
    • 11.44 Legend Removal
    • 11.45 Line Chart Basic
    • 11.46 Line Chart Density Distribution
    • 11.47 Line Chart Skewed Distributions
    • 11.48 Line Chart Log X Axis
    • 11.49 Line Chart Log Breaks
    • 11.50 Line Chart Log Ticks
    • 11.51 Line Chart Log Custom Labels
    • 11.52 Mosaic Plot
    • 11.53 Multiple Plots in a Single Plot
    • 11.54 Pairs Plot: Using ggpairs()
    • 11.55 Pie Chart
    • 11.56 Plotting Regions
    • 11.57 Rose Chart
    • 11.58 Rose Chart Discussion
    • 11.59 Save Plot to File
    • 11.60 Scatter Plot
    • 11.61 Scatter Plot Colour Shape Theme BW
    • 11.62 Scatter Plot Colour Choice
    • 11.63 Scatter Plot Smooth Gam
    • 11.64 Scatter Plot Smooth Loess
    • 11.65 Text Path
    • 11.66 Text Path Clock Multiple Plots
    • 11.67 Text Path Density Plot Text Location
    • 11.68 Text Path Smooth Plot
    • 11.69 Themes in Rattle
    • 11.70 Violin Plot
    • 11.71 Violin Plot Embedded Box Plot
    • 11.72 Violin Plot Faceted Location
    • 11.73 XFig Support
  • 12 Statistics
    • 12.1 Analysis of Variance ANOVA
    • 12.2 Correlation Test
    • 12.3 F-Test Two-Sample
    • 12.4 Kolmogorov-Smirnov Test
    • 12.5 t-Test Two-Sample
    • 12.6 Wilcoxon Rank Sum Test
    • 12.7 Wilcoxon Signed Rank Test
  • 13 Spatial Data and Maps
    • 13.1 Geocodes
    • 13.2 Understanding Spatial Data
    • 13.3 Plotting Shapefiles
    • 13.4 Google Maps: Geocoding
  • 14 Model Template
    • 14.1 ML Setup
    • 14.2 ML Data and Variables
    • 14.3 ML Modelling Setup
    • 14.4 ML Data Glimpse
    • 14.5 Model Building
    • 14.6 Predict Class
    • 14.7 Predict Probability
    • 14.8 Evaluating the Model
    • 14.9 Accuracy and Error Rate
    • 14.10 Confusion Matrix
    • 14.11 ROC Chart
    • 14.12 Risk Chart
    • 14.13 Biased Estimate from the Training Dataset
    • 14.14 Step 8: Save the Model to File
    • 14.15 Boilerplate
    • 14.16 Command Summary
    • 14.17 Model Template Further Reading
    • 14.18 Model Template Example
  • 15 ML Scenarios
    • 15.1 Machine Learning Setup
    • 15.2 Multi Arm Bandit
    • 15.3 Reinforcement Learning
    • 15.4 Supervised Learning
    • 15.5 Unsupervised Learning
    • 15.6 Hyper-Parameter Tuning
  • 16 ML Activities
    • 16.1 Classification
    • 16.2 Cluster Analysis
    • 16.3 Outlier Detection
    • 16.4 Prediction
    • 16.5 Text Mining
  • 17 ML Applications
    • 17.1 Airport Gate Assignment
    • 17.2 Article Summarisation
    • 17.3 Facial Recognition
    • 17.4 Gene Detection
    • 17.5 Group Recommendations
    • 17.6 Program Comprehension
    • 17.7 Pull Request Generation
    • 17.8 Reservoir Inflow
    • 17.9 Sentiment Analysis
    • 17.10 Topic Modelling
  • 18 ML Algorithms
    • 18.1 Algorithms Setup
    • 18.2 Algorithms Data and Variables
    • 18.3 Algorithms Data Review
    • 18.4 Collaborative Filtering
    • 18.5 Convolutional Neural Network CNN
    • 18.6 Decision Trees
    • 18.7 Deep Convolutional Generative Adversarial Network
    • 18.8 Generative Adversarial Network
    • 18.9 K Means Clustering
    • 18.10 K Nearest Neighbours
    • 18.11 Linear Regression
    • 18.12 Logistic Regression
    • 18.13 Long Short-Term Memory Neural Networks LSTM
    • 18.14 Multi Layer Perceptron
    • 18.15 Neural Networks
    • 18.16 Naive Bayes
    • 18.17 One-Class Support Vector Machine
    • 18.18 Recurrent Neural Networks
    • 18.19 Residual Neural Network
    • 18.20 Support Vector Machine
  • 19 Cluster Analysis
    • 19.1 Clustering Setup
    • 19.2 Biclustering
    • 19.3 K-Means Clustering
  • 20 Decision Trees
    • 20.1 Decision Trees Setup
    • 20.2 Decision Trees Modelling Setup
    • 20.3 Rattle Startup
    • 20.4 Rattle Weather Dataset
    • 20.5 Rattle Summary of Dataset
    • 20.6 Rattle Model Tab
    • 20.7 Rattle Build Tree
    • 20.8 Interpret RPart Decision Tree
    • 20.9 Rattle View Decision Tree
    • 20.10 Rattle Error Matrix
    • 20.11 Rattle Risk Chart
    • 20.12 Rattle ROC Curve
    • 20.13 Rattle Hand Plots
    • 20.14 Rattle Score Dataset
    • 20.15 Rattle Log
    • 20.16 Rattle GUI to R
    • 20.17 Build a Decision Tree Model
    • 20.18 Summary of the Model
    • 20.19 Complexity Parameter
    • 20.20 Complexity Parameter Plot
    • 20.21 Complexity Parameter Behaviour
    • 20.22 Complexity Parameter 0
    • 20.23 Complexity Parameter Table
    • 20.24 Variable Importance
    • 20.25 Node Details and Surrogates
    • 20.26 Decision Tree Performance
    • 20.27 Rules from Decision Tree
    • 20.28 Rules Using Rpart Plot
    • 20.29 Plot Decision Trees
    • 20.30 Plot Decision Tree Uniformly
    • 20.31 Plot Decision Tree with Extra Information
    • 20.32 Fancy Rpart Plot
    • 20.33 RPart Plot Default Tree
    • 20.34 RPart Plot Favourite
    • 20.35 Enhanced Plot: With Colour
    • 20.36 Enhanced Plots: Label all Nodes
    • 20.37 Enhanced Plots: Labels Below Nodes
    • 20.38 Enhanced Plots: Split Labels
    • 20.39 Enhanced Plots: Interior Labels
    • 20.40 Enhanced Plots: Number of Observations
    • 20.41 Enhanced Plots: Add Percentage of Observations
    • 20.42 Enhanced Plots: Classification Rate
    • 20.43 Enhanced Plots: Add Percentage of Observations
    • 20.44 Enhanced Plots: Misclassification Rate
    • 20.45 Enhanced Plots: Probability per Class
    • 20.46 Enhanced Plots: Add Percentage Observations
    • 20.47 Enhanced Plots: Only Probability Per Class
    • 20.48 Enhanced Plots: Probability of Second Class
    • 20.49 Enhanced Plots: Add Percentage Observations
    • 20.50 Enhanced Plots: Only Probability of Second Class
    • 20.51 Enhanced Plots: Probability of the Class
    • 20.52 Enhanced Plots: Overall Probability
    • 20.53 Enhanced Plots: Percentage of Observations
    • 20.54 Enhanced Plots: Show the Node Numbers
    • 20.55 Enhanced Plots: Show the Node Indices
    • 20.56 Enhanced Plots: Line up the Leaves