Data Science Desktop Survival Guide by Graham Williams

## Formula to Describe the Goal

20200607 In the context of supporting analytic modelling tasks we identify the formula used to describe the model to be built. Typically we will model the target variable on the input variables, so that given a new set of values for the input variables the resulting model can predict the value of the target variable.

Using stats::formula() we can automatically construct the formula from the dataset itself, provided the first column of the dataset is the target variable and the remaining columns are the input variables. Our usual ordering of columns within a dataset places the target variable last rather than first. Selecting the columns named in vars in reverse order, using base::rev(), therefore yields the right formula automatically.

```
form <- formula(ds[rev(vars)]) %T>% print()

## rain_tomorrow ~ min_temp + max_temp + rainfall + evaporation +
##     sunshine + wind_gust_dir + wind_gust_speed + wind_dir_9am +
##     wind_dir_3pm + wind_speed_9am + wind_speed_3pm + humidity_9am +
##     humidity_3pm + pressure_9am + pressure_3pm + cloud_9am +
##     cloud_3pm + temp_9am + temp_3pm + rain_today
```
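To see the mechanism in isolation, here is a minimal sketch on a tiny made-up data frame. The values and the reduced set of columns are assumptions for illustration only, and here vars is simply taken to be names(ds) with the target as the last column.

```r
# Hypothetical toy dataset (values invented for illustration).
# The target, rain_tomorrow, is the last column.
ds <- data.frame(
  min_temp      = c(8.0, 14.0, 13.7),
  max_temp      = c(24.3, 26.9, 23.4),
  rain_today    = factor(c("no", "yes", "yes")),
  rain_tomorrow = factor(c("yes", "yes", "no"))
)

# Assumption: vars lists the columns in dataset order.
vars <- names(ds)

# Reversing the columns puts the target first, so formula() on the
# data frame uses it as the response and the rest as inputs.
form <- formula(ds[rev(vars)])
print(form)
## rain_tomorrow ~ rain_today + max_temp + min_temp
```

In this sketch the inputs appear in reversed order on the right-hand side. For additive terms like these the order does not change which variables are included in the model.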

The notation used to express the formula begins with the name of the target (rain_tomorrow) followed by a tilde (`~`) followed by the variables that will be used to model the target, each separated by a plus (`+`). The formula indicates that we will fit a model to predict rain_tomorrow from the remaining input variables.
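The two sides of a formula can also be taken apart and put back together programmatically. The following sketch uses a shortened, hand-written formula (an assumption for brevity) together with all.vars() and stats::reformulate() from base R.

```r
# A shortened formula written by hand for illustration.
form <- rain_tomorrow ~ min_temp + max_temp + rain_today

# all.vars() lists the variable names, left-hand side first.
target <- all.vars(form)[1]
inputs <- all.vars(form)[-1]

# reformulate() rebuilds a formula from the input names and the target.
rebuilt <- reformulate(inputs, response = target)
print(rebuilt)
```

This can be handy when the list of input variables is computed, for example after dropping variables to be ignored.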

A shorthand for this same formulation is:

 rain_tomorrow ~ .
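The dot stands for every column of the supplied dataset other than the target, so its meaning depends on the data passed alongside the formula. As a small sketch, again on a made-up data frame (an assumption for illustration), stats::terms() can be used to show what the dot expands to.

```r
# Hypothetical toy dataset (values invented for illustration).
ds <- data.frame(
  min_temp      = c(8.0, 14.0, 13.7),
  max_temp      = c(24.3, 26.9, 23.4),
  rain_today    = factor(c("no", "yes", "yes")),
  rain_tomorrow = factor(c("yes", "yes", "no"))
)

# Supplying data lets terms() expand the dot into the remaining columns.
tt <- terms(rain_tomorrow ~ ., data = ds)
inputs <- attr(tt, "term.labels")
print(inputs)
## [1] "min_temp"   "max_temp"   "rain_today"
```

Because the expansion uses whatever columns are present, any identifiers or other variables that should not be used for modelling need to be removed from the dataset before using the shorthand.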