3.7 Pipe Operator
A function (Section 3.5) performs an action on input data and returns the result of that action as its output. Functions are the verbs of the language: the action words of our sentences.
As we learn new functions we will construct longer sentences that string together a sequence of verbs to deliver the outcomes. This is a powerful programming concept, combining dedicated, well designed and implemented functions, each focused on achieving a specific outcome.
To combine single-focus functions into more complex operations we use the powerful concept of pipes. Pipes will be familiar to command line users of Unix and Linux. The idea is to pass the output of one function on as the input to another function. Each function does one task well, accurately, and simply from a user’s point of view. We can pipe together many such specialist functions to deliver complex and sophisticated data transformations in an easily accessible manner.
Pipes were introduced in R through the magrittr (Bache and Wickham 2020) package (%>%) and became part of base R in 2021 with version 4.1 (|>).
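As a minimal sketch of the idea, the following compares a nested function call with the same computation written using each pipe operator. The vector x is a made-up example rather than part of the weather dataset, the magrittr package is assumed to be installed, and the native pipe assumes R 4.1 or later.

```r
library(magrittr)  # Provides the %>% pipe.

# A small made-up vector for illustration.
x <- c(1, 4, 9, 16)

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# The magrittr pipe reads left to right, one verb at a time.
piped <- x %>% sqrt() %>% mean() %>% round(1)

# The native pipe is written the same way.
native <- x |> sqrt() |> mean() |> round(1)

# All three forms give the same result.
identical(nested, piped) && identical(piped, native)
## [1] TRUE
```

In each piped form the value on the left becomes the first argument of the function on the right, so round(1) is called as round(<value>, 1).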
To illustrate the concept of pipes recall the contents of the dataset (rattle::weatherAUS):
# Review the dataset of weather observations.
ds
## # A tibble: 176,747 x 24
## date location min_temp max_temp rainfall evaporation sunshine
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008-12-01 Albury 13.4 22.9 0.6 NA NA
## 2 2008-12-02 Albury 7.4 25.1 0 NA NA
## 3 2008-12-03 Albury 12.9 25.7 0 NA NA
## 4 2008-12-04 Albury 9.2 28 0 NA NA
## 5 2008-12-05 Albury 17.5 32.3 1 NA NA
## 6 2008-12-06 Albury 14.6 29.7 0.2 NA NA
## 7 2008-12-07 Albury 14.3 25 0 NA NA
## 8 2008-12-08 Albury 7.7 26.7 0 NA NA
## 9 2008-12-09 Albury 9.7 31.9 0 NA NA
## 10 2008-12-10 Albury 13.1 30.1 1.4 NA NA
## # … with 176,737 more rows, and 17 more variables: wind_gust_dir <ord>,
## # wind_gust_speed <dbl>, wind_dir_9am <ord>, wind_dir_3pm <ord>,
## # wind_speed_9am <dbl>, wind_speed_3pm <dbl>, humidity_9am <int>,
## # humidity_3pm <int>, pressure_9am <dbl>, pressure_3pm <dbl>,
## # cloud_9am <int>, cloud_3pm <int>, temp_9am <dbl>, temp_3pm <dbl>,
## # rain_today <fct>, risk_mm <dbl>, rain_tomorrow <fct>
We might be interested in the distribution of specific numeric variables. For that we will dplyr::select() a few numeric variables using a pipe.
# Select variables from the dataset.
ds %>%
  select(min_temp, max_temp, rainfall, sunshine)
## # A tibble: 176,747 x 4
## min_temp max_temp rainfall sunshine
## <dbl> <dbl> <dbl> <dbl>
## 1 13.4 22.9 0.6 NA
## 2 7.4 25.1 0 NA
## 3 12.9 25.7 0 NA
## 4 9.2 28 0 NA
## 5 17.5 32.3 1 NA
## 6 14.6 29.7 0.2 NA
## 7 14.3 25 0 NA
## 8 7.7 26.7 0 NA
## 9 9.7 31.9 0 NA
## 10 13.1 30.1 1.4 NA
## # … with 176,737 more rows
Typing ds by itself lists the whole dataset. Piping the whole dataset to dplyr::select() using the pipe %>% selects the named variables. The end result returned as the output of the pipeline is a subset of the original dataset containing just the named columns.
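Because the output of a pipeline is itself a dataset, further functions can be appended to it. The sketch below assumes the dplyr package is installed and uses R's built-in airquality dataset as a stand-in for ds so that it runs on its own. Ending the pipeline with summary() is one illustrative way to review the distribution of the selected variables, not necessarily the step the text takes next.

```r
library(dplyr)  # Provides select() and makes %>% available.

# Stand-in for ds: built-in daily air quality measurements.
# Select two numeric variables, then summarise their distributions.
aq_summary <- airquality %>%
  select(Ozone, Temp) %>%
  summary()

# Print the per-variable summary statistics.
aq_summary
```

Each additional step is simply another verb added to the end of the sentence, and the dataset on the left is never modified.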
References
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.