Data Transformation

In this lab, you will learn how to use tidyverse functions to accomplish various data transformation tasks. Data transformation is the process of transforming raw data into a more usable form. It is relatively rare outside of the cloistered setting of a University classroom to have a clean data set ready for analysis. For example, you might want to analyze just Yolo county neighborhoods. Or you might want to discard certain variables from the dataset to reduce clutter. Or you encounter missing data. Some estimate that data cleaning can take upwards of 80% to 90% of your data analysis time.

We will rely on tidyverse to demonstrate data transformation in R because its consistency, together with the goal of increasing productivity, mean that the syntax of tidy functions is typically straightforward to learn. Because of time constraints, we will not be able to go through all of tidyverse’s data transformation functions in this guide. Some you will encounter in other course lab guides. Others you may need to learn on your own. The best resource is the book R for Data Science (RDS), which is free online and was written by tidyverse’s creators. You can also go straight to the tidyverse official website. You can also check out RStudio’s cheat sheets.

Tidyverse package

All of tidyverse’s functions are located in the package tidyverse, which you should have already installed in last lab. Remember that you will install packages once and never more (unless you update R). But, you need to always load the package in your current session using library()

library(tidyverse)

Unlike installing, loading a package using library() needs to happen every time you start a new R session. Install once, never more. Library every time you start R.

Reading data

Let’s bring in some data to help illustrate data transformation. I uploaded a file on GitHub containing the number of Hispanic, white, Asian, and black residents in 2018 in California neighborhoods taken from the United States American Community Survey. To import this file in R, use the tidyverse function read_csv() command.

neighborhoods <- read_csv("https://raw.githubusercontent.com/geo200cn/data/master/week2data.csv")

To write or save data, you would use the function write_csv().

tidyverse can read in more than just csv files. It has a suite of read_ functions that are a part of the subpackage readr. The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. To learn more about these functions, see readr’s dedicated site.

Tibbles

Note the underscore in between read and csv in the code above. There is a base R function called read.csv(). What is the difference? The function read.csv() stores your data into a regular data frame. The function read_csv() stores your data in a special tidy R object - a tibble.

Although the tidyverse works with all data objects, its fundamental object type is the tibble. Tibbles are essentially a special variant of data frames that have desirable properties for printing and joining. To illustrate comparisons between tibbles and regular data frames, let’s bring in the county data we downloaded above but using the base R function read.csv()

neighborhoods.df <- read.csv("https://raw.githubusercontent.com/geo200cn/data/master/week2data.csv")

The object neighborhoods.df is a regular data frame. The first difference between regular data frames and tibbles is how the dataset “looks.” First, the tibble.

neighborhoods

## # A tibble: 7,937 × 7
##         GEOID NAME            hisp nhasn nhblk nhwhite tpopr
##         <dbl> <chr>          <dbl> <dbl> <dbl>   <dbl> <dbl>
##  1 6019001100 Fresno          2033    24   552      81  2760
##  2 6071001600 San Bernardino  5325    11    18      94  5527
##  3 6019000200 Fresno          1935   317   416     153  2870
##  4 6077000801 San Joaquin     4977  1083   400     389  7352
##  5 6019001500 Fresno          1868    36    25     541  2475
##  6 6037204920 Los Angeles     2544     0    27      68  2639
##  7 6077000300 San Joaquin     1216   307   401     291  2240
##  8 6019001000 Fresno          2524   693   842      93  4231
##  9 6037206050 Los Angeles     1550   183   112     127  2010
## 10 6019000400 Fresno          4770   633   106     422  5958
## # ℹ 7,927 more rows

Tibbles have a refined print method that shows only the first 10 rows, and only the columns that fit on the screen. In addition, each column reports its name and type.

Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. Compare the print output above to what you get with a data frame

neighborhoods.df

Ugly, right?

Another important difference between tidy and base R is that read_csv() will always read variables containing text as character variables. In contrast, the base R function read.csv() will sometimes convert a character variable to a factor. Tidy R figures that if you really wanted a character, you can easily convert it when you load it in. read_csv() will also not convert numbers with a leading zero into an integer. It figures that if you want to convert it into an integer, you can do it after the data are read in. In contrast, read.csv() will assume you want it in an integer. And you know what they say about assuming.

Data transformation

One of the clear advantages of tidy R, at least when you are first learning R, is its assortment of user-friendly data transformation functions, which is a part of its dplyr package. Compared to base R, the beauty of these dplyr functions is that they feature consistent design principles and easily work with non-standard evaluation (i.e., you don’t have to put quotes around variable names).

In the rest of this guide, we’ll go through some of the basic data transformation techniques using the functions found in the package dplyr, which was automatically installed and loaded when you brought in the tidyverse package. These functions can be used for tibbles and regular data frames.

Renaming variables

You will likely encounter a variable in a dataset with a name that is not descriptive. The more descriptive the names, the more efficient your analysis will be and the less likely you are going to make a mistake. To see the names of variables in your dataset, use the names() command.

names(neighborhoods)

## [1] "GEOID"   "NAME"    "hisp"    "nhasn"   "nhblk"   "nhwhite" "tpopr"

Use the command rename() to - what else? - rename a variable! Let’s rename NAME to County.

neighborhoods <- rename(neighborhoods, County = "NAME")

Selecting variables

In practice, most of the data files you will download will contain variables you don’t need. It is easier to work with a smaller dataset as it reduces clutter and clears up memory space, which is important if you are executing complex tasks on a large number of observations. Use the command select() to keep variables by name. Visually, we are doing the following:

Let’s keep County, GEOID, and hisp, nhwhite, nhblk and tpopr from the neighborhoods dataset.

select(neighborhoods, County, GEOID, hisp, nhwhite, nhblk, tpopr)

## # A tibble: 7,937 × 6
##    County              GEOID  hisp nhwhite nhblk tpopr
##    <chr>               <dbl> <dbl>   <dbl> <dbl> <dbl>
##  1 Fresno         6019001100  2033      81   552  2760
##  2 San Bernardino 6071001600  5325      94    18  5527
##  3 Fresno         6019000200  1935     153   416  2870
##  4 San Joaquin    6077000801  4977     389   400  7352
##  5 Fresno         6019001500  1868     541    25  2475
##  6 Los Angeles    6037204920  2544      68    27  2639
##  7 San Joaquin    6077000300  1216     291   401  2240
##  8 Fresno         6019001000  2524      93   842  4231
##  9 Los Angeles    6037206050  1550     127   112  2010
## 10 Fresno         6019000400  4770     422   106  5958
## # ℹ 7,927 more rows

You can use also use select() command to keep variables except for the ones you designate. For example, to keep all variables in neighborhoods except nhasn and save this back into neighborhoods, type in

neighborhoods <- select(neighborhoods, -nhasn)
neighborhoods

The negative sign tells R to exclude the variable.

Creating new variables

The mutate() function allows you to create new variables within your dataset. This is important when you need to transform variables in some way - for example, calculating a ratio or adding two variables together. Visually, you are doing this

You can use the mutate() command to generate as many new variables as you would like. For example, let’s construct three new variables in neighborhoods - the percent of residents who are non-Hispanic white, non-Hispanic black, and Hispanic. Name these variables pwhite, pblack, and phisp, respectively.

mutate(neighborhoods, pwhite = nhwhite/tpopr, pblack = nhblk/tpopr, 
       phisp = hisp/tpopr)

Note that you can create new variables based on the variables you just created in the same line of code. For example, you can create a categorical variable yielding “Majority” if the tract is majority Hispanic and “Not Majority” otherwise after creating the percent Hispanic variable within the same mutate() command. Let’s save these changes back into neighborhoods.

neighborhoods <- mutate(neighborhoods, pwhite = nhwhite/tpopr,  
              pblack = nhblk/tpopr, phisp = hisp/tpopr,
              mhisp = ifelse(phisp > 0.5, "Majority","Not Majority"))
neighborhoods

We used the function ifelse() to create mhisp - the function tells R that if the condition phisp > 0.5 is met, the tract’s value for the variable mhisp will be “Majority”, otherwise it will be “Not Majority”.

Filtering

Filtering means selecting rows/observations based on their values. To filter in R, use the command filter(). Visually, filtering rows looks like.

The first argument in the parentheses of this command is the name of the data frame. The second and any subsequent arguments (separated by commas) are the expressions that filter the data frame. For example, we can select Sacramento county

filter(neighborhoods, County == "Sacramento")

The double equal operator == means equal to. We can also explicitly exclude cases and keep everything else by using the not equal operator !=. The following code excludes Sacramento county.

filter(neighborhoods, County != "Sacramento")

What about filtering if a neighborhood has a value greater than a specified value? For example, neighborhoods with a percent white greater than 0.5 (50%).

filter(neighborhoods, pwhite > 0.5)

What about less than 0.5 (50%)?

filter(neighborhoods, pwhite < 0.5)

Both lines of code do not include neighborhoods that have a percent white equal to 0.5. We include it by using the less than or equal operator <= or greater than or equal operator >=.

filter(neighborhoods, pwhite <= 0.5)

In addition to comparison operators, filtering may also utilize logical operators that make multiple selections. There are three basic logical operators: & (and), | is (or), and ! is (not). We can keep neighborhoods with phisp greater than 0.5 and tpopr greater than 5000 using &.

filter(neighborhoods, phisp > 0.5 & tpopr > 5000)

Use | to keep neighborhoods in Sacramento or Yolo

filter(neighborhoods, County == "Sacramento" | County == "Yolo")

Joining tables

Often you will have two datasets containing the same observations but with different variables. Your goal is to combine the datasets into a single one.

Let’s join a dataset containing some health and air pollution data from California’s CalEnviroScreen program. Read it in from GitHub using our new friend read_csv().

ces <- read_csv("https://raw.githubusercontent.com/geo200cn/data/master/ces.csv")
names(ces)

The dataset contains the poverty rate, air pollution levels (PM2.5), and whether the neighborhood is environmentally disadvantaged according to the CES. To join the objects ces and neighborhoods, we need a unique ID that connects the observations across the two files. This unique is named GEOID in both files. The IDs should be the same data class, which is the case.

class(ces$GEOID)

## [1] "numeric"

class(neighborhoods$GEOID)

## [1] "numeric"

To merge the datasets together, use the function left_join(), which matches pairs of observations whenever their keys or IDs are equal. We match on the variable GEOID and save the merged data set into a new object called neighces.

neighces <- left_join(neighborhoods, ces, by = "GEOID")

The argument by tells R which variable(s) to match rows on, in this case GEOID. You can match on multiple variables and you can also match on a single variable with different variable names (see the left_join() help documentation for how to do this).

Other functions

If you have a dataset with a lot of variables, you might want to put important variables at or close to the front. To relocate variables in your dataset, use the function relocate(). Currently, this is the order of variables in neighborhoods

names(neighborhoods)

##  [1] "GEOID"   "County"  "hisp"    "nhblk"   "nhwhite" "tpopr"   "pwhite" 
##  [8] "pblack"  "phisp"   "mhisp"

Use relocate() to move up tpopr right after GEOID and County

neighborhoods <- relocate(neighborhoods, GEOID, County, tpopr)
names(neighborhoods)

##  [1] "GEOID"   "County"  "tpopr"   "hisp"    "nhblk"   "nhwhite" "pwhite" 
##  [8] "pblack"  "phisp"   "mhisp"

What if you want to arrange rows? Use the arrange() function. Arranging or sorting data refers to the process of ordering rows numerically or alphabetically in a data frame or table by the values of one or more variables. Sorting can make it easier to visually scan raw data, such as for the purposes of identifying extreme or outlier values.

Use the function head() to see what are the observations at the top of your file

head(neighborhoods)

## # A tibble: 6 × 10
##        GEOID County         tpopr  hisp nhblk nhwhite pwhite  pblack phisp mhisp
##        <dbl> <chr>          <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl> <dbl> <chr>
## 1 6019001100 Fresno          2760  2033   552      81 0.0293 0.2     0.737 Majo…
## 2 6071001600 San Bernardino  5527  5325    18      94 0.0170 0.00326 0.963 Majo…
## 3 6019000200 Fresno          2870  1935   416     153 0.0533 0.145   0.674 Majo…
## 4 6077000801 San Joaquin     7352  4977   400     389 0.0529 0.0544  0.677 Majo…
## 5 6019001500 Fresno          2475  1868    25     541 0.219  0.0101  0.755 Majo…
## 6 6037204920 Los Angeles     2639  2544    27      68 0.0258 0.0102  0.964 Majo…

What if we wanted to sort the rows by population size?

neighborhoods <- arrange(neighborhoods, tpopr)
head(neighborhoods)

## # A tibble: 6 × 10
##        GEOID County        tpopr  hisp nhblk nhwhite pwhite pblack  phisp mhisp 
##        <dbl> <chr>         <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <chr> 
## 1 6075980300 San Francisco    66     0     0      55  0.833      0 0      Not M…
## 2 6017031900 El Dorado        75     4     0      69  0.92       0 0.0533 Not M…
## 3 6037575500 Los Angeles      92    80     0      12  0.130      0 0.870  Major…
## 4 6037930101 Los Angeles      92     0     0      48  0.522      0 0      Not M…
## 5 6037980019 Los Angeles     151     0     0     125  0.828      0 0      Not M…
## 6 6065940100 Riverside       166    25     0     135  0.813      0 0.151  Not M…

How would you go about sorting the data set by population size from largest to smallest?

Function conflicts

You’ll learn to understand that R is beautiful. But, it isn’t perfect. One of its annoyances is that because R is open source, R users who contribute packages sometimes create functions with the same names. To demonstrate this, let’s load in the package MASS

library(MASS)

You see the warning? Keep it in mind.

Let’s use the function select() again to keep the variables GEOID and PM2.5 from neighces

select(neighces, GEOID, PM2.5)

## Error in `select()`:
## ! unused arguments (GEOID, PM2.5)

What in the world just happened? By loading MASS, we have overwritten tidyverse’s select() function (hence the warning when you loaded MASS). R will choose the function from the package that was most recently loaded. The normal workaround is to be explicit about which select() function you want by using dplyr::select() as in

dplyr::select(neighces, GEOID, PM2.5)

## # A tibble: 7,937 × 2
##         GEOID PM2.5
##         <dbl> <dbl>
##  1 6019001100  15.4
##  2 6071001600  13.3
##  3 6019000200  15.4
##  4 6077000801  12.5
##  5 6019001500  15.4
##  6 6037204920  12.9
##  7 6077000300  13.4
##  8 6019001000  15.4
##  9 6037206050  12.9
## 10 6019000400  15.4
## # ℹ 7,927 more rows

To unload a package (i.e. “unlibrary” it) use the detach() function as follows.

detach("package:MASS")

Pipes

One of the important innovations from the tidyverse is the pipe operator %>%. Pipes are the workhorse of tidy analyses. Piping allows you to chain together many functions, eliminating the need to define multiple intermediate objects to use as the input to subsequent functions. Pipes are also the primary reason that tidyverse code is fundamentally easier to read than base R code.

It may seem complicated at first, but what the pipe does is actually quite simple. That is, it allows users to write linear code. To illustrate the use of the pipe, consider the following base R code that takes the mean of the log of three numbers

mean(log(c(1, 3, 9)))

## [1] 1.098612

Notice how the numbers c(1, 3, 9) are nested inside log(), which is then nested inside mean(). If you’re reading the code from left-to-right, it means the functions are performed in reverse order from how they are written. If we broke the code down into its three functions, we would actually expect the order of operations to proceed as follows

Concatenate numbers into vector c(...)
Log the numeric vector log(...)
Estimate the mean of the logged numeric vector mean(...)

Now consider how you would do this using pipes.

c(1, 3, 9) %>% 
  log() %>% 
  mean()

## [1] 1.098612

In contrast to the nested base code, the tidy code is linear; in other words, the code appears in the same order (moving from left to right) as the operations are performed. Let’s bring back our original neighborhoods dataset

neighborhoods <- read_csv("https://raw.githubusercontent.com/geo200cn/data/master/week2data.csv")

and use pipes to

rename the variable NAME to County
keep all variables except nhasn
create the percent race/ethnicity and majority Hispanic variables
join the ces file

neighces <- neighborhoods %>%
  rename(County = NAME) %>%
  select(GEOID, County, hisp, nhblk, nhwhite, tpopr) %>%
  mutate(pwhite = nhwhite/tpopr, phisp = hisp/tpopr,
         pblk = nhblk/tpopr, 
         mhisp = ifelse(phisp > 0.5, "Majority","Not Majority")) %>%
  left_join(ces, by = "GEOID")

In the code above, the tibble neighborhoods is piped into the command rename(). This command changes NAME to County. The resulting object gets piped into the command select(). And so on.

Piping makes code clearer, and simultaneously gets rid of the need to define any intermediate objects that you would have needed to keep track of while reading the code.

To get some practice on data transformation, answer the homework questions that are uploaded on Canvas in an RMarkdown file. Submit the R Markdown and its knitted document on Canvas by the assignment due date.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Website created and maintained by Noli Brazil