# setwd("~/Library/CloudStorage/Box-Box/Teaching/R_2023/local/lec3")
library(haven)
Lecture 3
Course Updates
Lecture 1’s link is now available on GitHub
Any Questions? Comments? Suggestions?
Recap from Yesterday
Control structures
- Conditional statements (
if
,else
,ifelse
) - Loops (
for
,while
,repeat
,break
,next
)
- Conditional statements (
Functions:
- Introduction to built-in functions
- Writing custom functions
- The
apply
family of functions (lapply
,sapply
,tapply
, etc.)
Library and Packages
Data Manipulation
base
tidyverse
data.table
Introduction
In this lecture, we will delve into the following topics:
- More in Data Cleaning and Wrangling:
- Revisit the Combining Data sets from yesterday
merge
:base
join
:dplyr
- Compare with
1:m
,m:1
,1:1
in STATA
- Data Reshaping
data.table
- Data Standardization
- Dates:
- Text:
stringr
- Unique Identifier
- Revisit the Combining Data sets from yesterday
- Basic Data Analysis and Data Visualization in R
- Preliminary Data Checks
- Simple Regression
- Results and Visualization
- A Comprehensive Project in Class
- A review for the last 2.5 days
- Do this in the
rmd
file I provide you. - Your output should be a
html
report
For Tomorrow
- Intermediate Econometrics in R Review
- Mathematical Statistics in R
- Distributions
- Simulations
- Intermediate Applied Econometrics
- Testing
- …
- Mathematical Statistics in R
Data Manipulation Again
Merge: Comparison with STATA
In STATA, when merging datasets, you specify the type of merge using:
1:1
: One-to-one join: Each observation in the dataset has a unique identifier, and each identifier in one dataset matches to one and only one identifier in the other dataset.1:m
: One-to-many join: You start with the “many” dataset. For each unique identifier in the “many” dataset, there’s a corresponding single observation in the “one” dataset. When you perform the merge, each of the multiple observations in the “many” dataset with the same identifier gets matched to a single observation in the “one” dataset.m:1
: Many-to-one join: You start with the “one” dataset. For each unique identifier in the “one” dataset, there are multiple corresponding observations in the “many” dataset. When you perform the merge, the single observation in the “one” dataset gets matched to each of the multiple observations in the “many” dataset with the same identifier.
Let’s use examples to understand the link and difference between merging datasets in R and STATA. We will see two cases: 1) all keys
can be matched and 2) some keys
cannot be matched:
- Demo with
R
and save the temporary data into a.dta
file using thehaven
library. - Use
STATA
commands and see if the output matches our expectations.
Every Key
Exist in Both Data Set
All three should be the same.
R Demo using base
Since we need to save datasets, we need to define the working directory and libraries we are going to use as usual.
One-to-One (1:1
) Merge:
<- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df1 <- data.frame(ID = c(1, 2, 3), Age = c(25, 30, 28))
df2
write_dta(df1, "df1.dta")
write_dta(df2, "df2.dta")
<- merge(df1, df2, by = "ID")
merged_df write_dta(merged_df, "one_to_one_merge.dta")
Many-to-One (m:1
) Merge:
<- data.frame(ID = c(1, 1, 2, 3), Score = c(85, 90, 88, 92))
many_df <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
one_df
write_dta(many_df, "many_df.dta")
write_dta(one_df, "one_df.dta")
<- merge(many_df, one_df, by = "ID")
merged_df write_dta(merged_df, "many_to_one_merge_R.dta")
One-to-Many (1:m
) Merge:
(Using the same example datasets as for the m:1
merge)
<- merge(one_df, many_df, by = "ID")
merged_df write_dta(merged_df, "one_to_many_merge_R.dta")
STATA Demo
We should expect the perfect match.
R Demo with dplyr
: left_join
and right_join
For this case, the results should be the same. Because both many_df
and one_df
are having information for all IDs. Let us double-check if we are using left_join
, right_join
in R.
First, we’ll create two data frames: teachers
(the “one” dataset) and classes
(the “many” dataset). Then, we’ll perform both m:1 and 1:m merges using both left and right joins, and show the results.
Here’s the R code:
# Load the required library
library(dplyr)
# Create the 'teachers' dataframe (the "one" dataset)
<- data.frame(
teachers TeacherID = c(1, 2),
TeacherName = c("Mr. Smith", "Mrs. Jones")
)
# Create the 'classes' dataframe (the "many" dataset)
<- data.frame(
classes ClassID = c("A", "B", "C"),
ClassName = c("Math", "Science", "English"),
TeacherID = c(1, 1, 2)
)
# m:1 Merge using left join (starting with 'classes' as the base)
<- left_join(classes, teachers, by = "TeacherID")
m1_merge_left
# 1:m Merge using left join (starting with 'teachers' as the base)
# This produces the same result as the m:1 merge but potentially with reordered columns
<- left_join(teachers, classes, by = "TeacherID")
one_m_merge_left
# m:1 Merge using right join (starting with 'classes' as the base)
<- right_join(classes, teachers, by = "TeacherID")
m1_merge_right
# 1:m Merge using right join (starting with 'teachers' as the base)
<- right_join(teachers, classes, by = "TeacherID")
one_m_merge_right
# Print results
list(
m1_merge_left = m1_merge_left,
one_m_merge_left = one_m_merge_left,
m1_merge_right = m1_merge_right,
one_m_merge_right = one_m_merge_right
)
$m1_merge_left
ClassID ClassName TeacherID TeacherName
1 A Math 1 Mr. Smith
2 B Science 1 Mr. Smith
3 C English 2 Mrs. Jones
$one_m_merge_left
TeacherID TeacherName ClassID ClassName
1 1 Mr. Smith A Math
2 1 Mr. Smith B Science
3 2 Mrs. Jones C English
$m1_merge_right
ClassID ClassName TeacherID TeacherName
1 A Math 1 Mr. Smith
2 B Science 1 Mr. Smith
3 C English 2 Mrs. Jones
$one_m_merge_right
TeacherID TeacherName ClassID ClassName
1 1 Mr. Smith A Math
2 1 Mr. Smith B Science
3 2 Mrs. Jones C English
Just like before, everything matched!
Some keys
cannot be matched
Now, let us see the behavior of left_join
when there’s no match for a record in either the left or the right table, and we want this being demonstrated using both dplyr
and base R.
Let’s slightly modify our datasets to have unmatched records:
- In the
teachers
dataframe, let’s add a teacher withTeacherID = 3
named “Mr. Doe” who doesn’t have any associated class. - In the
classes
dataframe, let’s add a class withClassID = "D"
named “History” withTeacherID = 4
, but there’s no teacher withTeacherID = 4
in theteachers
dataframe.
Now, let’s perform the left joins using both dplyr
and base R:
Using dplyr
:
- Perform
left_join
usingclasses
as the base. - Perform
left_join
usingteachers
as the base.
Using base
R:
- Use
merge
withall.x = TRUE
usingclasses
as the base. - Use
merge
withall.x = TRUE
usingteachers
as the base.
# Modify the datasets
<- rbind(teachers, data.frame(TeacherID = 3, TeacherName = "Mr. Doe"))
teachers_modified <- rbind(classes, data.frame(ClassID = "D", ClassName = "History", TeacherID = 4))
classes_modified
# Using dplyr:
<- left_join(classes_modified, teachers_modified, by = "TeacherID")
dplyr_classes_base <- left_join(teachers_modified, classes_modified, by = "TeacherID")
dplyr_teachers_base
# Using base R:
<- merge(classes_modified, teachers_modified, by = "TeacherID", all.x = TRUE)
base_classes_base <- merge(teachers_modified, classes_modified, by = "TeacherID", all.x = TRUE)
base_teachers_base
list(
dplyr_classes_base = dplyr_classes_base,
dplyr_teachers_base = dplyr_teachers_base,
base_classes_base = base_classes_base,
base_teachers_base = base_teachers_base
)
$dplyr_classes_base
ClassID ClassName TeacherID TeacherName
1 A Math 1 Mr. Smith
2 B Science 1 Mr. Smith
3 C English 2 Mrs. Jones
4 D History 4 <NA>
$dplyr_teachers_base
TeacherID TeacherName ClassID ClassName
1 1 Mr. Smith A Math
2 1 Mr. Smith B Science
3 2 Mrs. Jones C English
4 3 Mr. Doe <NA> <NA>
$base_classes_base
TeacherID ClassID ClassName TeacherName
1 1 A Math Mr. Smith
2 1 B Science Mr. Smith
3 2 C English Mrs. Jones
4 4 D History <NA>
$base_teachers_base
TeacherID TeacherName ClassID ClassName
1 1 Mr. Smith A Math
2 1 Mr. Smith B Science
3 2 Mrs. Jones C English
4 3 Mr. Doe <NA> <NA>
NOTES:
In both the dplyr
and base R results, we can observe that where there’s no match for a record in the left table, NA
values are filled in for columns from the right table. Which should be similar to the m:1
results from STATA’s picture below. (NOT EXACT THE SAME BUT SAME IDEA)
Reshaping
Base
R Methods:
reshape
Function:
Base R provides the reshape
function, which can convert data from wide to long format and vice versa.
Wide to Long:
# Sample data <- data.frame( data ID = 1:3, Time1 = c(5, 6, 7), Time2 = c(8, 9, 10) ) print(data)
ID Time1 Time2 1 1 5 8 2 2 6 9 3 3 7 10
# Reshaping to long format <- reshape(data, direction = "long", varying = list(c("Time1", "Time2")), long_data v.names = "Value", idvar = "ID", timevar = "Time") print(long_data)
ID Time Value 1.1 1 1 5 2.1 2 1 6 3.1 3 1 7 1.2 1 2 8 2.2 2 2 9 3.2 3 2 10
Long to Wide:
# Reshaping to wide format <- reshape(long_data, direction = "wide", v.names = "Value", wide_data idvar = "ID", timevar = "Time") print(wide_data)
ID Value.1 Value.2 1.1 1 5 8 2.1 2 6 9 3.1 3 7 10
Using data.table
:
The data.table
library offers an efficient and flexible approach to data reshaping, especially for large datasets.
Melting (Wide to Long):
library(data.table)
# Convert data frame to data table
<- as.data.table(data)
DT
# Melt to long format
<- melt(DT, id.vars = "ID", measure.vars = c("Time1", "Time2"),
melted_data variable.name = "Time", value.name = "Value")
print(melted_data)
ID Time Value
<int> <fctr> <num>
1: 1 Time1 5
2: 2 Time1 6
3: 3 Time1 7
4: 1 Time2 8
5: 2 Time2 9
6: 3 Time2 10
Casting (Long to Wide):
# Cast to wide format
<- dcast(melted_data, ID ~ Time, value.var = "Value")
casted_data print(casted_data)
Key: <ID>
ID Time1 Time2
<int> <num> <num>
1: 1 5 8
2: 2 6 9
3: 3 7 10
Remember, the choice between base R and data.table
methods often depends on your specific needs. data.table
is especially powerful for large datasets due to its efficiency, while base R can be simpler for basic reshaping tasks or for those who are more familiar with its syntax.
Dates
- Date data can be tricky due to various formats and conventions across the world.
- Proper handling is essential for chronological analysis, time series forecasting, and event tracking.
Sys.Date()
: Returns the current date.as.Date()
: Converts a character string into a Date object.format()
: Formats a Date object into a desired character representation.
Examples
# Current date
<- Sys.Date()
current_date print(current_date)
[1] "2025-09-17"
# Convert a string to a date
<- "2023-08-16"
date_str <- as.Date(date_str)
converted_date print(converted_date)
[1] "2023-08-16"
# Format a date
<- format(current_date, format="%B %d, %Y")
formatted_date print(formatted_date)
[1] "September 17, 2025"
# Define starting and ending dates
<- as.Date("2023-01-01")
start_date <- as.Date("2023-01-10")
end_date
# Create a sequence of dates
<- seq(start_date, end_date, by="days")
date_seq
# Convert the sequence to a data frame
<- data.frame(Date = date_seq)
date_df
# Print the data frame
print(date_df)
Date
1 2023-01-01
2 2023-01-02
3 2023-01-03
4 2023-01-04
5 2023-01-05
6 2023-01-06
7 2023-01-07
8 2023-01-08
9 2023-01-09
10 2023-01-10
Strings
Text data often contains noise in the form of special characters, inconsistencies in formatting, and more. Properly cleaning and manipulating such data is crucial.
- The
stringr
package in R offers a host of functions that can aid in this. gsub
inbase
R can do a lot similar functions, I will leave that to you for the future study. Some examples have been provided in the Appendix Section.
Removing Special Characters
- Special characters can be noise in some analyses.
- Removing them can simplify text and aid in other text processing tasks.
str_replace_all()
: Replaces all instances of a pattern in a string.
Examples
library(stringr)
<- "Hello, world! This is a test. #Test123"
text <- str_replace_all(text, "[^[:alnum:][:space:]]", "")
cleaned_text print(cleaned_text)
[1] "Hello world This is a test Test123"
String Matching
- Useful to detect if a string contains certain patterns or characters.
str_detect()
: Detects the presence or absence of a pattern in a string.str_which()
: Returns the indices of strings that match a pattern.str_match()
: Extract matched groups from a string based on a pattern.
Examples
# Detect if a string contains "world"
<- c("Hello world", "Hello R", "R is a world of statistics")
text print(str_detect(text, "world"))
[1] TRUE FALSE TRUE
# Get indices of strings that contain "R"
print(str_which(text, "R"))
[1] 2 3
# Extract matched groups
<- "(\\d{4})-(\\d{2})-(\\d{2})"
pattern <- "Today's date is 2023-08-16."
date_str print(str_match(date_str, pattern))
[,1] [,2] [,3] [,4]
[1,] "2023-08-16" "2023" "08" "16"
Case Conversion
- To ensure uniformity in text data.
str_to_upper()
: Converts strings to upper case.str_to_lower()
: Converts strings to lower case.str_to_title()
: Converts strings to title case.
Examples
<- "Hello, World!"
text
print(str_to_upper(text))
[1] "HELLO, WORLD!"
print(str_to_lower(text))
[1] "hello, world!"
print(str_to_title(text))
[1] "Hello, World!"
Splitting Strings
- To break a string into parts based on a delimiter.
str_split()
: Splits a string into parts.
Examples
<- "apple,banana,grape"
text print(str_split(text, ","))
[[1]]
[1] "apple" "banana" "grape"
Other stringr
functions in R with examples
str_length()
: Computes the length of a string.str_c()
: Concatenates strings.str_sub()
: Extracts or replaces substrings.
library(stringr)
# String length
print(str_length("Hello, world!"))
[1] 13
# Concatenate
print(str_c("Hello", "world", sep=", "))
[1] "Hello, world"
# Substring
print(str_sub("Hello, world!", 1, 5))
[1] "Hello"
Unique Identifiers in R
- A unique identifier (UID) is an identifier that ensures distinctness among all other items.
- UIDs are crucial for indexing, referencing, and joining datasets without confusion.
Functions to Identify Unique and Duplicate Values
unique()
: Returns a vector of unique values.
<- c(1, 2, 2, 3, 4, 4, 4, 5)
vec <- unique(vec)
unique_vals print(unique_vals)
[1] 1 2 3 4 5
duplicated()
: Returns a logical vector indicating whether an element is a duplicate.
<- c(1, 2, 2, 3, 4, 4, 4, 5)
vec <- duplicated(vec)
dupes print(dupes)
[1] FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
distinct()
from the dplyr
package: Used to remove duplicate rows from a data frame or tibble.
library(dplyr)
<- data.frame(name = c("Alice", "Bob", "Alice", "Charlie"), age = c(25, 30, 25, 35))
df <- distinct(df)
distinct_df print(distinct_df)
name age
1 Alice 25
2 Bob 30
3 Charlie 35
Generating Unique Identifiers
seq_along()
: Generate a sequence along an object’s length.
<- c("apple", "banana", "cherry")
vec <- seq_along(vec)
uids print(uids)
[1] 1 2 3
make.unique()
: Generates unique strings by appending numbers.
<- c("apple", "apple", "banana")
vec <- make.unique(vec)
unique_vec print(unique_vec)
[1] "apple" "apple.1" "banana"
Checking for Unique Identifiers with isid()
The isid()
function checks if a given set of variables uniquely identifies the observations in a dataset. Just like the one we are using in STATA
.
- You need to install
eeptools
package for this.
library(eeptools)
<- data.frame(id = c(1, 2, 3, 1), value = c(10, 20, 30, 40))
df # Check if 'id' uniquely identifies the data
isid(df, "id", verbose = TRUE)
Are variables a unique ID?
[1] FALSE
Variables define this many unique rows:
[1] 3
There are this many total rows in the data:
[1] 4
In this example, the isid()
function would return FALSE
because the ‘id’ variable does not uniquely identify each row in the dataset.
Considerations
- When working with large datasets or datasets that will be merged, it’s essential to ensure that the data’s unique identifiers remain consistent.
- Always check for the uniqueness of identifiers, especially before operations like joining or merging datasets, to prevent unintended duplications or omissions.
Basic Data Analysis and Data Visulizations in R
Preliminary Checks
Before diving into any data analysis, it’s essential to understand and clean your data. This involves checking for outliers, handling missing values, and visualizing data distributions.
For our demonstration, we’ll use the
mtcars
dataset.data(mtcars)
Checking for Outliers
Outliers can significantly affect regression results. A simple way to check for outliers is by using boxplots.
boxplot(mtcars$mpg, main="Boxplot of MPG", ylab="Miles Per Gallon")
Points outside the “whiskers” of the boxplot could be potential outliers.
Handling Missing Values
Data often comes with missing values, and it’s crucial to handle them appropriately.
Identifying Missing Values
Use
is.na()
to identify missing values:
<- is.na(mtcars$mpg)
missing_vals sum(missing_vals)
[1] 0
- Handling Strategies
- Remove rows with missing values:
na.omit()
- Impute missing values using mean, median, or a specific strategy.
- Remove rows with missing values:
Data Visualization
Visualizing your data can help in understanding distributions, relationships, and potential issues.
- Histogram: Understand the distribution of a variable.
hist(mtcars$mpg, main="Histogram of MPG", xlab="Miles Per Gallon", col="lightblue")
- Correlation Plot: Understand relationships between variables.
pairs(mtcars[, 1:4], main="Scatterplot Matrix")
- Density Plot: Another way to check the distribution.
ggplot(mtcars, aes(x=mpg)) +
geom_density(fill="blue", alpha=0.5) +
labs(title="Density Plot of MPG", x="Miles Per Gallon")
By conducting these preliminary checks, you ensure that your data is ready for deeper analysis, and any insights or results derived are more likely to be reliable.
Data Analysis and Data Visulization (mtcars
example)
Make sure you’ve installed the packages:
tidyverse
,lmtest
Done the first-round preliminary data check.
Loading the Libraries
library(tidyverse)
library(lmtest)
Running a Simple Regression in R
A regression allows us to understand relationships between variables. The simplest form is the linear regression, represented as:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
Where: - \(Y\) is the dependent variable. - \(X\) is the independent variable. - \(\beta_0\) is the intercept. - \(\beta_1\) is the slope. - \(\epsilon\) is the error term.
<- lm(mpg ~ wt, data = mtcars)
model summary(model)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The summary()
function provides a detailed summary of the regression results. Here, we are trying to predict mpg
(miles per gallon) using the weight (wt
) of the car, using the mtcars
dataset.
Accessing Regression Results
We can save the summary
of the model first:
<- summary(model) model_summary
And we can see that it is a list
! This is something we’ve already learned.
From this summary object, you can extract:
coefficients
: A matrix where each row represents a predictor (including the intercept) and columns provide details like estimate, standard error, t-value, and p-value.sigma
: The residual standard error.r.squared
: The R-squared value.adj.r.squared
: The adjusted R-squared value.fstatistic
: The F-statistic value and its degrees of freedom.
- Access the Coefficients
# Access using summary list
<- model_summary$coefficients
coeff_matrix print(coeff_matrix)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285126 1.877627 19.857575 8.241799e-19
wt -5.344472 0.559101 -9.559044 1.293959e-10
# Directly access using `coefficients` functions for regression objects
<- coefficients(model) # this will give you only the betas
betas print(betas)
(Intercept) wt
37.285126 -5.344472
- Access the Standard Errors
The standard errors of the coefficients can be extracted from the model’s summary object.
<- coeff_matrix[, "Std. Error"]
standard_errors print(standard_errors)
(Intercept) wt
1.877627 0.559101
- Access the t-statistics
<- summary(model)$coefficients[, "t value"]
t_values print(t_values)
(Intercept) wt
19.857575 -9.559044
- Access the p-values
<- summary(model)$coefficients[, "Pr(>|t|)"]
p_values print(p_values)
(Intercept) wt
8.241799e-19 1.293959e-10
4. Show Basic Regression Figures
Scatter Plot with Regression Line
Visualizing the relationship between the independent and dependent variables can be very insightful.
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point() +
geom_smooth(method="lm", col="red") +
labs(title="Relationship between Car Weight and MPG", x="Weight", y="Miles Per Gallon")
`geom_smooth()` using formula = 'y ~ x'
Residuals vs. Fitted Values
Checking the residuals can help diagnose potential issues with the model.
<- resid(model)
residuals <- fitted(model)
fitted_values
ggplot() +
geom_point(aes(x=fitted_values, y=residuals)) +
labs(title="Residuals vs. Fitted Values", x="Fitted Values", y="Residuals")
Comprehensive Reivew Project
Introduction
This comprehensive project is designed to review the concepts you’ve learned throughout the course. You’ll apply techniques from data cleaning to advanced data analysis. We’ll work with both built-in datasets and an external dataset.
Datasets
- Built-in Dataset:
mtcars
- External Dataset: “client_data.csv”
Workflows
Create a workflow for this project in the following steps:
Under lec3 folder, create a sub folder lec3_proj
Under lec3_proj folder, create two sub folders: code and data
Under data folder, create three sub folders: raw, temp, and cleaned
Your directories should look like this:
- lec3–>
- lec3_proj–>
- code–>
lec3_proj.Rmd
- data–>
- raw
- temp
- cleaned
- code–>
- lec3_proj–>
Tasks
- Preparation
- Download the
lec3_proj.Rmd
from GitHub into your code folder - It is already setup for you with the questions.
- Code in the code chunk.
- Set up your working directory to the folder
~/lec3/lec3_proj/code/
- Data Exploration and Cleaning:
- Generate an external dataset “client_data.csv” with the following code
set.seed(123)
# Number of clients
<- 200
n_clients
# 1. Generate 'name' column
<- data.frame(name = paste0("Client_", seq(1, n_clients)))
client_data
# 2. Generate 'car_bought' column. Initially, just a random sample
$car_bought <- sample(rownames(mtcars), n_clients, replace = TRUE)
client_data
# 3. Generate 'date_purchased' column
$date_purchased <- sample(seq(as.Date('2015/01/01'), as.Date('2022/01/01'), by="day"), n_clients, replace=TRUE)
client_data
# 4. Generate 'income' column
$income <- runif(n_clients, min=30000, max=150000)
client_data
# 5. Ensure richer clients are more likely to buy cars with more horsepower
# Sort mtcars by horsepower
<- rownames(mtcars[order(mtcars$hp), ])
sorted_cars
# Divide clients into groups and assign cars based on sorted horsepower
<- ceiling(n_clients / length(sorted_cars))
split_rows <- client_data[order(-client_data$income), ] # sort by income
client_data $car_bought <- rep(sorted_cars, times = split_rows)[1:n_clients]
client_data
# Show the transformed client_data
head(client_data)
# Write the data to a CSV
write.csv(client_data, "../data/raw/client_data.csv", row.names = FALSE)
- Load the built-in dataset
mtcars
. - Create a column namedID
, which is the same as the rownames. - Identify missing values using
sum
, andis.na()
function. - Visualize and treat any outliers in the
mpg
column of themtcars
dataset. - Hint: usingboxplot()
function, and use?boxplot()
to learn the syntax.
- Combining Datasets:
- Merge the datasets using
car_ID
, andcar_bought
. - Use both
merge
frombase
andjoin
fromdplyr
to combine. Discuss the differences. - Note: the merged dataset should be named as:
merged_data_base
merged_data_dplyr
- Merge the datasets using
- Data Reshaping:
- Use
data.table
to reshape the merged data set. - Save the reshaped data set into the
temp
folder. - Create a summary table calculating the mean
mpg
for each unique value inCar_ID
.
- Use
- Data Standardization:
- Standardize any date columns in
client_data.csv
. - Purchased Date:
as.Date()
- Client’s name: get rid of
Client_
usingstr_replace()
- Standardize any date columns in
- Unique Identifier Check:
- Check if the
name
is a unique identifier using: isid()
length(unique(DATA$UID))==nrow(DATA)
- Check if the
5.5. Save and Use Cleaned Data to Proceed - Save the “merged_data_base” into the cleaned
folder, named as: merged_data_base_cleaned.csv
Regression Analysis:
- Conducting the Regression:
- Using the merged dataset, run a regression predicting
mpg
from another continuous variable.mpg ~ wt
- Visualize the relationship between the chosen predictor and
mpg
using a scatter plot.plot(merged_data_base$wt, merged_data_base$mpg, ...)
abline()
- Using the merged dataset, run a regression predicting
- Matrix Operations with Coefficients:
- Extract the coefficients from the regression model and store them in a matrix.
- new matrix:
matrix()
- extract coefficients:
coefficients(model)
- new matrix:
- Create a 2x2 identity matrix.
- Use this:
identity_matrix <- diag(2)
- Use this:
- Use matrix multiplication (using
%*%
) to multiply the identity matrix by the coefficients matrix. The result will be the coefficients themselves.
- Extract the coefficients from the regression model and store them in a matrix.
- Loops for Analysis (HARD: Don’t Do it for Now):
- Write a
for
loop that iterates over the column names of the merged dataset (excludingmpg
). In each iteration, run a regression usingmpg
as the dependent variable and the current column as the independent variable. Store each coefficient in a vector. - After the loop, visualize the coefficients using a bar plot to see the impact of each variable on
mpg
.
- Write a
- Interpretation:
- Discuss the results. Which variables have the most substantial impact on
mpg
? Are the results consistent with expectations?
- Discuss the results. Which variables have the most substantial impact on
- Conducting the Regression:
Control Structures and Custom Functions:
- Use a
for
loop to calculate the mean of each numeric column in the merged dataset. (HARD: Don’t Do it now). Hints:- To calculate the mean of each numeric column in a dataset:
- Initialization: Start by creating an empty numeric vector to store the mean values.
- Looping: Use a for loop to iterate over each column in the dataset.
- Conditional Check: Within the loop, check if the current column is numeric. You can use the is.numeric() function for this.
- Calculation: If the column is numeric, calculate its mean using the mean() function. Make sure to handle any missing values. Append this mean value to your storage vector.
- Naming: After the loop completes, name each element in your storage vector with the corresponding column name.
- Output: Finally, print the named vector to display the mean values.
- Write a custom function that uses
ifelse
to categorizempg
into “Low”, “Medium”, “High”. Apply this function to the dataset.- Hint:
ifelse(..., ifelse(,...))
- Hint:
- Use a
while
loop to find the first row in the merged dataset wherempg
is above a certain threshold (e.g., 25).
- The
apply
Family:- Use
lapply
to calculate the range of each numeric column in the merged dataset. (HARD, don’t do it now) - hint:
range_function <- function(x) c(min=min(x, na.rm=TRUE), max=max(x, na.rm=TRUE))
- Use
sapply
to get the type of each column in the dataset. - hint:
sapply(data, class)
- Use
- Advanced Data Manipulation:
- Use
tidyverse
functions to filter rows, select columns, and arrange the dataset. - filter: mpg > 20
- select: mpg, wt, gear
- arrange: desc(wt)
- Use
data.table
to efficiently modify the dataset in place.
- Use
Challenging Question
- Integrative Analysis:
- Create a new column in the merged dataset that calculates the age of the car (assumes everyone bought a new car on the purchase_date). Use any necessary libraries/packages.
- Write a custom function that, given a column name, returns a list containing the mean, median, and standard deviation. Apply this to multiple columns using the
sapply
function. - Use control structures to find the average
mpg
for cars that are above and below the median weight. Compare the results and provide an interpretation.
Solution
You can find the solution here.
Appendix
gsub
The gsub()
function is part of base R, and it’s a powerful tool for replacing patterns in strings. gsub()
stands for “global substitution”. It searches for all matches of a pattern in a string and replaces them with a specified replacement string.
Syntax
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
pattern
: The pattern to search for.replacement
: The string to replace the pattern with.x
: The input string.ignore.case
: Should the match be case-insensitive?perl
: Should Perl-compatible regex be used?fixed
: If TRUE, pattern is a string to be matched as is (turns off special characters).useBytes
: Should bytes be used for matching (relevant for non-ASCII strings)?
Examples
- Basic Substitution Replacing “cat” with “dog”:
<- "The cat sat on the mat."
text <- gsub("cat", "dog", text)
new_text print(new_text)
[1] "The dog sat on the mat."
- Removing Special Characters Here’s an example using
gsub()
to remove special characters:
<- "Hello, world! This is a test. #Test123"
text <- gsub("[^[:alnum:][:space:]]", "", text)
cleaned_text print(cleaned_text)
[1] "Hello world This is a test Test123"
- Case-Insensitive Replacement Replacing “world” with “R”, ignoring case:
<- "Hello World"
text <- gsub("world", "R", text, ignore.case = TRUE)
new_text print(new_text)
[1] "Hello R"
- Replace Multiple Spaces with Single Space
<- "This has multiple spaces."
text <- gsub("\\s+", " ", text)
new_text print(new_text)
[1] "This has multiple spaces."
gsub()
is one of the primary string manipulation functions in base R and is often used in scenarios where you don’t want to or can’t rely on external packages.
Reference
Dr. Qingxiao Li’s notes for R-Review 2020
Rodrigo Franco’s notes for R-Review 2021