This article collects code snippets for the R programming language. I am fairly new to R, and documenting the snippets helps me understand it better. A word of caution: the snippets are not neatly organized. This is a personal reference for keeping track of the concepts I have learned in R.
Clear variables from the workspace
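A minimal sketch, assuming the goal is to remove every object from the global environment (hidden objects whose names start with a dot are left alone):

```r
# Remove all objects from the global environment (the workspace)
rm(list = ls())
```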
Clear Plots
```r
# Close the RStudio plot device, which clears the Plots pane
dev.off(dev.list()["RStudioGD"])
```
Basics
Installing libraries
```r
install.packages("ggplot2")
```
Loading in library
As an example, the `ggplot2` library can be loaded as follows:
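A minimal sketch: `library()` attaches an installed package for the current session. The `requireNamespace()` guard is an addition for illustration, so that the snippet does not stop with an error when ggplot2 is missing:

```r
# Attach ggplot2 if it is installed; otherwise print a hint instead of failing
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```

In an interactive session where the package is known to be installed, a plain `library(ggplot2)` is enough.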
Printing
```r
# Example of using sprintf() inside print()
x <- 10
print(sprintf("The value of x is %d", x))
```
```r
# Example of using paste() inside print()
name <- "John"
age <- 30
print(paste("Name:", name, ", Age:", age))
```
Conditionals
```r
# Example of an else if statement
x <- 10
if (x > 20) {
  print("x is greater than 20")
} else if (x > 10) {
  print("x is greater than 10 but less than or equal to 20")
} else {
  print("x is less than or equal to 10")
}
```
While loops
```r
# Example of a while loop
i <- 1
while (i <= 5) {
  print(i)
  i <- i + 1
}
```
For loops
```r
# Example of a for loop
for (i in 1:5) {
  print(i)
}
```
```r
# Looping over elements of a vector
element_vector <- c("a", "b", "c", "d", "e")
for (element in element_vector) {
  print(element)
}
```
Creating a sequence of numbers given a range
```r
# To create a sequence from 1 to 10 in steps of 1
sequence <- seq(1, 10, by = 1)
print(sequence)
```
Another way to create a sequence is with the `:` operator:

```r
# Create a sequence from 1 to 100
i <- 1:100
```
Creating vectors
```r
# Creating empty vectors
empty_vec <- c()
# Creating an empty vector of a specific length and type
empty_vector <- vector("numeric", length = 5)
# Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
# Creating a character vector
character_vector <- c("a", "b", "c", "d", "e")
# Creating a logical vector
logical_vector <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
```
Creating matrices
Creating an empty matrix filled with zeros (or any specific values)
```r
num_rows <- 3; num_cols <- 4  # example dimensions
mat <- matrix(0, nrow = num_rows, ncol = num_cols)
```
Creating matrices by stacking vectors (vertically and horizontally)
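```r
col1 <- c(1, 4, 7)
col2 <- c(2, 5, 8)
col3 <- c(3, 6, 9)
# cbind() places the vectors side by side as columns
mat_cbind <- cbind(col1, col2, col3)
# rbind() stacks the same vectors on top of each other as rows
mat_rbind <- rbind(col1, col2, col3)
```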
Generating samples from a probability distribution
Reference: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Uniform
```r
# Sample 20 values from a uniform distribution which ranges from -1 to 1
x <- runif(20, min = -1, max = 1)
```
Generating samples from a normal distribution
Reference: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Normal
```r
# Sample 100 values from a normal distribution with a specified mean and standard deviation
x <- rnorm(n = 100, mean = 68.5, sd = 5.7)
```
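Because these functions draw random numbers, the values change on every run. Calling `set.seed()` first makes the draws reproducible. A minimal sketch (the seed value 42 is an arbitrary choice):

```r
# Fix the random number generator state (42 is an arbitrary choice)
set.seed(42)
first_draw <- rnorm(n = 5, mean = 68.5, sd = 5.7)

# Resetting the same seed reproduces exactly the same values
set.seed(42)
second_draw <- rnorm(n = 5, mean = 68.5, sd = 5.7)

# Check that both draws match element by element
identical(first_draw, second_draw)
```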
Operations on dataframe
Getting a summary for the data frame
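A minimal sketch using `summary()`, with the built-in `mtcars` data set standing in for `data_frame` (an assumption for illustration):

```r
# mtcars ships with R; it stands in for your own data frame here
data_frame <- mtcars

# Per-column summary: min, quartiles, median, mean, and max for numeric columns
summary(data_frame)

# str() is a compact alternative showing each column's type and first values
str(data_frame)
```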
Getting the number of columns in a data frame
```r
# Get the number of columns in the data frame
num_cols_df <- ncol(data_frame)
print(num_cols_df)
```
Getting the number of rows in a data frame
```r
# Get the number of rows in the data frame
num_rows_df <- nrow(data_frame)
print(num_rows_df)
```
ggplot commands
Basic ggplot plot
Specify the data: In ggplot, the first argument is the data frame, and the second argument, wrapped in `aes()`, specifies which columns of the data frame are used for the x-axis and y-axis. If the type of plot involves only a single column (a histogram, for example), we only need to pass in one column as the x-axis.

```r
ggplot(data, aes(x = price))
```
Specify the type of plot: Then we add a geometry to specify the type of plot. We can pass additional parameters as arguments to control the look of the plot.

```r
# Plot a histogram with binwidth = 50 and specify the edge and fill colors
ggplot(data, aes(x = price)) +
  geom_histogram(binwidth = 50, col = '#9683F5', fill = '#D2CDE9')
```
As an example, here we are plotting a histogram with `geom_histogram`. There are way too many parameters to go through. The best approach is to look up the documentation when you need to implement a specific thing: https://ggplot2.tidyverse.org/reference/geom_histogram.html
Clean themes
This is kinda subjective and varies from person to person. But I normally use the following code snippets as a theme. This code snippet needs to be varied for different types of plots.
```r
# Append this to a plot with `+`
theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.major.y = element_line(linetype = "dashed", color = "black"),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(color = "white"))
```
2D density contour plot with heatmap overlay
```r
# Scatter plot with a 2D-binned heatmap layer
m <- ggplot(data, aes(x = HEIGHT, y = WEIGHT)) +
  geom_point() +
  stat_bin2d(bins = 80) +
  scale_fill_gradient(low = "lightblue", high = "red")
# Overlay the 2D density contours
m + geom_density_2d()
```
Save an image with ggplot
```r
# Save the last plot as "plot.png", 5 x 5 inches (inches are the default unit)
ggsave("plot.png", width = 5, height = 5)
```
Statistical Concepts
Regression
Fitting a simple regression model
The following code fits a simple linear regression model of the form:
\[y= \beta_0+\beta_1x+ \epsilon\]
```r
data <- read.csv("data.csv")
# Fit the regression model
model <- lm(y ~ x, data = data)
# Summary of the model
summary(model)
# Coefficients:
coefficients(model)
```
Include non-linear terms in the regression model
Non-linear regression terms can be introduced with the help of the `poly()` function. The following code snippet fits a regression of the following form:
\[y=\beta_0+\beta_1 x+\beta_2 x^2+ \beta_3 x^3+ \beta_4 x^4+ \beta_5 x^5+ \epsilon\]
```r
data <- read.csv("data.csv")
model <- lm(y ~ poly(x, 5), data = data)
# Summary
summary(model)
```
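One detail worth knowing: by default `poly()` builds orthogonal polynomials, so the printed coefficients are not the $\beta$ values of the raw powers of $x$; passing `raw = TRUE` gives those. A small sketch on simulated data (the data and variable names are made up for illustration, and a degree of 2 is used to keep it short) suggests both versions produce the same fitted values:

```r
# Simulate a quadratic relationship with some noise
set.seed(1)
x <- seq(-2, 2, length.out = 50)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(50, sd = 0.3)
sim_data <- data.frame(x = x, y = y)

# Orthogonal polynomial terms (the default) vs. raw powers of x
model_orth <- lm(y ~ poly(x, 2), data = sim_data)
model_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = sim_data)

# The two parameterizations describe the same fitted curve
all.equal(unname(fitted(model_orth)), unname(fitted(model_raw)))

# With raw = TRUE the coefficients estimate the intercept and the x, x^2 terms directly
coef(model_raw)
```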
Detecting multicollinearity
To detect multicollinearity in a regression, the variance inflation factor (VIF) can be computed for each predictor. As a rule of thumb, $\text{VIF}>10$ indicates problematic multicollinearity.
```r
library(car)
# Fit a regression model ....
# (vif() needs a model with at least two predictors)
# Compute VIF
vif_result <- vif(model)
# Print VIF values
print(vif_result)
```
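For intuition, the VIF of predictor $j$ is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing that predictor on the remaining ones. A base-R sketch on the built-in `mtcars` data (the choice of predictors is arbitrary and only for illustration):

```r
# Predictors of a hypothetical model: mpg ~ disp + hp + wt
predictors <- c("disp", "hp", "wt")

# Regress each predictor on the others and convert R^2 into a VIF
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

print(manual_vif)
```

For a model with the same numeric predictors, these values should agree with what `vif()` from the car package reports.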
Personalized snippets
Split continuous numerical variables in different classes
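```r
# Bin edges chosen to match the labels below (last bin is open-ended)
breaks <- c(10, 20, 40, 60, 80, Inf)
labels <- c("10-20", "21-40", "41-60", "61-80", "81+")
# Create a new column with age groups
# right = TRUE closes each bin on the right, so age 20 falls in "10-20";
# include.lowest = TRUE keeps age 10 in the first bin
data$age_group <- cut(data$AGE, breaks = breaks, labels = labels,
                      right = TRUE, include.lowest = TRUE)
```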
In the above example code, `AGE` is a continuous variable. A new column is created by assigning each value to an age group.
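A quick check on a few made-up ages (hypothetical values, chosen to sit on the bin edges) shows which group each one lands in. The breaks and the right-closed intervals used here are assumptions picked so that the bins agree with the labels:

```r
# Hypothetical ages sitting on or next to the bin edges
ages <- c(10, 20, 21, 40, 41, 80, 81, 95)

# Intervals are closed on the right: [10,20], (20,40], (40,60], (60,80], (80,Inf]
breaks <- c(10, 20, 40, 60, 80, Inf)
labels <- c("10-20", "21-40", "41-60", "61-80", "81+")

age_group <- cut(ages, breaks = breaks, labels = labels,
                 right = TRUE, include.lowest = TRUE)

# Show each age next to its assigned group
print(data.frame(age = ages, age_group = age_group))
```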
Get a random subset from a data frame
```r
# Draw k distinct row indices at random and keep only those rows
k <- 10
random_indices <- sample(nrow(data), k)
subset_dataset <- data[random_indices, ]
```