# Add code here to load all the required libraries with `library()`.
# Do not include any `install.packages()` calls for required packages in this Rmd file.
# (Install any missing packages once from the console, not from this file.)
library(robustbase)
library(skimr)
library(forcats)
library(knitr)
library(tibble)
library(kableExtra)
library(formattable)
library(DT)
library(corrplot)
library(Hmisc) #for median
library(stringdist)
library(ggplot2)
library(dplyr)
library(tidyr)
library(caret)
library(randomForest)
library(tree)
library(vcd) # for assocstats(), used below to compute Cramer's V
# Only change the value for SID
# Assign your student id into the variable SID, for example:
SID <- 2348513 # This is an example; replace it with your actual ID
SIDoffset <- (SID %% 50) + 1 # Your SID mod 50 + 1
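setwd("C:/Users/SLL807/Desktop/Assignment")
load("car-analysis-data.Rda")
# View() needs the data to be loaded first and only works in an interactive session
if (interactive()) View(cars.analysis)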
# Now subset the car data set
# Pick every 50th observation starting from your offset
# Put into your data frame named mydf (you can rename it)
mydf <- cars.analysis[seq(from=SIDoffset,to=nrow(cars.analysis),by=50),]
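The offset arithmetic can be checked on a small stand-in table. This is only a sketch: `toy` is a hypothetical 120-row data frame, not the real `cars.analysis`, and the SID is the example value from the template.

```r
# Hypothetical stand-in for cars.analysis: 120 rows with an id column
toy <- data.frame(id = 1:120)

SID <- 2348513                 # example ID from the template
SIDoffset <- (SID %% 50) + 1   # 2348513 %% 50 is 13, so the offset is 14

# Every 50th row, starting at the offset
rows <- seq(from = SIDoffset, to = nrow(toy), by = 50)
toy_subset <- toy[rows, , drop = FALSE]

print(rows)  # 14 64 114
```

With a different SID the starting row shifts, which is why two students rarely end up with the same subset.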
1- Summary Statistics: use summary(), str(), head(), and tail() to get an overview of the data.
2- Missing Values: identify missing values with is.na().
3- Data Imputation: fill missing values, e.g. with impute() or the column median.
4- Duplicates: find duplicated rows with duplicated(); inspect distinct values with unique().
5- Data Types: check each column's type with class() and convert where needed, e.g. with as.numeric() or as.factor().
6- Data Normalization: merge values that differ only in spelling or case, such as "my" and "My".
7- Outlier Detection: use boxplots, and remove or correct values that are implausible.
8- Correlation Analysis: evaluate correlations between numeric variables with cor() to check for dependencies and multicollinearity.
9- Data Distribution: visualize distributions with histograms, boxplots, or ggplot2.
10- Descriptive Statistics: calculate descriptive statistics for numerical variables with summary() and related functions to understand central tendency and variability.
11- Documentation and Reporting: document all findings, transformations, and decisions clearly.
12- Data Consistency and Domain-Specific Checks: assess consistency between related columns to ensure coherent, logical relationships.
13- Cross-Field Validation: validate relationships between different fields.
# 1. Summary func to check the data set
print("Checking summary statistics to get to know the dataset")
[1] "Checking summary statistics to get to know the dataset"
cat("\n")
summary_table <- summary(mydf)
# Convert summary output to an HTML table with specified styling for all rows
summary_table_html <- kable(summary_table, digits = 2, format = "html") %>%
kable_styling(full_width = FALSE) %>%
row_spec(0:nrow(summary_table), background = "#000000", color = "#FFFFFF")
summary_table_html
brand | year | mileage | engine_size | automatic_transmission | fuel | drivetrain | min_mpg | max_mpg | damaged | first_owner | navigation_system | bluetooth | third_row_seating | heated_seats | price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Length:410 | Min. :1953 | Min. : 0 | Min. :1.200 | Min. :0.0000 | Length:410 | Length:410 | Min. : 0.00 | Min. :-30.00 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. : 5850
Class :character | 1st Qu.:2015 | 1st Qu.: 21328 | 1st Qu.:2.000 | 1st Qu.:1.0000 | Class :character | Class :character | 1st Qu.:18.00 | 1st Qu.: 24.00 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:1.0000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.:18991
Mode :character | Median :2019 | Median : 43286 | Median :2.400 | Median :1.0000 | Mode :character | Mode :character | Median :21.00 | Median : 28.00 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :1.0000 | Median :0.0000 | Median :0.0000 | Median :28512
NA | Mean :2017 | Mean : 48382 | Mean :2.663 | Mean :0.9195 | NA | NA | Mean :21.42 | Mean : 28.15 | Mean :0.2334 | Mean :0.5037 | Mean :0.4439 | Mean :0.8707 | Mean :0.0878 | Mean :0.4341 | Mean :28399
NA | 3rd Qu.:2021 | 3rd Qu.: 66593 | 3rd Qu.:3.400 | 3rd Qu.:1.0000 | NA | NA | 3rd Qu.:25.00 | 3rd Qu.: 32.00 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:1.0000 | 3rd Qu.:37564
NA | Max. :2023 | Max. :190312 | Max. :6.400 | Max. :1.0000 | NA | NA | Max. :89.00 | Max. :100.00 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :54995
NA | NA | NA | NA's :20 | NA | NA | NA | NA's :55 | NA's :55 | NA's :3 | NA's :7 | NA | NA | NA | NA | NA
str(mydf)
'data.frame': 410 obs. of 16 variables:
$ brand : chr "Audi" "Kia" "Ford" "Jeep" ...
$ year : num 2022 2021 2021 2022 2019 ...
$ mileage : num 7232 60942 45701 2963 47587 ...
$ engine_size : num 2 1.6 3 3.6 2 5.6 2 3.6 1.6 2 ...
$ automatic_transmission: num 1 1 1 1 1 1 1 1 1 1 ...
$ fuel : chr "Petrol" "Petrol" "Petrol" "Petrol" ...
$ drivetrain : chr "Four-wheel Drive" "Front-wheel Drive" "Four-wheel Drive" "Four-wheel Drive" ...
$ min_mpg : num 28 27 18 21 22 13 0 17 22 24 ...
$ max_mpg : num 36 37 24 29 29 18 14 25 25 33 ...
$ damaged : num 0 1 0 0 0 0 0 1 0 0 ...
$ first_owner : num 1 1 1 1 1 1 1 1 0 1 ...
$ navigation_system : num 0 0 1 0 0 1 0 1 1 0 ...
$ bluetooth : num 1 1 1 0 1 1 1 1 1 1 ...
$ third_row_seating : num 0 0 1 0 0 1 0 0 0 0 ...
$ heated_seats : num 1 0 1 0 0 0 0 0 1 0 ...
$ price : num 37500 15990 46290 44290 28990 ...
# 2. Check for missing values in all columns
missing_values <- colSums(is.na(mydf))
missing_df <- data.frame(variable = names(missing_values), missing_count = missing_values)
missing_df <- missing_df[order(-missing_df$missing_count), ] # Sort by missing count
# Visualize missing values
ggplot(data = missing_df, aes(x = reorder(variable, -missing_count), y = missing_count)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(title = "Missing Values per Variable", x = "Variable", y = "Missing Count")
# Issue 1 found: missing values
# Compare against 0 so that any() receives a logical vector rather than counts
if (any(missing_values > 0)) {
  print("Rows contain NA. Imputing NA with column median")
}
[1] "Rows contain NA. Imputing NA with column median"
# 3. Resolve missing values by median imputation
for (col in names(mydf)) {
  # Only numeric columns with missing values are imputed; a median is not defined for text
  if (is.numeric(mydf[[col]]) && any(is.na(mydf[[col]]))) {
    # Median of the column, excluding NA values
    col_median <- median(mydf[[col]], na.rm = TRUE)
    # Replace NA values with the median
    mydf[[col]][is.na(mydf[[col]])] <- col_median
  }
}
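The imputation step can be tried on a few made-up rows before running it on the full data. This is a minimal sketch; `toy` and `impute_median()` are hypothetical names, and the values are invented for illustration.

```r
# Hypothetical toy data; column names mirror the data set but values are made up
toy <- data.frame(
  engine_size = c(2.0, NA, 1.6, 3.0),
  damaged     = c(0, 1, NA, 0),
  fuel        = c("Petrol", "Diesel", NA, "Petrol"),
  stringsAsFactors = FALSE
)

# Replace NA with the column median, for numeric columns only
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy_imputed <- impute_median(toy)
print(toy_imputed)
```

The character column keeps its NA, which makes it obvious that text columns need a separate decision (for example an explicit "Unknown" level). For a 0/1 column, the median acts like the most common value as long as the two values are not evenly split.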
# 4. Finding data duplicated ROWS
duplicate_rows <- mydf[duplicated(mydf), ]
if(nrow(duplicate_rows)>0){
print("Duplicate rows found")
print(duplicate_rows)
} else{
print("No Duplicate Rows Found")
}
[1] "No Duplicate Rows Found"
# 5. Correcting Data Types by looking at str() and unique() result:
# brand, automatic_transmission, fuel, drivetrain, damaged, first_owner, navigation_system, bluetooth, third_row_seating, heated_seats are ALL CATEGORICAL
mydf$brand <- as.factor(mydf$brand)
mydf$automatic_transmission <- as.factor(mydf$automatic_transmission)
mydf$fuel <- as.factor(mydf$fuel)
mydf$drivetrain <- as.factor(mydf$drivetrain)
mydf$damaged <- as.factor(mydf$damaged)
mydf$first_owner <- as.factor(mydf$first_owner)
mydf$navigation_system <- as.factor(mydf$navigation_system)
mydf$bluetooth <- as.factor(mydf$bluetooth)
mydf$third_row_seating <- as.factor(mydf$third_row_seating)
mydf$heated_seats <- as.factor(mydf$heated_seats)
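The same conversion can be written once for a vector of column names instead of one line per column. The sketch below uses a hypothetical three-row `toy` data frame; `factor_cols` is an illustrative name.

```r
# Hypothetical toy data with two columns that should be categorical
toy <- data.frame(
  brand   = c("Audi", "Kia", "Ford"),
  damaged = c(0, 1, 0),
  price   = c(37500, 15990, 46290),
  stringsAsFactors = FALSE
)

# Convert every listed column to a factor in one step
factor_cols <- c("brand", "damaged")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Keeping the column names in one vector also makes it easy to reuse the same list later for bar plots or association tests.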
##########################################
# 6. Performing Data Normalization/Data Consistency (e.g. My == my)
# Issues found:
# (i). "Pertol" is a misspelling of "Petrol" in fuel; merge the two levels
mydf$fuel <- fct_collapse(mydf$fuel, Petrol = c("Pertol", "Petrol"))
# (ii). Unknown values are left as they are for transparency.
# (iii). Numerical columns such as price, max_mpg, min_mpg and mileage can never be negative, so take absolute values
mydf$year <- abs(mydf$year)
mydf$mileage <- abs(mydf$mileage)
mydf$engine_size <- abs(mydf$engine_size)
mydf$min_mpg <- abs(mydf$min_mpg)
mydf$max_mpg <- abs(mydf$max_mpg)
mydf$price <- abs(mydf$price)
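Taking absolute values assumes that a negative entry is a sign typo with the right magnitude. The sketch below contrasts that with an alternative that treats the value as unknown; the vector is hypothetical, with -30 mirroring the negative minimum reported for max_mpg in the summary table.

```r
# Hypothetical values; -30 mirrors the negative minimum seen in max_mpg
max_mpg <- c(36, 37, -30, 29)

# Count the negatives first, so the decision can be reported
n_negative <- sum(max_mpg < 0, na.rm = TRUE)

# Option used in this analysis: assume a sign typo and keep the magnitude
fixed_abs <- abs(max_mpg)

# Alternative: treat the value as unknown and leave it for imputation
fixed_na <- replace(max_mpg, which(max_mpg < 0), NA)

print(n_negative)
print(fixed_abs)
print(fixed_na)
```

If the NA route were chosen, it would have to run before the median imputation step so that the new gaps are filled as well.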
# (iv). Left unchanged:
#   (a) mileage > 0 but max_mpg and min_mpg = 0: assume the car is not working
#   (b) mileage = 0 but max_mpg and min_mpg > 0: assume the mpg figures are the commonly quoted values
# (v). max_mpg must not be less than min_mpg; otherwise impute from rows for the same car
#(vi) DATA CONSISTENCY CHECKS
#Check if 'year' values are within a reasonable range
invalid_years <- mydf$year < 1900 | mydf$year > 2050
# Check for negative mileage or unrealistically high values
invalid_mileage <- mydf$mileage < 0 | mydf$mileage > 500000
# Check for engine sizes that seem unrealistic
invalid_engine_size <- mydf$engine_size <= 0 | mydf$engine_size > 100
# Check for values outside expected range for min and max MPG
invalid_min_max_mpg <- mydf$min_mpg < 0 | mydf$max_mpg < 0 | mydf$min_mpg > mydf$max_mpg
# Check if 'price' values are negative or too high
invalid_price <- mydf$price < 0 | mydf$price > 1e6
# Check for inconsistencies between boolean columns (should be 0 or 1)
invalid_boolean_columns <- mydf[, c("automatic_transmission", "damaged", "first_owner",
"navigation_system", "bluetooth", "third_row_seating",
"heated_seats")]
invalid_boolean_columns <- invalid_boolean_columns !=0 & invalid_boolean_columns != 1
inconsistency_matrix <- cbind(
as.integer(invalid_years),
as.integer(invalid_mileage),
as.integer(invalid_engine_size),
as.integer(invalid_min_max_mpg),
as.integer(invalid_price),
as.integer(rowSums(invalid_boolean_columns))
)
# Identify rows with any inconsistencies
inconsistent_rows <- which(rowSums(inconsistency_matrix) > 0)
# Display rows with inconsistencies
if (length(inconsistent_rows) > 0) {
print("Inconsistent rows:")
print(mydf[inconsistent_rows, ])
} else {
print("No inconsistencies found.")
}
[1] "No inconsistencies found."
print(unique(mydf$fuel))
[1] Petrol GPL Hybrid Unknown Electric Diesel
Levels: Diesel Electric GPL Hybrid Petrol Unknown
7. Data visualisations for both categorical and numeric data
#7. Visualize outliers for numeric columns using boxplots
numerical_columns <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")
#(i) BOXPLOT FOR NUMERICAL DATA
for (col in numerical_columns) {
boxplot(mydf[[col]], main = col, ylab = col, col = "skyblue", border = "black", notch = TRUE)
}
#(ii) BAR PLOT FOR CATEGORICAL DATA
categorical_columns <- c("brand", "automatic_transmission", "fuel", "drivetrain", "damaged",
"first_owner", "navigation_system", "bluetooth", "third_row_seating",
"heated_seats")
for (col in categorical_columns) {
p<- ggplot(mydf, aes_string(x = col)) +
geom_bar(fill = "skyblue") +
labs(title = paste("Bar plot of", col), x = col, y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
print(p)
}
################################################
#8. OUTLIER DETECTION
# Using boxplot.stats to identify outliers
for (col in numerical_columns) {
box_data <- boxplot.stats(mydf[[col]])
outliers <- box_data$out
cat("Outliers in", col, ":", outliers, "\n")
if(length(outliers)==0){
print("No outliers")
}
}
Outliers in year : 1953 2005 1968 1998 1985 2005 1970 2003 2005
Outliers in mileage : 190312 143404 135750 154744 135629 168155 135201 150084 172844
Outliers in engine_size : 5.6 5.3 6.2 5.7 6.2 5.3 5.6 5.7 5.6 6 5.3 5.6 5.6 6.4 5.7 5.6 6.2
Outliers in min_mpg : 0 35 43 48 38 0 0 0 0 49 11 89 43 48 0
Outliers in max_mpg : 14 43 48 41 0 0 55 100 44 51 0
Outliers in price :
[1] "No outliers"
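For reference, the rule behind `boxplot.stats()` can be reproduced by hand: with the default `coef = 1.5`, a value is flagged when it lies more than 1.5 times the hinge spread beyond a hinge. The vector below is hypothetical.

```r
# Hypothetical values with one obvious extreme
x <- c(10, 12, 11, 13, 12, 11, 50)

stats <- boxplot.stats(x)            # default coef = 1.5
hinges <- stats$stats[c(2, 4)]       # lower and upper hinge

# Fences sit 1.5 hinge spreads outside the hinges
fence_width <- 1.5 * diff(hinges)
lower_fence <- hinges[1] - fence_width
upper_fence <- hinges[2] + fence_width

manual_out <- x[x < lower_fence | x > upper_fence]

print(stats$out)    # 50
print(manual_out)   # 50
```

Because the fences depend only on the hinges, a heavily skewed column such as mileage can show many flagged values that are large but still plausible.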
#8. Cross-Field Analysis and Correlation
# Selecting numerical columns for correlation analysis
numeric_columns <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")
# Subsetting the dataframe with only numeric columns
numeric_data <- mydf[, numeric_columns]
# Calculating correlation matrix
correlation_matrix <- cor(numeric_data)
# Formatting the correlation matrix for better spacing and alignment
formatted_matrix <- format(correlation_matrix, justify = "centre", digits = 2)
# Printing the formatted matrix
print(formatted_matrix)
year mileage engine_size min_mpg max_mpg price
year " 1.00000" "-0.31756" "-0.07766" " 0.04943" " 0.01251" " 0.43858"
mileage "-0.31756" " 1.00000" " 0.15340" "-0.01624" "-0.00046" "-0.60776"
engine_size "-0.07766" " 0.15340" " 1.00000" "-0.32737" "-0.38183" " 0.30279"
min_mpg " 0.04943" "-0.01624" "-0.32737" " 1.00000" " 0.93285" "-0.17538"
max_mpg " 0.01251" "-0.00046" "-0.38183" " 0.93285" " 1.00000" "-0.21746"
price " 0.43858" "-0.60776" " 0.30279" "-0.17538" "-0.21746" " 1.00000"
#############################################################################
Identified NAs and imputed them with each column’s median.
Checked for duplicated rows but did not find any.
Checked the structure of the data, found several columns that should be categorical, and converted them to factors.
Found data inconsistencies, such as the misspelling “Pertol” for “Petrol” in fuel and a negative value in max_mpg (minimum of -30).
Performed consistency checks on both numeric and categorical variables: (a) numeric columns cannot be negative or unrealistic, (b) engine size must not be 0 and max_mpg must not be less than min_mpg, (c) binary categorical variables must only take the values 0 or 1.
Detected outliers in numeric data using boxplots and boxplot.stats(); the most notable were min_mpg = 89 and max_mpg = 100, which lay farthest from the median.
Correlation analysis showed the highest correlation, about 0.93, between max_mpg and min_mpg.
Left “Unknown” values as they are to avoid introducing bias.
9.(i) Grouped data by “brand” with respect to the means of each numerical variable.
4. Distribution of ‘price’:
- Relationship between ‘price’ and other numerical variables using cor().
- ANOVA to explore how ‘price’ varies across categories of categorical predictors.
- Visualisation with other variables.
#1. Data Familiarization
# Structure of the dataset
str(mydf)
'data.frame': 410 obs. of 16 variables:
$ brand : Factor w/ 25 levels "Alfa","Audi",..: 2 12 7 11 1 20 4 4 9 1 ...
$ year : num 2022 2021 2021 2022 2019 ...
$ mileage : num 7232 60942 45701 2963 47587 ...
$ engine_size : num 2 1.6 3 3.6 2 5.6 2 3.6 1.6 2 ...
$ automatic_transmission: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ fuel : Factor w/ 6 levels "Diesel","Electric",..: 5 5 5 5 5 5 5 5 5 5 ...
$ drivetrain : Factor w/ 5 levels "2WD","Four-wheel Drive",..: 2 3 2 2 2 4 3 3 3 2 ...
$ min_mpg : num 28 27 18 21 22 13 0 17 22 24 ...
$ max_mpg : num 36 37 24 29 29 18 14 25 25 33 ...
$ damaged : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
$ first_owner : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
$ navigation_system : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 2 1 ...
$ bluetooth : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 2 ...
$ third_row_seating : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
$ heated_seats : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 1 ...
$ price : num 37500 15990 46290 44290 28990 ...
# Loop through each column and output summary
for (col in names(mydf)) {
cat("Summary for column:", col, "\n")
print(summary(mydf[[col]]))
cat("\n")
}
Summary for column: brand
Alfa Audi BMW Cadillac Chevrolet FIAT Ford
16 23 9 21 12 24 12
Honda Hyundai Jaguar Jeep Kia Land Lexus
28 29 19 13 11 11 14
Maserati Mazda Mercedes-Benz MINI Mitsubishi Nissan Porsche
21 18 11 24 17 21 8
Suzuki Toyota Volkswagen Volvo
4 14 19 11
Summary for column: year
Min. 1st Qu. Median Mean 3rd Qu. Max.
1953 2015 2019 2017 2021 2023
Summary for column: mileage
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 21328 43286 48382 66593 190312
Summary for column: engine_size
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.20 2.00 2.40 2.65 3.30 6.40
Summary for column: automatic_transmission
0 1
33 377
Summary for column: fuel
Diesel Electric GPL Hybrid Petrol Unknown
2 7 5 13 380 3
Summary for column: drivetrain
2WD Four-wheel Drive Front-wheel Drive Rear-wheel Drive Unknown
1 199 130 77 3
Summary for column: min_mpg
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 19.00 21.00 21.37 24.00 89.00
Summary for column: max_mpg
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 25.00 28.00 28.42 31.00 100.00
Summary for column: damaged
0 1
315 95
Summary for column: first_owner
0 1
200 210
Summary for column: navigation_system
0 1
228 182
Summary for column: bluetooth
0 1
53 357
Summary for column: third_row_seating
0 1
374 36
Summary for column: heated_seats
0 1
232 178
Summary for column: price
Min. 1st Qu. Median Mean 3rd Qu. Max.
5850 18991 28512 28399 37564 54995
#############################
#2 Outliers Detection and Correction
# Visualize outliers for numeric columns using boxplots
numerical_columns <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")
# Distance of each value from a reference median
calculate_distance_from_median <- function(values, median_val) {
  abs(values - median_val)
}
# Using boxplot.stats to identify outliers and rank them by distance from the column median
for (col in numerical_columns) {
  box_data <- boxplot.stats(mydf[[col]])
  outliers <- box_data$out
  # Use the median of the whole column, not the median of the outliers themselves
  col_median <- median(mydf[[col]], na.rm = TRUE)
  distances_from_median <- calculate_distance_from_median(outliers, col_median)
  # Rank outliers by their distance from the median in descending order
  outliers_ranked <- outliers[order(distances_from_median, decreasing = TRUE)]
  cat("Outliers in", col, ":", outliers_ranked, "\n")
  if (length(outliers_ranked) == 0) {
    print("No outliers")
  }
}
Outliers in year : 1953 1968 1970 1985 1998 2003 2005 2005 2005
Outliers in mileage : 190312 172844 168155 154744 150084 143404 135750 135629 135201
Outliers in engine_size : 6.4 6.2 6.2 6.2 6 5.7 5.7 5.7 5.6 5.6 5.6 5.6 5.6 5.6 5.3 5.3 5.3
Outliers in min_mpg : 89 49 48 48 43 43 0 0 0 0 0 0 38 35 11
Outliers in max_mpg : 100 0 0 0 55 51 48 44 43 14 41
Outliers in price :
[1] "No outliers"
result <- mydf[mydf$min_mpg == 89, ]
print(result)
result <- mydf[mydf$max_mpg == 100, ]
print(result)
#result is 1 row, imputing it with median
median_min_mpg <- median(mydf$min_mpg, na.rm = TRUE)
median_max_mpg <- median(mydf$max_mpg, na.rm = TRUE)
mydf$min_mpg[mydf$min_mpg == 89] <- median_min_mpg
mydf$max_mpg[mydf$max_mpg == 100] <- median_max_mpg
###########################################################################################
#3. Univariate Analysis
##################### NUMERIC ################
numerical_cols <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")
categorical_cols <- c("brand", "automatic_transmission", "fuel", "drivetrain", "damaged",
"first_owner", "navigation_system", "bluetooth", "third_row_seating",
"heated_seats")
# Boxplots for numerical variables (separately)
for (col in numerical_cols) {
boxplot(mydf[[col]], main = col, ylab = col, col = "skyblue", border = "black", notch = TRUE)
}
################## CATEGORICAL ####################
# Bar plots for categorical variables (separately)
for (col in categorical_cols) {
p <- ggplot(mydf, aes_string(x = col)) +
geom_bar(fill = "skyblue") +
labs(title = paste("Bar plot of", col), x = col, y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
print(p)
}
#3. Correlation of Variables
########### (a) NUMERICAL CORRELATION (and representation as corrplot) ########################
# Compute correlations between numerical columns
correlations <- cor(mydf[, sapply(mydf, is.numeric)], method = "pearson")
# Displaying the correlation matrix with corrplot
corrplot(correlations, method = "circle", type = "upper", order = "hclust")
# Exclude diagonal elements (correlation of variables with themselves)
diag(correlations) <- NA
# Get upper triangle of the correlation matrix (excluding diagonal)
upper_tri <- as.data.frame(as.table(correlations))
upper_tri <- upper_tri[upper_tri$Var1 != upper_tri$Var2, ]
# Create a unique identifier for each pair of variables
upper_tri$CombinedVars <- apply(upper_tri[, c("Var1", "Var2")], 1, function(x) paste(sort(x), collapse="-"))
# Remove Var1 and Var2 columns
sorted_correlations <- upper_tri[, c("CombinedVars", "Freq")]
# Aggregate by CombinedVars to get the maximum absolute correlation value
sorted_correlations <- aggregate(Freq ~ CombinedVars, data = sorted_correlations, FUN = max)
# Sort correlations in descending order
sorted_correlations <- sorted_correlations[order(-abs(sorted_correlations$Freq)), ]
print(sorted_correlations)
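A shorter route to the same ranked list of pairs is to index the upper triangle of the correlation matrix directly. This is only a sketch on simulated data; `toy`, `cm`, and `pairs` are hypothetical names.

```r
# Simulated data: b is built from a, c is unrelated noise
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)
toy$c <- rnorm(50)

cm <- cor(toy, method = "pearson")

# Row/column positions of the cells above the diagonal: each pair appears once
idx <- which(upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest relationships first, regardless of sign
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Because `upper.tri()` excludes the diagonal and the mirrored lower half, no deduplication or aggregation step is needed afterwards.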
########### (b) CATEGORICAL CORRELATION (and representation as heatmap) ########################
# Create a function to calculate Cramer's V
cramers_v <- function(x, y) {
return(assocstats(table(x, y))$cramer)
}
# Create an empty matrix to store the correlation values
correlation_matrix <- matrix(NA, nrow = length(categorical_cols), ncol = length(categorical_cols))
colnames(correlation_matrix) <- rownames(correlation_matrix) <- categorical_cols
# Calculate Cramer's V for each pair of categorical variables
for (i in 1:(length(categorical_cols) - 1)) {
for (j in (i + 1):length(categorical_cols)) {
correlation_matrix[i, j] <- cramers_v(mydf[[categorical_cols[i]]], mydf[[categorical_cols[j]]])
correlation_matrix[j, i] <- correlation_matrix[i, j]
}
}
# Flatten the upper triangle of the correlation matrix to extract pairs and their correlations
upper_triangle <- as.data.frame(as.table(correlation_matrix))
upper_triangle <- upper_triangle[order(-upper_triangle$Freq), ]
# Convert factors to characters (if necessary)
upper_triangle$Var1 <- as.character(upper_triangle$Var1)
upper_triangle$Var2 <- as.character(upper_triangle$Var2)
# Extract unique combinations of Var1, Var2, and Correlation
unique_triangle <- unique(transform(upper_triangle,
Var1 = pmin(Var1, Var2),
Var2 = pmax(Var1, Var2))
)[, c("Var1", "Var2", "Freq")]
unique_triangle
# Convert the matrix to a data frame for plotting
correlation_df <- expand.grid(Var1 = categorical_cols, Var2 = categorical_cols)
correlation_df$Correlation <- as.vector(correlation_matrix)
# Plot heatmap
ggplot(correlation_df, aes(Var1, Var2, fill = Correlation)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Categorical Variables Heatmap", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
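Cramer's V can also be computed from base R alone, which is a useful cross-check on the packaged function. The sketch below uses the standard formula V = sqrt(X² / (n (k - 1))), where k is the smaller of the number of rows and columns; `cramers_v_manual()` and the example vectors are hypothetical.

```r
# Cramer's V from the Pearson chi-squared statistic (no continuity correction)
cramers_v_manual <- function(x, y) {
  tab <- table(x, y)
  # suppressWarnings(): small tables trigger an approximation warning that
  # does not matter here because only the statistic is used
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  k <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (sum(tab) * (k - 1))))
}

# Perfect association: y is fully determined by x
x1 <- c("a", "a", "b", "b", "a", "b")
y1 <- c("u", "u", "v", "v", "u", "v")
v_perfect <- cramers_v_manual(x1, y1)

# No association: every cell has the same count
x2 <- rep(c("a", "b"), each = 4)
y2 <- rep(c("u", "v"), times = 4)
v_none <- cramers_v_manual(x2, y2)

print(c(perfect = v_perfect, none = v_none))
```

Values near 0 indicate little association and values near 1 a strong one, so the brand pairs plotted later (around 0.4 to 0.5) sit in the moderate range.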
######################################################################
#4.Bivariate and Multivariate Analysis
#Plotting STRONGEST correlation graphs for numerical-numerical vars: (GENERAL)
#1. min_mpg-max_mpg (0.93)
ggplot(mydf, aes(x = min_mpg, y = max_mpg)) +
geom_point() +
labs(x = "min_mpg", y = "max_mpg") +
ggtitle("Scatter plot of min_mpg vs max_mpg")
#2. price-mileage (-0.6)
ggplot(mydf, aes(x = mileage, y = price)) +
geom_point() +
labs(x = "Mileage", y = "Price") +
ggtitle("Scatter plot of Price vs Mileage")
###########################################################################
#Plotting STRONGEST correlation graphs for categorical-categorical vars: (GENERAL)
#1. brand-navigation_system(0.53)
ggplot(mydf, aes(x = brand, fill = navigation_system)) +
geom_bar(position = "dodge", color = "black") +
labs(title = "Brand vs Navigation System", x = "Brand", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
#2. brand_first_owner(0.40)
ggplot(mydf, aes(x = brand, fill = first_owner)) +
geom_bar(position = "dodge", color = "black") +
labs(title = "Brand vs First Owner", x = "Brand", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
#5. EDA with respect to PRICE
numerical_cols <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")
categorical_cols <- c("brand", "automatic_transmission", "fuel", "drivetrain", "damaged",
"first_owner", "navigation_system", "bluetooth", "third_row_seating",
"heated_seats")
numerical_vars <- mydf[, sapply(mydf, is.numeric)]
numerical_vars <- numerical_vars[, !names(numerical_vars) %in% "price"]
# (a) Calculate correlations with 'price' for each numerical variable
correlations <- sapply(numerical_vars, function(x) cor(mydf$price, x))
correlations
year mileage engine_size min_mpg max_mpg
0.4385783 -0.6077576 0.3027888 -0.2610210 -0.3093242
# (highest correlations between price and mileage: -0.607 and price and year: 0.438)
# (b) ANOVA to explore how 'price' varies across categories of categorical predictors.############
for(col in categorical_cols){
anova_result <- aov(mydf$price ~ mydf[[col]])
cat("ANOVA between 'price' and '", col, "':\n")
print(summary(anova_result))
}
ANOVA between 'price' and ' brand ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 24 2.040e+10 849886055 8.728 <2e-16 ***
Residuals 385 3.749e+10 97369157
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' automatic_transmission ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 2.420e+09 2.420e+09 17.8 3.02e-05 ***
Residuals 408 5.546e+10 1.359e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' fuel ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 5 1.452e+09 290436405 2.079 0.0671 .
Residuals 404 5.643e+10 139683685
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' drivetrain ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 4 1.600e+10 3.999e+09 38.66 <2e-16 ***
Residuals 405 4.189e+10 1.034e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' damaged ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 2.150e+09 2.150e+09 15.74 8.59e-05 ***
Residuals 408 5.573e+10 1.366e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' first_owner ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 8.852e+09 8.852e+09 73.66 <2e-16 ***
Residuals 408 4.903e+10 1.202e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' navigation_system ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 7.626e+09 7.626e+09 61.91 3.26e-14 ***
Residuals 408 5.026e+10 1.232e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' bluetooth ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 2.155e+09 2.155e+09 15.78 8.43e-05 ***
Residuals 408 5.573e+10 1.366e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' third_row_seating ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 2.520e+09 2.520e+09 18.57 2.06e-05 ***
Residuals 408 5.536e+10 1.357e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'price' and ' heated_seats ':
Df Sum Sq Mean Sq F value Pr(>F)
mydf[[col]] 1 4.757e+09 4.757e+09 36.53 3.39e-09 ***
Residuals 408 5.313e+10 1.302e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# (fuel is the only predictor that is not significant at the 5% level: its p-value of 0.0671 is borderline, while every other predictor has p < 0.001)
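To compare predictors at a glance, the p-values can be collected into one named vector instead of being read off each printed table. This is a sketch on simulated data; `toy` and `predictors` are hypothetical, and the simulated prices are not taken from the car data.

```r
# Simulated data: price differs strongly by first_owner, fuel is random
set.seed(42)
toy <- data.frame(
  price       = c(rnorm(30, 20000, 3000), rnorm(30, 30000, 3000)),
  first_owner = factor(rep(c(0, 1), each = 30)),
  fuel        = factor(sample(c("Petrol", "Hybrid"), 60, replace = TRUE))
)

predictors <- c("first_owner", "fuel")

# One-way ANOVA per predictor; keep only the p-value of the predictor term
p_values <- sapply(predictors, function(col) {
  fit <- aov(reformulate(col, response = "price"), data = toy)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

print(sort(p_values))
```

Using `reformulate()` with a `data` argument also labels the term with the variable name in the printed table, rather than `mydf[[col]]`.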
# (c) Visualisation of Price and other variables
# Plotting Price against all categorical variables
for (col in names(mydf)[sapply(mydf, is.factor)]) {
p <- ggplot(mydf, aes_string(x = col, y = "price")) +
geom_boxplot() +
labs(title = paste("Price vs.", col), x = col, y = "Price") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
print(p)
}
# Plotting Price against all numerical variables
for (col in names(mydf)[sapply(mydf, is.numeric)]) {
if (col != "price") {
p <- ggplot(mydf, aes_string(x = col, y = "price")) +
geom_point() +
labs(title = paste("Price vs.", col), x = col, y = "Price")
print(p)
}
}
#6. EDA with respect to first_owner
# (a) Chi-sq test/Fishers to explore relationships with other categorical variables
numerical_cols <- c("year", "mileage", "engine_size", "min_mpg", "max_mpg", "price")
categorical_cols <- c("brand", "automatic_transmission", "fuel", "drivetrain", "damaged",
"navigation_system", "bluetooth", "third_row_seating", "heated_seats")
for (col in categorical_cols) {
contingency_table <- table(mydf$first_owner, mydf[[col]])
# Check counts in the contingency table
counts_below_5 <- sum(contingency_table < 5)
counts_above_5 <- sum(contingency_table >= 5)
if (counts_below_5 > counts_above_5) {
cat("Column", col, "has most counts below 5. Performing Fisher's Exact Test.\n")
fisher_test <- tryCatch(fisher.test(contingency_table, simulate.p.value = TRUE), error = function(e) e)
if (!inherits(fisher_test, "error")) {
print(fisher_test)
} else {
cat("Fisher's exact test couldn't be performed.\n")
}
} else {
cat("Column", col, "has most counts above or equal to 5. Performing Chi-square Test.\n")
chisq_test <- tryCatch(chisq.test(contingency_table), error = function(e) e)
if (!inherits(chisq_test, "error")) {
print(chisq_test)
} else {
cat("Chi-square test couldn't be performed.\n")
}
}
}
Column brand has most counts above or equal to 5. Performing Chi-square Test.
Warning: Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: contingency_table
X-squared = 66.103, df = 24, p-value = 8.376e-06
Column automatic_transmission has most counts above or equal to 5. Performing Chi-square Test.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 11.661, df = 1, p-value = 0.0006383
Column fuel has most counts below 5. Performing Fisher's Exact Test.
Fisher's Exact Test for Count Data with simulated p-value (based on 2000 replicates)
data: contingency_table
p-value = 0.1024
alternative hypothesis: two.sided
Column drivetrain has most counts above or equal to 5. Performing Chi-square Test.
Warning: Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: contingency_table
X-squared = 9.5315, df = 4, p-value = 0.0491
Column damaged has most counts above or equal to 5. Performing Chi-square Test.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 3.65, df = 1, p-value = 0.05607
Column navigation_system has most counts above or equal to 5. Performing Chi-square Test.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 0.42557, df = 1, p-value = 0.5142
Column bluetooth has most counts above or equal to 5. Performing Chi-square Test.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 1.1531, df = 1, p-value = 0.2829
Column third_row_seating has most counts above or equal to 5. Performing Chi-square Test.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 7.9196, df = 1, p-value = 0.00489
Column heated_seats has most counts above or equal to 5. Performing Chi-square Test.
Pearson's Chi-squared test with Yates' continuity correction
data: contingency_table
X-squared = 5.1003, df = 1, p-value = 0.02392
# No significant relationship at the 5% level between first_owner and these variables (p-values):
# fuel: 0.1024 (simulated, so it varies slightly between runs)
# damaged: 0.05607 (borderline)
# navigation_system: 0.5142
# bluetooth: 0.2829
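A common rule of thumb bases the choice between the two tests on the expected counts rather than the observed ones. The sketch below shows that variant; `choose_test()` and both tables are hypothetical, and the threshold of 5 is the usual convention rather than something fixed by this analysis.

```r
# Pick a test from the expected cell counts under independence
choose_test <- function(tab) {
  # suppressWarnings(): the approximation warning is exactly what is being checked
  expected <- suppressWarnings(chisq.test(tab))$expected
  if (any(expected < 5)) "fisher" else "chisq"
}

# Hypothetical 2x2 tables: one sparse, one well populated
small <- matrix(c(3, 1, 1, 3), nrow = 2)
large <- matrix(c(30, 10, 10, 30), nrow = 2)

print(choose_test(small))  # "fisher"
print(choose_test(large))  # "chisq"
```

This would have routed brand and drivetrain, which triggered the "approximation may be incorrect" warning, to Fisher's test or to a simulated p-value.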
# (b) Visualisation of first_owner and other variables
# Plotting First Owner against all categorical variables
for (col in names(mydf)[sapply(mydf, is.factor)]) {
if (col != "first_owner") {
p <- ggplot(mydf, aes_string(x = col, fill = "first_owner")) +
geom_bar(position = "dodge") +
labs(title = paste("First Owner vs.", col), x = col, y = "Count") +
scale_fill_discrete(name = "First Owner") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
print(p)
}
}
# Plotting all numerical variables against First Owner
for (col in names(mydf)[sapply(mydf, is.numeric)]) {
  # first_owner goes on the x axis so that the axis labels match the mapping
  p <- ggplot(mydf, aes_string(x = "first_owner", y = col, fill = "first_owner")) +
    geom_boxplot() +
    labs(title = paste("First Owner vs.", col), x = "First Owner", y = col)
  print(p)
}
There were outliers in year, mileage, engine_size, min_mpg and max_mpg. There were no outliers in price, and its distribution looked approximately normal.
The farthest outliers from the median in max_mpg and min_mpg came from the same row, so I replaced them with their respective column medians.
Strong negative correlation (using cor() function) between:
Weak significance between price and fuel (p = 0.067) using ANOVA.
Strong relationship of:
Weak significant relationship between:
First owners had more cars with automatic transmission, petrol fuel, four-wheel drive, damage and Bluetooth, and fewer with a navigation system or third-row seating.
First owners had more expensive cars, from more recent years and with lower mileage.
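The outlier handling described above (flag with a boxplot, then replace the most extreme value with the column median) can be sketched as follows. The helper name `impute_farthest_outlier` and the example values are made up for illustration. The 1.5 * IQR rule is the default used by `boxplot.stats()`, which is assumed here to match the boxplots in the EDA.

```r
# Hypothetical helper: replace the single value farthest from the mean with the
# column median, but only if the boxplot rule (1.5 * IQR) also flags it.
impute_farthest_outlier <- function(x) {
  flagged <- boxplot.stats(x)$out
  farthest <- which.max(abs(x - mean(x, na.rm = TRUE)))
  if (x[farthest] %in% flagged) {
    # The median is computed on the original column, outlier included
    x[farthest] <- median(x, na.rm = TRUE)
  }
  x
}

mpg_example <- c(22, 24, 25, 23, 26, 24, 25, 120)  # made-up values; 120 is extreme
cleaned <- impute_farthest_outlier(mpg_example)
print(cleaned)
```

The median is preferred over the mean here because the mean is itself pulled towards the outlier.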
Addressing multicollinearity (found in EDA): min_mpg and max_mpg are merged into a single variable so that their strong correlation does not interfere with the model.
Baseline Linear Regression:
Stepwise Selection with step() Function:
Visualizing Model Improvement:
Reducing heteroscedasticity, if found, by taking the log of price.
Category Level Reduction:
Model Evaluation:
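One way to quantify the multicollinearity mentioned in the first step is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below uses simulated data and a hypothetical helper `vif_manual`; the column names mirror the dataset, but the values are invented. Values well above 5 to 10 are usually read as problematic.

```r
# Hypothetical helper: VIF for every column of a data frame of numeric predictors
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
# Simulated predictors: max_mpg is min_mpg plus a small offset, so the two overlap
min_mpg     <- runif(60, 15, 35)
max_mpg     <- min_mpg + runif(60, 4, 8)
engine_size <- runif(60, 1, 5)

before <- vif_manual(data.frame(min_mpg, max_mpg, engine_size))
print(round(before, 1))

# Merge the correlated pair into their average and recompute
avg_mpg <- (min_mpg + max_mpg) / 2
after <- vif_manual(data.frame(avg_mpg, engine_size))
print(round(after, 1))
```

In this simulated case the inflated VIFs of the two mpg columns disappear once they are replaced by their average.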
# Create a new feature using the average of min_mpg and max_mpg
mydf$avg_mpg <- (mydf$min_mpg + mydf$max_mpg) / 2
# Remove min_mpg and max_mpg from the dataset
mydf <- subset(mydf, select = -c(min_mpg, max_mpg))
mydf$avg_mpg
[1] 32.0 32.0 21.0 25.0 25.5 15.5 7.0 21.0 23.5 28.5 37.0 26.5 21.5 26.0 18.0 24.0 24.5 20.0 43.0
[20] 25.5 24.0 21.0 29.5 31.5 17.5 24.5 23.0 24.5 24.5 29.5 23.0 19.0 23.0 26.0 15.5 26.5 24.5 24.5
[39] 18.5 28.5 17.0 26.0 26.0 29.0 34.0 27.0 22.0 29.5 27.0 31.5 23.5 32.5 28.0 26.0 23.5 25.5 24.5
[58] 25.0 31.0 24.5 24.0 29.0 22.0 26.5 24.5 21.5 20.5 22.0 35.5 21.5 17.0 24.5 34.0 30.5 24.5 24.5
[77] 20.5 24.5 35.5 29.0 27.5 23.5 25.0 25.0 24.0 48.0 24.5 22.5 29.0 24.0 23.5 26.0 25.0 28.5 27.0
[96] 24.5 16.5 28.0 24.5 24.5 25.0 24.5 31.0 21.5 24.5 28.0 24.5 29.5 20.5 24.0 20.0 34.0 25.5 29.5
[115] 20.0 30.5 16.5 26.0 28.5 26.5 24.0 21.0 35.5 23.0 24.5 21.5 39.5 30.0 24.5 22.0 27.0 30.5 30.5
[134] 23.5 24.0 30.0 16.5 30.5 24.5 17.0 32.0 33.5 16.5 35.5 19.0 30.0 26.0 19.0 20.0 24.5 24.5 24.5
[153] 10.5 25.0 26.0 33.5 24.5 20.0 24.0 23.0 24.5 29.0 25.0 17.0 24.5 25.5 30.5 24.5 21.0 24.5 24.5
[172] 21.5 24.5 24.5 20.5 24.5 25.0 19.5 23.5 27.0 21.0 22.0 26.0 35.5 30.0 27.0 30.5 19.5 24.5 24.0
[191] 26.0 24.5 26.0 26.0 29.5 23.5 17.5 24.5 30.0 22.0 26.0 21.5 17.5 20.5 28.5 24.5 15.0 0.0 25.0
[210] 21.5 13.5 18.5 21.5 20.0 24.5 20.0 24.5 20.5 22.5 24.0 24.5 19.5 22.0 24.5 29.5 21.5 29.0 24.5
[229] 25.5 21.0 27.5 21.0 24.5 31.5 19.5 24.5 23.5 18.0 20.0 30.5 24.5 24.0 27.0 25.0 29.5 20.0 23.0
[248] 24.5 24.0 27.0 25.5 25.5 23.5 17.0 24.0 29.5 27.0 19.5 27.5 24.5 24.0 30.5 20.0 24.5 25.0 22.0
[267] 28.5 25.0 33.0 30.5 0.0 18.5 25.5 29.0 25.0 20.5 26.0 27.5 19.0 52.0 21.5 26.0 20.5 21.5 23.0
[286] 30.0 31.5 20.0 24.5 23.5 18.5 25.0 26.0 29.0 24.5 23.0 26.0 14.5 24.5 21.5 24.5 24.5 25.5 19.5
[305] 26.5 26.0 25.0 25.5 19.0 21.5 23.0 31.0 35.5 30.5 28.0 34.0 24.5 24.5 26.5 43.5 23.5 32.0 21.5
[324] 34.5 25.5 16.5 31.5 28.0 24.5 24.5 20.0 26.0 26.5 21.5 27.0 25.0 24.5 18.5 18.0 22.0 17.5 26.5
[343] 49.5 24.5 29.5 23.0 28.5 16.5 24.5 30.5 24.5 28.5 21.5 17.0 16.5 24.5 22.0 30.5 20.0 28.5 24.5
[362] 30.0 15.5 27.5 21.5 29.0 24.5 35.0 33.0 28.0 20.5 24.5 28.5 22.5 25.5 20.5 22.0 22.5 19.0 16.0
[381] 23.5 16.5 21.5 15.0 34.0 26.0 24.0 33.0 31.0 19.5 26.5 17.5 15.5 25.5 24.5 19.0 22.5 24.0 0.0
[400] 25.5 24.5 33.0 21.0 33.0 17.5 20.5 22.5 24.5 26.5 29.5
#Building Initial Model
model_lm <- lm(price ~ year + mileage + engine_size + avg_mpg + brand + automatic_transmission + fuel + drivetrain + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf)
# Summary of the model
summary(model_lm)
Call:
lm(formula = price ~ year + mileage + engine_size + avg_mpg +
brand + automatic_transmission + fuel + drivetrain + damaged +
first_owner + navigation_system + bluetooth + third_row_seating +
heated_seats, data = mydf)
Residuals:
Min 1Q Median 3Q Max
-15533.6 -3486.3 -80.7 2922.2 16689.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.303e+06 1.671e+05 -7.796 6.76e-14 ***
year 6.778e+02 8.479e+01 7.994 1.74e-14 ***
mileage -1.452e-01 1.057e-02 -13.747 < 2e-16 ***
engine_size 2.314e+03 3.928e+02 5.891 8.72e-09 ***
avg_mpg -1.267e+02 6.023e+01 -2.103 0.036171 *
brandAudi 4.654e+03 1.770e+03 2.629 0.008914 **
brandBMW -2.347e+03 2.305e+03 -1.018 0.309301
brandCadillac 1.205e+03 1.850e+03 0.652 0.515089
brandChevrolet -6.163e+02 2.205e+03 -0.279 0.780055
brandFIAT -4.772e+03 1.891e+03 -2.523 0.012060 *
brandFord 2.999e+03 2.220e+03 1.351 0.177675
brandHonda -6.814e+02 1.768e+03 -0.385 0.700159
brandHyundai -4.606e+03 1.774e+03 -2.596 0.009819 **
brandJaguar -3.050e+03 1.940e+03 -1.572 0.116720
brandJeep 1.047e+03 2.123e+03 0.493 0.622076
brandKia -6.806e+02 2.139e+03 -0.318 0.750528
brandLand -4.192e+02 2.155e+03 -0.195 0.845885
brandLexus 4.294e+03 2.105e+03 2.040 0.042030 *
brandMaserati 3.586e+03 1.944e+03 1.844 0.065921 .
brandMazda -6.192e+03 1.907e+03 -3.247 0.001274 **
brandMercedes-Benz -1.718e+02 2.242e+03 -0.077 0.938957
brandMINI -1.722e+03 1.835e+03 -0.938 0.348694
brandMitsubishi -8.327e+03 1.925e+03 -4.325 1.97e-05 ***
brandNissan -5.322e+03 1.936e+03 -2.749 0.006276 **
brandPorsche 6.696e+03 2.457e+03 2.726 0.006725 **
brandSuzuki -1.156e+04 3.084e+03 -3.748 0.000207 ***
brandToyota 1.660e+02 2.070e+03 0.080 0.936107
brandVolkswagen -5.328e+02 1.921e+03 -0.277 0.781676
brandVolvo 5.352e+03 2.169e+03 2.468 0.014052 *
automatic_transmission1 -9.919e+02 1.131e+03 -0.877 0.381210
fuelElectric -1.367e+04 4.520e+03 -3.025 0.002662 **
fuelGPL -1.295e+04 4.703e+03 -2.752 0.006210 **
fuelHybrid -5.303e+03 4.324e+03 -1.226 0.220832
fuelPetrol -8.841e+03 3.956e+03 -2.235 0.026049 *
fuelUnknown -5.952e+03 5.201e+03 -1.144 0.253271
drivetrainFour-wheel Drive -2.160e+04 6.771e+03 -3.190 0.001543 **
drivetrainFront-wheel Drive -2.535e+04 6.728e+03 -3.769 0.000191 ***
drivetrainRear-wheel Drive -2.136e+04 6.731e+03 -3.174 0.001632 **
drivetrainUnknown -9.427e+03 6.481e+03 -1.455 0.146643
damaged1 -1.158e+03 6.791e+02 -1.705 0.089077 .
first_owner1 1.457e+03 6.688e+02 2.179 0.029946 *
navigation_system1 3.157e+03 7.320e+02 4.313 2.08e-05 ***
bluetooth1 -1.058e+03 1.021e+03 -1.036 0.300739
third_row_seating1 3.031e+03 1.087e+03 2.789 0.005572 **
heated_seats1 6.545e+02 6.379e+02 1.026 0.305591
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5327 on 365 degrees of freedom
Multiple R-squared: 0.821, Adjusted R-squared: 0.7995
F-statistic: 38.06 on 44 and 365 DF, p-value: < 2.2e-16
step(model_lm)
Start: AIC=7078.43
price ~ year + mileage + engine_size + avg_mpg + brand + automatic_transmission +
fuel + drivetrain + damaged + first_owner + navigation_system +
bluetooth + third_row_seating + heated_seats
Df Sum of Sq RSS AIC
- automatic_transmission 1 21814847 1.0381e+10 7077.3
- heated_seats 1 29872966 1.0389e+10 7077.6
- bluetooth 1 30479476 1.0389e+10 7077.6
<none> 1.0359e+10 7078.4
- damaged 1 82486188 1.0441e+10 7079.7
- avg_mpg 1 125486826 1.0484e+10 7081.4
- first_owner 1 134791797 1.0494e+10 7081.7
- third_row_seating 1 220682717 1.0580e+10 7085.1
- fuel 5 499956618 1.0859e+10 7087.8
- navigation_system 1 527856982 1.0887e+10 7096.8
- drivetrain 4 1120077285 1.1479e+10 7112.5
- engine_size 1 984917215 1.1344e+10 7113.7
- year 1 1813547636 1.2172e+10 7142.6
- brand 24 4773028097 1.5132e+10 7185.8
- mileage 1 5363033737 1.5722e+10 7247.5
Step: AIC=7077.29
price ~ year + mileage + engine_size + avg_mpg + brand + fuel +
drivetrain + damaged + first_owner + navigation_system +
bluetooth + third_row_seating + heated_seats
Df Sum of Sq RSS AIC
- bluetooth 1 25363261 1.0406e+10 7076.3
- heated_seats 1 31558668 1.0412e+10 7076.5
<none> 1.0381e+10 7077.3
- damaged 1 94167027 1.0475e+10 7079.0
- first_owner 1 123622272 1.0504e+10 7080.1
- avg_mpg 1 125750362 1.0506e+10 7080.2
- third_row_seating 1 217807638 1.0598e+10 7083.8
- fuel 5 504953063 1.0886e+10 7086.8
- navigation_system 1 550296692 1.0931e+10 7096.5
- drivetrain 4 1106024579 1.1487e+10 7110.8
- engine_size 1 965584137 1.1346e+10 7111.8
- year 1 1820228375 1.2201e+10 7141.5
- brand 24 4762871074 1.5144e+10 7184.1
- mileage 1 5457731214 1.5838e+10 7248.5
Step: AIC=7076.29
price ~ year + mileage + engine_size + avg_mpg + brand + fuel +
drivetrain + damaged + first_owner + navigation_system +
third_row_seating + heated_seats
Df Sum of Sq RSS AIC
- heated_seats 1 30806016 1.0437e+10 7075.5
<none> 1.0406e+10 7076.3
- damaged 1 101890992 1.0508e+10 7078.3
- avg_mpg 1 127093672 1.0533e+10 7079.3
- first_owner 1 135240036 1.0541e+10 7079.6
- third_row_seating 1 223981399 1.0630e+10 7083.0
- fuel 5 502037419 1.0908e+10 7085.6
- navigation_system 1 529500315 1.0936e+10 7094.6
- drivetrain 4 1081380072 1.1487e+10 7108.8
- engine_size 1 945517276 1.1352e+10 7110.0
- year 1 2043882378 1.2450e+10 7147.8
- brand 24 4773993476 1.5180e+10 7183.1
- mileage 1 5558023780 1.5964e+10 7249.8
Step: AIC=7075.51
price ~ year + mileage + engine_size + avg_mpg + brand + fuel +
drivetrain + damaged + first_owner + navigation_system +
third_row_seating
Df Sum of Sq RSS AIC
<none> 1.0437e+10 7075.5
- damaged 1 104059671 1.0541e+10 7077.6
- avg_mpg 1 132708109 1.0570e+10 7078.7
- first_owner 1 134725615 1.0572e+10 7078.8
- third_row_seating 1 232306444 1.0669e+10 7082.5
- fuel 5 489938383 1.0927e+10 7084.3
- navigation_system 1 664304888 1.1101e+10 7098.8
- engine_size 1 927846608 1.1365e+10 7108.4
- drivetrain 4 1107062007 1.1544e+10 7108.8
- year 1 2161701801 1.2599e+10 7150.7
- brand 24 4857577548 1.5294e+10 7184.2
- mileage 1 5586474177 1.6023e+10 7249.3
Call:
lm(formula = price ~ year + mileage + engine_size + avg_mpg +
brand + fuel + drivetrain + damaged + first_owner + navigation_system +
third_row_seating, data = mydf)
Coefficients:
(Intercept) year mileage
-1.222e+06 6.366e+02 -1.472e-01
engine_size avg_mpg brandAudi
2.222e+03 -1.301e+02 5.015e+03
brandBMW brandCadillac brandChevrolet
-2.240e+03 1.193e+03 -6.135e+02
brandFIAT brandFord brandHonda
-4.824e+03 3.426e+03 -4.267e+02
brandHyundai brandJaguar brandJeep
-4.418e+03 -3.133e+03 1.381e+03
brandKia brandLand brandLexus
-5.110e+02 -6.307e+02 4.454e+03
brandMaserati brandMazda brandMercedes-Benz
3.685e+03 -5.909e+03 -2.014e+02
brandMINI brandMitsubishi brandNissan
-1.359e+03 -8.194e+03 -5.061e+03
brandPorsche brandSuzuki brandToyota
6.904e+03 -1.117e+04 3.858e+02
brandVolkswagen brandVolvo fuelElectric
-5.240e+01 5.513e+03 -1.372e+04
fuelGPL fuelHybrid fuelPetrol
-1.307e+04 -5.573e+03 -8.920e+03
fuelUnknown drivetrainFour-wheel Drive drivetrainFront-wheel Drive
-6.130e+03 -2.085e+04 -2.453e+04
drivetrainRear-wheel Drive drivetrainUnknown damaged1
-2.040e+04 -9.197e+03 -1.288e+03
first_owner1 navigation_system1 third_row_seating1
1.440e+03 3.347e+03 3.103e+03
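The AIC column printed by `step()` comes from `extractAIC()`, which for a linear model is n * log(RSS / n) + 2p, where p is the number of estimated coefficients. For the starting model above, n = 410, RSS = 1.0359e+10 and p = 45 give roughly 7078.4, in line with the first table. The sketch below checks the formula on the built-in `mtcars` data (not on `mydf`) and shows that the selected model can be stored directly instead of retyping its formula.

```r
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
p   <- length(coef(fit))           # estimated coefficients, including the intercept

aic_by_hand <- n * log(rss / n) + 2 * p
aic_step    <- extractAIC(fit)[2]  # the value step() reports
print(c(by_hand = aic_by_hand, extractAIC = aic_step))

# step() returns the selected model; trace = 0 suppresses the per-step tables
selected <- step(fit, trace = 0)
print(formula(selected))
```

Storing the result of `step()` avoids copy-and-paste errors when the selected formula is long.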
plot(model_lm)
Warning: not plotting observations with leverage one:
94
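A leverage of one means the fitted value for that row is determined entirely by the row itself. One common cause is a factor level that occurs only once, so its coefficient is fitted exactly to that observation. Whether this explains observation 94 here is an assumption that would need checking against the level counts of `brand`, `fuel` and `drivetrain`. The sketch uses made-up data to show how `hatvalues()` locates such rows.

```r
set.seed(2)
toy <- data.frame(
  price = rnorm(10, mean = 20000, sd = 3000),
  brand = factor(c(rep("A", 5), rep("B", 4), "C"))  # level "C" occurs only once
)
fit <- lm(price ~ brand, data = toy)

lev  <- hatvalues(fit)
high <- which(lev > 1 - 1e-8)   # small tolerance for floating-point error
print(toy[high, ])              # the row with leverage one
print(table(toy$brand))         # confirms the level has a single observation
```

Such a row always has a residual of essentially zero, so it contributes nothing to the assessment of fit.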
model1 <-lm(price ~ year + mileage + engine_size + avg_mpg +
brand + fuel + drivetrain + damaged + first_owner + navigation_system +
third_row_seating, data = mydf)
summary(model1)
Call:
lm(formula = price ~ year + mileage + engine_size + avg_mpg +
brand + fuel + drivetrain + damaged + first_owner + navigation_system +
third_row_seating, data = mydf)
Residuals:
Min 1Q Median 3Q Max
-15026.3 -3563.2 -112.5 2906.2 16379.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.222e+06 1.436e+05 -8.510 4.46e-16 ***
year 6.366e+02 7.292e+01 8.730 < 2e-16 ***
mileage -1.472e-01 1.049e-02 -14.035 < 2e-16 ***
engine_size 2.222e+03 3.884e+02 5.720 2.21e-08 ***
avg_mpg -1.301e+02 6.015e+01 -2.163 0.031173 *
brandAudi 5.015e+03 1.754e+03 2.860 0.004480 **
brandBMW -2.240e+03 2.303e+03 -0.973 0.331331
brandCadillac 1.193e+03 1.846e+03 0.647 0.518311
brandChevrolet -6.135e+02 2.204e+03 -0.278 0.780907
brandFIAT -4.824e+03 1.883e+03 -2.563 0.010786 *
brandFord 3.426e+03 2.192e+03 1.563 0.118944
brandHonda -4.267e+02 1.759e+03 -0.243 0.808466
brandHyundai -4.418e+03 1.769e+03 -2.498 0.012935 *
brandJaguar -3.133e+03 1.933e+03 -1.621 0.105910
brandJeep 1.381e+03 2.080e+03 0.664 0.507007
brandKia -5.110e+02 2.135e+03 -0.239 0.810975
brandLand -6.307e+02 2.150e+03 -0.293 0.769389
brandLexus 4.454e+03 2.097e+03 2.124 0.034328 *
brandMaserati 3.685e+03 1.934e+03 1.906 0.057456 .
brandMazda -5.909e+03 1.897e+03 -3.116 0.001980 **
brandMercedes-Benz -2.014e+02 2.240e+03 -0.090 0.928422
brandMINI -1.359e+03 1.812e+03 -0.750 0.453682
brandMitsubishi -8.194e+03 1.919e+03 -4.269 2.50e-05 ***
brandNissan -5.061e+03 1.928e+03 -2.624 0.009041 **
brandPorsche 6.904e+03 2.430e+03 2.842 0.004738 **
brandSuzuki -1.117e+04 3.042e+03 -3.671 0.000277 ***
brandToyota 3.858e+02 2.064e+03 0.187 0.851790
brandVolkswagen -5.240e+01 1.891e+03 -0.028 0.977914
brandVolvo 5.513e+03 2.163e+03 2.549 0.011216 *
fuelElectric -1.372e+04 4.500e+03 -3.050 0.002456 **
fuelGPL -1.307e+04 4.696e+03 -2.783 0.005668 **
fuelHybrid -5.573e+03 4.304e+03 -1.295 0.196239
fuelPetrol -8.920e+03 3.934e+03 -2.267 0.023944 *
fuelUnknown -6.130e+03 5.166e+03 -1.187 0.236162
drivetrainFour-wheel Drive -2.085e+04 6.682e+03 -3.121 0.001947 **
drivetrainFront-wheel Drive -2.453e+04 6.635e+03 -3.698 0.000251 ***
drivetrainRear-wheel Drive -2.040e+04 6.623e+03 -3.080 0.002227 **
drivetrainUnknown -9.197e+03 6.472e+03 -1.421 0.156152
damaged1 -1.288e+03 6.722e+02 -1.915 0.056205 .
first_owner1 1.440e+03 6.606e+02 2.180 0.029925 *
navigation_system1 3.347e+03 6.916e+02 4.840 1.92e-06 ***
third_row_seating1 3.103e+03 1.084e+03 2.862 0.004450 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5326 on 368 degrees of freedom
Multiple R-squared: 0.8197, Adjusted R-squared: 0.7996
F-statistic: 40.8 on 41 and 368 DF, p-value: < 2.2e-16
set.seed(123)
cv_results <- train(
  price ~ year + mileage + engine_size + avg_mpg + brand + fuel + drivetrain + damaged + first_owner + navigation_system + third_row_seating,
  data = mydf,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
Warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
print(cv_results)
Linear Regression
410 samples
11 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 367, 370, 368, 369, 369, 370, ...
Resampling results:
RMSE Rsquared MAE
6152.944 0.750616 4533.42
Tuning parameter 'intercept' was held constant at a value of TRUE
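The cross-validated RMSE (6152.944) is higher than the in-sample residual standard error (5326) because each fold is predicted by a model that never saw it. The sketch below reproduces the idea in base R on the built-in `mtcars` data: split the rows into k folds, fit on k - 1 of them, predict the held-out fold, and average the per-fold RMSE. The fold count and the model formula are illustrative assumptions.

```r
set.seed(123)
k <- 5
# Assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- sapply(seq_len(k), function(i) {
  train_set <- mtcars[folds != i, ]
  test_set  <- mtcars[folds == i, ]
  fit  <- lm(mpg ~ wt + hp, data = train_set)
  pred <- predict(fit, newdata = test_set)
  sqrt(mean((test_set$mpg - pred)^2))
})

cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 3))
print(cv_rmse)
```

With many-level factors, a training fold can miss a level entirely, which leaves that coefficient inestimable. That is a likely source of the rank-deficient warning above.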
# Random Forest
model_rf <- randomForest(price ~ year + mileage + engine_size + avg_mpg + brand + fuel + drivetrain + damaged + first_owner + navigation_system + third_row_seating, data = mydf)
print(model_rf)
Call:
randomForest(formula = price ~ year + mileage + engine_size + avg_mpg + brand + fuel + drivetrain + damaged + first_owner + navigation_system + third_row_seating, data = mydf)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 32687277
% Var explained: 76.85
# Feature importance plot for Random Forest
varImpPlot(model_rf)
# brand, year and mileage have markedly higher IncNodePurity values, so they are the most important predictors of car price.
# Reduce the brand categories based on prices
# Note: each car is binned by its own price, so brand_group is derived from
# the response variable (price) rather than from brand
mydf <- mydf %>%
  mutate(brand_group = case_when(
    price <= 25000 ~ "Low_Price",
    price > 25000 & price <= 40000 ~ "Mid_Price",
    price > 40000 ~ "High_Price",
    TRUE ~ "Other"
  ))
mydf$brand_group <- as.factor(mydf$brand_group)
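Because the grouping above uses each car's own price, the resulting factor carries the response into the predictors. A possible alternative, sketched here on made-up data, assigns a tier per brand from the brand's median price, so that all cars of one brand share a tier. The cut-offs 25000 and 40000 are taken from the code above. Using the median per brand is an assumption about the intent, and ideally the medians would be computed on training data only.

```r
set.seed(3)
toy <- data.frame(
  brand = factor(rep(c("A", "B", "C", "D", "E", "F"), each = 5)),
  price = c(rnorm(5, 15000, 1000), rnorm(5, 18000, 1000),
            rnorm(5, 30000, 2000), rnorm(5, 33000, 2000),
            rnorm(5, 55000, 3000), rnorm(5, 60000, 3000))
)

# One median price per brand (not per car)
brand_median <- tapply(toy$price, toy$brand, median)

# Assign each brand to a tier, keeping the brand names for the lookup below
tier_labels <- c("Low_Price", "Mid_Price", "High_Price")
brand_tier <- setNames(
  as.character(cut(brand_median, breaks = c(-Inf, 25000, 40000, Inf),
                   labels = tier_labels)),
  names(brand_median)
)

# Map the tier back to every car through its brand
toy$brand_group <- factor(brand_tier[as.character(toy$brand)], levels = tier_labels)
print(table(toy$brand, toy$brand_group))
```

Each row of the printed table has a single non-zero cell, confirming that the tier is a property of the brand and not of the individual car.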
# Reducing Drivetrain
mydf %>%
  count(drivetrain)
# Group levels in the drivetrain variable
mydf <- mydf %>%
  mutate(drivetrain_group = fct_collapse(drivetrain,
    "Four-wheel Drive" = c("Four-wheel Drive"),
    "Front-wheel Drive" = c("Front-wheel Drive"),
    "Other" = c("Rear-wheel Drive", "Unknown", "2WD")))
# REDUCING FUEL CATEGORIES
mydf %>% count(fuel)
# Group levels in the fuel variable
mydf <- mydf %>%
  mutate(fuel_group = fct_collapse(fuel,
    "Petrol" = c("Petrol"),
    "Other" = c("Hybrid", "Electric", "GPL", "Unknown", "Diesel")))
model_lm <- lm(price ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf)
# Summary of the model
summary(model_lm)
Call:
lm(formula = price ~ year + mileage + engine_size + avg_mpg +
brand_group + automatic_transmission + fuel_group + drivetrain_group +
damaged + first_owner + navigation_system + bluetooth + third_row_seating +
heated_seats, data = mydf)
Residuals:
Min 1Q Median 3Q Max
-9006.7 -2695.2 -112.8 2900.6 10457.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.988e+05 8.332e+04 -3.586 0.000378 ***
year 1.698e+02 4.153e+01 4.089 5.26e-05 ***
mileage -6.211e-02 8.157e-03 -7.613 2.00e-13 ***
engine_size 1.041e+03 2.437e+02 4.271 2.45e-05 ***
avg_mpg -3.738e+01 4.031e+01 -0.927 0.354403
brand_groupLow_Price -2.163e+04 8.130e+02 -26.605 < 2e-16 ***
brand_groupMid_Price -1.121e+04 5.866e+02 -19.112 < 2e-16 ***
automatic_transmission1 -1.362e+03 7.949e+02 -1.713 0.087461 .
fuel_groupPetrol -6.673e+02 7.605e+02 -0.877 0.380764
drivetrain_groupFour-wheel Drive 2.188e+02 5.651e+02 0.387 0.698774
drivetrain_groupFront-wheel Drive -2.983e+03 6.601e+02 -4.518 8.26e-06 ***
damaged1 -2.454e+02 4.843e+02 -0.507 0.612718
first_owner1 1.206e+03 4.524e+02 2.666 0.007999 **
navigation_system1 6.913e+02 4.731e+02 1.461 0.144718
bluetooth1 1.000e+03 6.774e+02 1.476 0.140626
third_row_seating1 1.045e+03 7.500e+02 1.393 0.164497
heated_seats1 6.042e+01 4.361e+02 0.139 0.889877
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3889 on 393 degrees of freedom
Multiple R-squared: 0.8973, Adjusted R-squared: 0.8931
F-statistic: 214.6 on 16 and 393 DF, p-value: < 2.2e-16
step(model_lm)
Start: AIC=6794.72
price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
damaged + first_owner + navigation_system + bluetooth + third_row_seating +
heated_seats
Df Sum of Sq RSS AIC
- heated_seats 1 2.9035e+05 5.9447e+09 6792.7
- damaged 1 3.8819e+06 5.9482e+09 6793.0
- fuel_group 1 1.1646e+07 5.9560e+09 6793.5
- avg_mpg 1 1.3003e+07 5.9574e+09 6793.6
<none> 5.9444e+09 6794.7
- third_row_seating 1 2.9338e+07 5.9737e+09 6794.7
- navigation_system 1 3.2301e+07 5.9767e+09 6794.9
- bluetooth 1 3.2972e+07 5.9773e+09 6795.0
- automatic_transmission 1 4.4395e+07 5.9888e+09 6795.8
- first_owner 1 1.0748e+08 6.0518e+09 6800.1
- year 1 2.5290e+08 6.1973e+09 6809.8
- engine_size 1 2.7589e+08 6.2202e+09 6811.3
- drivetrain_group 2 5.7537e+08 6.5197e+09 6828.6
- mileage 1 8.7676e+08 6.8211e+09 6849.1
- brand_group 2 1.0723e+10 1.6667e+10 7213.4
Step: AIC=6792.74
price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
damaged + first_owner + navigation_system + bluetooth + third_row_seating
Df Sum of Sq RSS AIC
- damaged 1 3.9266e+06 5.9486e+09 6791.0
- fuel_group 1 1.1523e+07 5.9562e+09 6791.5
- avg_mpg 1 1.3078e+07 5.9577e+09 6791.6
<none> 5.9447e+09 6792.7
- third_row_seating 1 2.9839e+07 5.9745e+09 6792.8
- bluetooth 1 3.3377e+07 5.9780e+09 6793.0
- navigation_system 1 3.7735e+07 5.9824e+09 6793.3
- automatic_transmission 1 4.4780e+07 5.9894e+09 6793.8
- first_owner 1 1.0764e+08 6.0523e+09 6798.1
- year 1 2.5420e+08 6.1989e+09 6807.9
- engine_size 1 2.7810e+08 6.2227e+09 6809.5
- drivetrain_group 2 5.7549e+08 6.5201e+09 6826.6
- mileage 1 8.7808e+08 6.8227e+09 6847.2
- brand_group 2 1.0817e+10 1.6761e+10 7213.7
Step: AIC=6791.01
price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
first_owner + navigation_system + bluetooth + third_row_seating
Df Sum of Sq RSS AIC
- fuel_group 1 1.0927e+07 5.9595e+09 6789.8
- avg_mpg 1 1.4589e+07 5.9632e+09 6790.0
<none> 5.9486e+09 6791.0
- third_row_seating 1 3.0324e+07 5.9789e+09 6791.1
- bluetooth 1 3.1706e+07 5.9803e+09 6791.2
- navigation_system 1 3.7459e+07 5.9860e+09 6791.6
- automatic_transmission 1 4.9547e+07 5.9981e+09 6792.4
- first_owner 1 1.0800e+08 6.0566e+09 6796.4
- year 1 2.5835e+08 6.2069e+09 6806.4
- engine_size 1 2.7781e+08 6.2264e+09 6807.7
- drivetrain_group 2 5.9352e+08 6.5421e+09 6826.0
- mileage 1 9.2023e+08 6.8688e+09 6848.0
- brand_group 2 1.0817e+10 1.6766e+10 7211.8
Step: AIC=6789.76
price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + drivetrain_group + first_owner +
navigation_system + bluetooth + third_row_seating
Df Sum of Sq RSS AIC
- avg_mpg 1 1.1488e+07 5.9710e+09 6788.5
- third_row_seating 1 2.7843e+07 5.9873e+09 6789.7
<none> 5.9595e+09 6789.8
- bluetooth 1 3.1178e+07 5.9907e+09 6789.9
- navigation_system 1 4.2671e+07 6.0022e+09 6790.7
- automatic_transmission 1 4.5640e+07 6.0051e+09 6790.9
- first_owner 1 1.0695e+08 6.0664e+09 6795.1
- year 1 2.5027e+08 6.2098e+09 6804.6
- engine_size 1 2.7770e+08 6.2372e+09 6806.4
- drivetrain_group 2 5.9063e+08 6.5501e+09 6824.5
- mileage 1 9.1163e+08 6.8711e+09 6846.1
- brand_group 2 1.0873e+10 1.6832e+10 7211.5
Step: AIC=6788.55
price ~ year + mileage + engine_size + brand_group + automatic_transmission +
drivetrain_group + first_owner + navigation_system + bluetooth +
third_row_seating
Df Sum of Sq RSS AIC
- third_row_seating 1 2.5017e+07 5.9960e+09 6788.3
- bluetooth 1 2.8965e+07 6.0000e+09 6788.5
<none> 5.9710e+09 6788.5
- automatic_transmission 1 4.4727e+07 6.0157e+09 6789.6
- navigation_system 1 4.6208e+07 6.0172e+09 6789.7
- first_owner 1 1.0500e+08 6.0760e+09 6793.7
- year 1 2.4765e+08 6.2186e+09 6803.2
- engine_size 1 3.4155e+08 6.3125e+09 6809.4
- drivetrain_group 2 6.5420e+08 6.6252e+09 6827.2
- mileage 1 9.1422e+08 6.8852e+09 6845.0
- brand_group 2 1.0995e+10 1.6966e+10 7212.7
Step: AIC=6788.26
price ~ year + mileage + engine_size + brand_group + automatic_transmission +
drivetrain_group + first_owner + navigation_system + bluetooth
Df Sum of Sq RSS AIC
- bluetooth 1 2.8307e+07 6.0243e+09 6788.2
<none> 5.9960e+09 6788.3
- automatic_transmission 1 4.6565e+07 6.0426e+09 6789.4
- navigation_system 1 5.5087e+07 6.0511e+09 6790.0
- first_owner 1 1.2402e+08 6.1200e+09 6794.7
- year 1 2.5292e+08 6.2489e+09 6803.2
- engine_size 1 3.9982e+08 6.3958e+09 6812.7
- drivetrain_group 2 6.3975e+08 6.6358e+09 6825.8
- mileage 1 8.8967e+08 6.8857e+09 6843.0
- brand_group 2 1.1220e+10 1.7216e+10 7216.7
Step: AIC=6788.19
price ~ year + mileage + engine_size + brand_group + automatic_transmission +
drivetrain_group + first_owner + navigation_system
Df Sum of Sq RSS AIC
<none> 6.0243e+09 6788.2
- automatic_transmission 1 5.3777e+07 6.0781e+09 6789.8
- navigation_system 1 7.8374e+07 6.1027e+09 6791.5
- first_owner 1 1.1609e+08 6.1404e+09 6794.0
- year 1 3.7622e+08 6.4005e+09 6811.0
- engine_size 1 3.9799e+08 6.4223e+09 6812.4
- drivetrain_group 2 6.3950e+08 6.6638e+09 6825.6
- mileage 1 8.9146e+08 6.9158e+09 6842.8
- brand_group 2 1.1196e+10 1.7221e+10 7214.8
Call:
lm(formula = price ~ year + mileage + engine_size + brand_group +
automatic_transmission + drivetrain_group + first_owner +
navigation_system, data = mydf)
Coefficients:
(Intercept) year
-3.403e+05 1.899e+02
mileage engine_size
-6.106e-02 1.158e+03
brand_groupLow_Price brand_groupMid_Price
-2.176e+04 -1.123e+04
automatic_transmission1 drivetrain_groupFour-wheel Drive
-1.469e+03 4.823e+02
drivetrain_groupFront-wheel Drive first_owner1
-2.831e+03 1.235e+03
navigation_system1
9.832e+02
plot(model_lm)
model2 <-lm(formula = price ~ year + mileage + engine_size + brand_group +
automatic_transmission + drivetrain_group + first_owner +
navigation_system, data = mydf)
summary(model2)
Call:
lm(formula = price ~ year + mileage + engine_size + brand_group +
automatic_transmission + drivetrain_group + first_owner +
navigation_system, data = mydf)
Residuals:
Min 1Q Median 3Q Max
-9204.9 -2870.2 -75.6 2902.2 10410.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.403e+05 7.658e+04 -4.444 1.14e-05 ***
year 1.899e+02 3.804e+01 4.992 8.96e-07 ***
mileage -6.106e-02 7.946e-03 -7.684 1.21e-13 ***
engine_size 1.158e+03 2.255e+02 5.134 4.44e-07 ***
brand_groupLow_Price -2.176e+04 7.995e+02 -27.214 < 2e-16 ***
brand_groupMid_Price -1.123e+04 5.762e+02 -19.481 < 2e-16 ***
automatic_transmission1 -1.469e+03 7.785e+02 -1.887 0.05985 .
drivetrain_groupFour-wheel Drive 4.823e+02 5.509e+02 0.876 0.38176
drivetrain_groupFront-wheel Drive -2.831e+03 6.442e+02 -4.395 1.42e-05 ***
first_owner1 1.235e+03 4.452e+02 2.773 0.00582 **
navigation_system1 9.832e+02 4.315e+02 2.278 0.02323 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3886 on 399 degrees of freedom
Multiple R-squared: 0.8959, Adjusted R-squared: 0.8933
F-statistic: 343.5 on 10 and 399 DF, p-value: < 2.2e-16
# Cross-validation with caret package (example with 10-fold cross-validation)
set.seed(123)
cv_results <- train(
  price ~ year + mileage + engine_size + brand_group + automatic_transmission + drivetrain_group + first_owner + navigation_system,
  data = mydf,
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
print(cv_results)
Linear Regression
410 samples
8 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 367, 370, 368, 369, 369, 370, ...
Resampling results:
RMSE Rsquared MAE
3961.563 0.8934761 3277.534
Tuning parameter 'intercept' was held constant at a value of TRUE
# Random Forest
model_rf <- randomForest(price ~ year + mileage + engine_size + brand_group + automatic_transmission + drivetrain_group + first_owner + navigation_system, data = mydf)
print(model_rf)
Call:
randomForest(formula = price ~ year + mileage + engine_size + brand_group + automatic_transmission + drivetrain_group + first_owner + navigation_system, data = mydf)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 2
Mean of squared residuals: 14304941
% Var explained: 89.87
# Feature importance plot for Random Forest
varImpPlot(model_rf)
# brand_group, year and mileage have markedly higher IncNodePurity values, so they are the most important predictors of car price.
Model 2 is better than Model 1 so far.
Model 1, obtained by stepwise selection from the maximal model, showed no heteroscedasticity but had a long equation (11 predictors, 41 coefficients) and moderate fit statistics (RSE = 5326, R-squared = 0.82, F-statistic = 40.8).
Model 2 reduced the levels of categorical predictors with more than two levels, resulting in fewer predictors and better fit statistics (RSE = 3886, R-squared = 0.90, F-statistic = 343.5). Because brand_group is built from price itself, part of this improvement comes from the response being encoded in a predictor.
Under 10-fold cross-validation, Model 2 (8 predictors) also outperforms Model 1 (11 predictors): lower RMSE (3961.563 vs. 6152.944), higher R-squared (0.893 vs. 0.751), and lower MAE (3277.534 vs. 4533.42).
Model 1 shows that newer cars have higher prices, higher mileage reduces the price, a larger engine_size increases the price, and first-owner cars are priced higher.
Potential Weakness involves
- Alternative Approach:
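Apply a log transformation to price to reduce heteroscedasticity. The resulting RSE is the lowest, but it is measured on the log scale, so it is not directly comparable with the RSE of Models 1 and 2.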
Keep the categorical variables with more than two levels without reduction. Apply log transformation to the price to reduce heteroscedasticity, similar to Model 3.
Model Evaluation: Use cross-validation to evaluate the model’s predictive power. Compare the RSE, R-squared, and F-statistic values with those of Model 3.
Feature Importance Analysis: Conduct a feature importance analysis to understand which predictors are most important in the new model.
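Heteroscedasticity can be checked numerically as well as from the residual plots. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: regress the squared residuals on the model's predictors and compare n * R² with a chi-square distribution. It runs on the built-in `mtcars` data with an illustrative formula, not on the car price models above. A small p-value would indicate that the residual variance depends on the predictors.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, sq_resid = residuals(fit)^2)
aux <- lm(sq_resid ~ wt + hp, data = aux_data)

n       <- nobs(fit)
k       <- length(coef(aux)) - 1            # number of predictors, intercept excluded
lm_stat <- n * summary(aux)$r.squared       # test statistic
p_value <- pchisq(lm_stat, df = k, lower.tail = FALSE)

print(c(statistic = lm_stat, df = k, p_value = p_value))
```

Running the same check before and after the log transformation gives a direct measure of whether the transformation helped.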
# Reducing Heteroscedasticity
# Apply transformations to response or predictor variables
mydf$log_Price <- log(mydf$price) # Log transformation on response variable
# Re-fit the model with transformed variables
model_log <- lm(log_Price ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf)
summary(model_log)
Call:
lm(formula = log_Price ~ year + mileage + engine_size + avg_mpg +
brand_group + automatic_transmission + fuel_group + drivetrain_group +
damaged + first_owner + navigation_system + bluetooth + third_row_seating +
heated_seats, data = mydf)
Residuals:
Min 1Q Median 3Q Max
-0.64366 -0.10278 0.01378 0.12192 0.57919
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.622e+00 4.196e+00 -2.055 0.0406 *
year 9.523e-03 2.092e-03 4.553 7.07e-06 ***
mileage -4.139e-06 4.109e-07 -10.073 < 2e-16 ***
engine_size 5.717e-02 1.227e-02 4.658 4.38e-06 ***
avg_mpg -9.803e-04 2.030e-03 -0.483 0.6295
brand_groupLow_Price -6.221e-01 4.095e-02 -15.192 < 2e-16 ***
brand_groupMid_Price -2.191e-01 2.954e-02 -7.416 7.49e-13 ***
automatic_transmission1 -3.656e-02 4.004e-02 -0.913 0.3618
fuel_groupPetrol -3.164e-02 3.830e-02 -0.826 0.4093
drivetrain_groupFour-wheel Drive -2.644e-02 2.846e-02 -0.929 0.3535
drivetrain_groupFront-wheel Drive -2.014e-01 3.325e-02 -6.058 3.23e-09 ***
damaged1 3.720e-04 2.439e-02 0.015 0.9878
first_owner1 4.402e-02 2.278e-02 1.932 0.0541 .
navigation_system1 1.856e-02 2.383e-02 0.779 0.4364
bluetooth1 8.701e-02 3.412e-02 2.550 0.0111 *
third_row_seating1 6.592e-02 3.778e-02 1.745 0.0817 .
heated_seats1 8.844e-03 2.196e-02 0.403 0.6874
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1959 on 393 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.84
F-statistic: 135.2 on 16 and 393 DF, p-value: < 2.2e-16
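The residual standard error of 0.1959 is in log units, so it cannot be set side by side with the earlier values in price units. Two steps make the log model comparable and interpretable: back-transform the fitted values with `exp()` before computing an error on the original scale, and read each coefficient b as a percentage change of 100 * (exp(b) - 1). For example, the `year` coefficient of 9.523e-03 corresponds to roughly 0.96% per year. The sketch below illustrates both steps on the built-in `mtcars` data. Note that `exp()` of a log-scale prediction estimates the median rather than the mean of the response.

```r
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Both errors are expressed in the original units of the response
rmse_raw <- rmse(mtcars$mpg, fitted(fit_raw))
rmse_log <- rmse(mtcars$mpg, exp(fitted(fit_log)))
print(c(raw = rmse_raw, log_backtransformed = rmse_log))

# Percentage change in the response per one-unit increase in each predictor
pct_change <- 100 * (exp(coef(fit_log)[-1]) - 1)
print(round(pct_change, 2))
```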
step(model_log)
Start: AIC=-1320.16
log_Price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
damaged + first_owner + navigation_system + bluetooth + third_row_seating +
heated_seats
Df Sum of Sq RSS AIC
- damaged 1 0.0000 15.079 -1322.2
- heated_seats 1 0.0062 15.086 -1322.0
- avg_mpg 1 0.0089 15.088 -1321.9
- navigation_system 1 0.0233 15.103 -1321.5
- fuel_group 1 0.0262 15.106 -1321.5
- automatic_transmission 1 0.0320 15.111 -1321.3
<none> 15.079 -1320.2
- third_row_seating 1 0.1169 15.196 -1319.0
- first_owner 1 0.1432 15.223 -1318.3
- bluetooth 1 0.2496 15.329 -1315.4
- year 1 0.7953 15.875 -1301.1
- engine_size 1 0.8324 15.912 -1300.1
- drivetrain_group 2 1.9498 17.029 -1274.3
- mileage 1 3.8930 18.973 -1228.0
- brand_group 2 9.5439 24.623 -1123.1
Step: AIC=-1322.16
log_Price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
first_owner + navigation_system + bluetooth + third_row_seating +
heated_seats
Df Sum of Sq RSS AIC
- heated_seats 1 0.0062 15.086 -1324.0
- avg_mpg 1 0.0090 15.088 -1323.9
- navigation_system 1 0.0233 15.103 -1323.5
- fuel_group 1 0.0263 15.106 -1323.4
- automatic_transmission 1 0.0325 15.112 -1323.3
<none> 15.079 -1322.2
- third_row_seating 1 0.1169 15.196 -1321.0
- first_owner 1 0.1432 15.223 -1320.3
- bluetooth 1 0.2516 15.331 -1317.4
- year 1 0.7972 15.877 -1303.0
- engine_size 1 0.8324 15.912 -1302.1
- drivetrain_group 2 1.9735 17.053 -1275.7
- mileage 1 3.9917 19.071 -1227.9
- brand_group 2 9.5538 24.633 -1125.0
Step: AIC=-1323.99
log_Price ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
first_owner + navigation_system + bluetooth + third_row_seating
Df Sum of Sq RSS AIC
- avg_mpg 1 0.0093 15.095 -1325.7
- fuel_group 1 0.0253 15.111 -1325.3
- automatic_transmission 1 0.0338 15.120 -1325.1
- navigation_system 1 0.0345 15.120 -1325.0
<none> 15.086 -1324.0
- third_row_seating 1 0.1209 15.207 -1322.7
- first_owner 1 0.1440 15.230 -1322.1
- bluetooth 1 0.2562 15.342 -1319.1
- year 1 0.8055 15.891 -1304.7
- engine_size 1 0.8276 15.913 -1304.1
- drivetrain_group 2 1.9760 17.062 -1277.5
- mileage 1 4.0038 19.089 -1229.5
- brand_group 2 9.7107 24.796 -1124.2
Step: AIC=-1325.74
log_Price ~ year + mileage + engine_size + brand_group + automatic_transmission +
fuel_group + drivetrain_group + first_owner + navigation_system +
bluetooth + third_row_seating
Df Sum of Sq RSS AIC
- fuel_group 1 0.0216 15.117 -1327.2
- automatic_transmission 1 0.0327 15.128 -1326.8
- navigation_system 1 0.0380 15.133 -1326.7
<none> 15.095 -1325.7
- third_row_seating 1 0.1156 15.210 -1324.6
- first_owner 1 0.1419 15.237 -1323.9
- bluetooth 1 0.2510 15.346 -1321.0
- year 1 0.7999 15.895 -1306.6
- engine_size 1 0.9610 16.056 -1302.4
- drivetrain_group 2 2.0939 17.189 -1276.5
- mileage 1 4.0052 19.100 -1231.2
- brand_group 2 9.7842 24.879 -1124.9
Step: AIC=-1327.15
log_Price ~ year + mileage + engine_size + brand_group + automatic_transmission +
drivetrain_group + first_owner + navigation_system + bluetooth +
third_row_seating
Df Sum of Sq RSS AIC
- automatic_transmission 1 0.0283 15.145 -1328.4
- navigation_system 1 0.0444 15.161 -1328.0
<none> 15.117 -1327.2
- third_row_seating 1 0.1100 15.226 -1326.2
- first_owner 1 0.1407 15.257 -1325.3
- bluetooth 1 0.2502 15.367 -1322.4
- year 1 0.7820 15.899 -1308.5
- engine_size 1 0.9502 16.067 -1304.2
- drivetrain_group 2 2.0775 17.194 -1278.3
- mileage 1 3.9843 19.101 -1233.2
- brand_group 2 9.8263 24.943 -1125.8
Step: AIC=-1328.38
log_Price ~ year + mileage + engine_size + brand_group + drivetrain_group +
first_owner + navigation_system + bluetooth + third_row_seating
Df Sum of Sq RSS AIC
- navigation_system 1 0.0455 15.190 -1329.2
<none> 15.145 -1328.4
- third_row_seating 1 0.1130 15.258 -1327.3
- first_owner 1 0.1294 15.274 -1326.9
- bluetooth 1 0.2679 15.413 -1323.2
- year 1 0.7624 15.907 -1310.2
- engine_size 1 0.9249 16.070 -1306.1
- drivetrain_group 2 2.0492 17.194 -1280.3
- mileage 1 4.1082 19.253 -1232.0
- brand_group 2 9.8053 24.950 -1127.7
Step: AIC=-1329.15
log_Price ~ year + mileage + engine_size + brand_group + drivetrain_group +
first_owner + bluetooth + third_row_seating
Df Sum of Sq RSS AIC
<none> 15.190 -1329.2
- first_owner 1 0.1256 15.316 -1327.8
- third_row_seating 1 0.1320 15.322 -1327.6
- bluetooth 1 0.3382 15.529 -1322.1
- year 1 0.7574 15.948 -1311.2
- engine_size 1 0.9746 16.165 -1305.7
- drivetrain_group 2 2.1932 17.384 -1277.9
- mileage 1 4.0643 19.255 -1233.9
- brand_group 2 10.4955 25.686 -1117.8
Call:
lm(formula = log_Price ~ year + mileage + engine_size + brand_group +
drivetrain_group + first_owner + bluetooth + third_row_seating,
data = mydf)
Coefficients:
(Intercept) year
-7.275e+00 8.819e-03
mileage engine_size
-4.125e-06 5.796e-02
brand_groupLow_Price brand_groupMid_Price
-6.348e-01 -2.261e-01
drivetrain_groupFour-wheel Drive drivetrain_groupFront-wheel Drive
-3.085e-02 -2.073e-01
first_owner1 bluetooth1
4.096e-02 9.754e-02
third_row_seating1
6.896e-02
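The AIC values printed in the stepwise trace above can be reproduced by hand. This is a minimal sketch, assuming step() relies on extractAIC(), which for an lm fit computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The helper name step_aic and the toy data are hypothetical and only illustrate the identity.

```r
# AIC as reported by step() for an lm fit: n * log(RSS / n) + 2 * k
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

# Final model in the trace above: RSS = 15.190, n = 410 rows,
# k = 11 coefficients (intercept + 10 slopes)
aic_final <- step_aic(rss = 15.190, n = 410, k = 11)
print(round(aic_final, 1))

# Sanity check on synthetic data (hypothetical, not the car data)
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = toy)
aic_manual <- step_aic(sum(resid(fit)^2), nrow(toy), length(coef(fit)))
aic_extract <- extractAIC(fit)[2]
```

Because the penalty is 2 per coefficient, dropping a term lowers the AIC only when the increase in n * log(RSS / n) is smaller than twice the number of coefficients removed.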
plot(model_log)
Performing cross-validation (e.g. RMSE, MAE) and Random Forest
# Cross-validation with caret package (example with 10-fold cross-validation)
set.seed(123)
cv_results <- train(
log_Price ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats,
data = mydf,
method = "lm",
trControl = trainControl(method = "cv", number = 10)
)
print(cv_results)
Linear Regression
410 samples
14 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 367, 370, 368, 368, 370, 369, ...
Resampling results:
RMSE Rsquared MAE
0.2001141 0.8338384 0.1522402
Tuning parameter 'intercept' was held constant at a value of TRUE
# RMSE (Root Mean Squared Error): the typical prediction error is 0.20 on the log(price) scale, not in price units
# R-squared: the model explains about 83.4% of the variance in log(price) across the held-out folds
# MAE (Mean Absolute Error): the mean absolute difference between predicted and actual log(price) is 0.152
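Since the response is log_Price, the cross-validated RMSE is an error in log units. A rough way to read it on the price scale, sketched below, is to exponentiate it into a multiplicative factor. The price values in the second half are hypothetical and only illustrate that the factor is the same at every price level.

```r
# RMSE reported by caret on the log(price) scale (value copied from the output above)
rmse_log <- 0.2001141

# exp() turns an additive error in log(price) into a multiplicative error in price
error_factor <- exp(rmse_log)
print(round(error_factor, 2))

# Synthetic illustration (hypothetical prices): a prediction that is off by
# rmse_log on the log scale is off by the same factor at every price level
actual <- c(10000, 25000, 50000)
predicted <- exp(log(actual) + rmse_log)
ratio <- predicted / actual
```

Read this way, a typical prediction is off by a factor of about 1.22 in either direction, which is easier to compare with a price-scale RMSE than the raw 0.20.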
# Random Forest
model_rf <- randomForest(log_Price ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf)
print(model_rf)
Call:
randomForest(formula = log_Price ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.02762614
% Var explained: 88.46
# Feature importance plot for Random Forest
varImpPlot(model_rf)
# brand_group, year and mileage have clearly higher IncNodePurity values, so they are the most important predictors of car price.
Summary of Model 3:
mod <- lm(formula = log_Price ~ year + mileage + engine_size + brand_group +
drivetrain_group + first_owner + bluetooth + third_row_seating,
data = mydf)
summary(mod)
Call:
lm(formula = log_Price ~ year + mileage + engine_size + brand_group +
drivetrain_group + first_owner + bluetooth + third_row_seating,
data = mydf)
Residuals:
Min 1Q Median 3Q Max
-0.62984 -0.10416 0.01315 0.12074 0.57856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.275e+00 3.981e+00 -1.828 0.06834 .
year 8.819e-03 1.977e-03 4.460 1.07e-05 ***
mileage -4.125e-06 3.993e-07 -10.332 < 2e-16 ***
engine_size 5.796e-02 1.146e-02 5.060 6.43e-07 ***
brand_groupLow_Price -6.348e-01 3.954e-02 -16.053 < 2e-16 ***
brand_groupMid_Price -2.261e-01 2.885e-02 -7.838 4.20e-14 ***
drivetrain_groupFour-wheel Drive -3.085e-02 2.777e-02 -1.111 0.26729
drivetrain_groupFront-wheel Drive -2.073e-01 3.219e-02 -6.442 3.40e-10 ***
first_owner1 4.096e-02 2.255e-02 1.816 0.07009 .
bluetooth1 9.754e-02 3.273e-02 2.981 0.00305 **
third_row_seating1 6.896e-02 3.703e-02 1.862 0.06335 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1951 on 399 degrees of freedom
Multiple R-squared: 0.8452, Adjusted R-squared: 0.8413
F-statistic: 217.8 on 10 and 399 DF, p-value: < 2.2e-16
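The coefficients in the summary above can be turned into a worked prediction. The sketch below uses the printed (rounded) estimates and a hypothetical car that is not taken from the data; the vector names are illustrative. Exponentiating the fitted log price gives an approximate price and ignores any retransformation bias.

```r
# Coefficients copied from the summary above (rounded as printed)
b <- c(intercept = -7.275, year = 8.819e-03, mileage = -4.125e-06,
       engine_size = 5.796e-02, brand_mid = -2.261e-01,
       drive_front = -2.073e-01, first_owner = 4.096e-02,
       bluetooth = 9.754e-02)

# Hypothetical car: year 2020, mileage 40,000, engine size 2.0, Mid_Price brand,
# front-wheel drive, first owner, bluetooth, no third-row seating
x <- c(intercept = 1, year = 2020, mileage = 40000, engine_size = 2.0,
       brand_mid = 1, drive_front = 1, first_owner = 1, bluetooth = 1)

log_price_hat <- sum(b * x[names(b)])  # fitted value on the log scale
price_hat <- exp(log_price_hat)        # back-transform to the price scale
print(round(log_price_hat, 3))
print(round(price_hat))
```

Dummy variables for the reference levels (High_Price brand, "Other" drivetrain) are simply left out, which is why only the Mid_Price and front-wheel-drive terms appear.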
Model Equation is:
\[ \text{log_Price} = -7.275 + 0.00882 \times \text{year} - 4.125 \times 10^{-6} \times \text{mileage} + 0.05796 \times \text{engine_size} - 0.6348 \times \text{brand_groupLow_Price} - 0.2261 \times \text{brand_groupMid_Price} - 0.03085 \times \text{drivetrain_groupFour-wheel Drive} - 0.2073 \times \text{drivetrain_groupFront-wheel Drive} + 0.041 \times \text{first_owner1} + 0.09754 \times \text{bluetooth1} + 0.06896 \times \text{third_row_seating1} \]
Regression Tree for modelling interactions:
mod.tree <- tree(log_Price ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + first_owner + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf)
plot(mod.tree)
text(mod.tree)
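Model 3 applies a log transformation to price to reduce heteroscedasticity. It has the lowest RSE (0.195) and the highest F-statistic (217.8), although its R-squared is lower than that of model 2. Its RSE is measured in log(price) units, so it is not directly comparable with the RSE of a model fitted to raw price. Model 3 was chosen as the best model.
Ten-fold cross-validation of the full 14-predictor log-price model gave an R-squared of 0.834, which supports the predictive power of this specification.
Hence, between model 2 and model 3, model 3 is preferred because its residuals show less heteroscedasticity.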
Exploration & Initial Model: Explore first_owner against the numerical and categorical variables, then fit an initial glm (logistic regression) with all relevant predictors of whether a car is sold by its first owner.
Stepwise Selection: Use step() to iteratively select predictors.
Visualizing Improvement: Create plots for the stepwise selection process.
Heteroscedasticity Check: Inspect diagnostic plots, for example Residuals vs. Fitted Values.
Heteroscedasticity Reduction: If identified, compare grouped and ungrouped levels of the categorical variables.
Model Evaluation: Cross-validate with the essential metrics (Accuracy, Kappa). Use Random Forest or a Regression Tree to identify influential predictors. Evaluate the model plots for assessment.
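The residual check listed in the plan above is visual; for a linear model such as the log-price regression it can be backed with a number. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R. It is an illustration on synthetic data, not the analysis run in this report, and all object names are hypothetical.

```r
# Synthetic data (hypothetical): the noise grows with x, so the errors are heteroscedastic
set.seed(42)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with 1 df (one predictor)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value points to non-constant variance; plot(fit, which = 1) gives the matching Residuals vs. Fitted picture.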
# Distribution of the target variable
summary(mydf)
brand year mileage engine_size automatic_transmission fuel
Hyundai : 29 Min. :1953 Min. : 0 Min. :1.20 0: 33 Diesel : 2
Honda : 28 1st Qu.:2015 1st Qu.: 21328 1st Qu.:2.00 1:377 Electric: 7
FIAT : 24 Median :2019 Median : 43286 Median :2.40 GPL : 5
MINI : 24 Mean :2017 Mean : 48382 Mean :2.65 Hybrid : 13
Audi : 23 3rd Qu.:2021 3rd Qu.: 66593 3rd Qu.:3.30 Petrol :380
Cadillac: 21 Max. :2023 Max. :190312 Max. :6.40 Unknown : 3
(Other) :261
drivetrain damaged first_owner navigation_system bluetooth third_row_seating heated_seats
2WD : 1 0:315 0:200 0:228 0: 53 0:374 0:232
Four-wheel Drive :199 1: 95 1:210 1:182 1:357 1: 36 1:178
Front-wheel Drive:130
Rear-wheel Drive : 77
Unknown : 3
price avg_mpg brand_group drivetrain_group fuel_group
Min. : 5850 Min. : 0.00 High_Price: 77 Other : 81 Other : 30
1st Qu.:18991 1st Qu.:21.50 Low_Price :168 Four-wheel Drive :199 Petrol:380
Median :28512 Median :24.50 Mid_Price :165 Front-wheel Drive:130
Mean :28399 Mean :24.72
3rd Qu.:37564 3rd Qu.:27.00
Max. :54995 Max. :52.00
log_Price
Min. : 8.674
1st Qu.: 9.852
Median :10.258
Mean :10.148
3rd Qu.:10.534
Max. :10.915
str(mydf)
'data.frame': 410 obs. of 19 variables:
$ brand : Factor w/ 25 levels "Alfa","Audi",..: 2 12 7 11 1 20 4 4 9 1 ...
$ year : num 2022 2021 2021 2022 2019 ...
$ mileage : num 7232 60942 45701 2963 47587 ...
$ engine_size : num 2 1.6 3 3.6 2 5.6 2 3.6 1.6 2 ...
$ automatic_transmission: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ fuel : Factor w/ 6 levels "Diesel","Electric",..: 5 5 5 5 5 5 5 5 5 5 ...
$ drivetrain : Factor w/ 5 levels "2WD","Four-wheel Drive",..: 2 3 2 2 2 4 3 3 3 2 ...
$ damaged : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
$ first_owner : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
$ navigation_system : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 2 1 ...
$ bluetooth : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 2 ...
$ third_row_seating : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
$ heated_seats : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 1 ...
$ price : num 37500 15990 46290 44290 28990 ...
$ avg_mpg : num 32 32 21 25 25.5 15.5 7 21 23.5 28.5 ...
$ brand_group : Factor w/ 3 levels "High_Price","Low_Price",..: 3 2 1 1 3 2 3 3 2 3 ...
$ drivetrain_group : Factor w/ 3 levels "Other","Four-wheel Drive",..: 2 3 2 2 2 1 3 3 3 2 ...
$ fuel_group : Factor w/ 2 levels "Other","Petrol": 2 2 2 2 2 2 2 2 2 2 ...
$ log_Price : num 10.53 9.68 10.74 10.7 10.27 ...
table(mydf$first_owner)
0 1
200 210
numerical_cols <- c("year", "mileage", "engine_size", "avg_mpg", "price")
categorical_cols <- c("brand_group", "automatic_transmission", "fuel_group", "drivetrain_group", "damaged",
"navigation_system", "bluetooth", "third_row_seating", "heated_seats")
for(col in numerical_cols){
anova_result <- aov(mydf[[col]] ~ first_owner, data = mydf)
cat("ANOVA between 'first_owner' and '", col, "':\n")
print(summary(anova_result))
}
ANOVA between 'first_owner' and ' year ':
Df Sum Sq Mean Sq F value Pr(>F)
first_owner 1 1705 1705.0 47.2 2.41e-11 ***
Residuals 408 14740 36.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'first_owner' and ' mileage ':
Df Sum Sq Mean Sq F value Pr(>F)
first_owner 1 9.479e+10 9.479e+10 101.7 <2e-16 ***
Residuals 408 3.802e+11 9.319e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
ANOVA between 'first_owner' and ' engine_size ':
Df Sum Sq Mean Sq F value Pr(>F)
first_owner 1 0.2 0.2391 0.222 0.638
Residuals 408 440.1 1.0786
ANOVA between 'first_owner' and ' avg_mpg ':
Df Sum Sq Mean Sq F value Pr(>F)
first_owner 1 0 0.43 0.013 0.908
Residuals 408 12986 31.83
ANOVA between 'first_owner' and ' price ':
Df Sum Sq Mean Sq F value Pr(>F)
first_owner 1 8.852e+09 8.852e+09 73.66 <2e-16 ***
Residuals 408 4.903e+10 1.202e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Mean avg_mpg does not differ significantly between first_owner groups (p = 0.908)
# Mean engine_size does not differ significantly between first_owner groups (p = 0.638)
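The categorical_cols vector defined above is not used by the ANOVA loop. A chi-squared test of independence is one common way to screen factors against first_owner; the sketch below shows the pattern on a synthetic stand-in for mydf (hypothetical values, only two of the columns).

```r
# Synthetic stand-in for mydf (hypothetical values, same column types)
set.seed(7)
toy <- data.frame(
  first_owner = factor(rbinom(300, 1, 0.5)),
  bluetooth   = factor(rbinom(300, 1, 0.8)),
  damaged     = factor(rbinom(300, 1, 0.25))
)
cat_cols <- c("bluetooth", "damaged")

# Chi-squared test of independence between each factor and first_owner
p_values <- sapply(cat_cols, function(col) {
  chisq.test(table(toy[[col]], toy$first_owner))$p.value
})
print(round(p_values, 3))
```

On the real data the same loop would run over categorical_cols with mydf in place of toy; factors with very sparse levels may trigger an approximation warning from chisq.test().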
# Fit logistic regression model
model_first_owner <- glm(first_owner ~ year + mileage + engine_size + avg_mpg + brand + automatic_transmission + fuel + drivetrain + damaged + price + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf, family = binomial)
# Summary of the model
summary(model_first_owner)
Call:
glm(formula = first_owner ~ year + mileage + engine_size + avg_mpg +
brand + automatic_transmission + fuel + drivetrain + damaged +
price + navigation_system + bluetooth + third_row_seating +
heated_seats, family = binomial, data = mydf)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.883e+02 2.914e+03 -0.065 0.94847
year 9.588e-02 5.065e-02 1.893 0.05835 .
mileage -2.265e-05 7.040e-06 -3.218 0.00129 **
engine_size -1.309e-01 2.033e-01 -0.644 0.51963
avg_mpg -6.941e-03 2.936e-02 -0.236 0.81309
brandAudi -5.486e-01 7.945e-01 -0.690 0.48988
brandBMW 4.469e-01 1.073e+00 0.416 0.67709
brandCadillac -2.123e-01 8.733e-01 -0.243 0.80794
brandChevrolet 3.401e-01 1.010e+00 0.337 0.73622
brandFIAT -1.178e+00 9.569e-01 -1.231 0.21828
brandFord -2.246e-02 1.074e+00 -0.021 0.98332
brandHonda -7.984e-02 8.017e-01 -0.100 0.92067
brandHyundai 7.630e-01 8.294e-01 0.920 0.35758
brandJaguar -1.185e+00 8.866e-01 -1.336 0.18150
brandJeep 4.735e-01 1.052e+00 0.450 0.65272
brandKia 4.786e-01 1.042e+00 0.459 0.64591
brandLand -2.102e+00 1.114e+00 -1.886 0.05924 .
brandLexus -8.071e-01 9.207e-01 -0.877 0.38065
brandMaserati -1.542e+00 8.834e-01 -1.746 0.08085 .
brandMazda 4.796e-01 8.670e-01 0.553 0.58014
brandMercedes-Benz -1.812e+00 1.163e+00 -1.557 0.11937
brandMINI -6.359e-01 8.510e-01 -0.747 0.45488
brandMitsubishi 1.776e+00 9.446e-01 1.880 0.06008 .
brandNissan 6.839e-02 8.932e-01 0.077 0.93897
brandPorsche -1.667e+00 1.287e+00 -1.295 0.19521
brandSuzuki 7.400e-02 1.482e+00 0.050 0.96017
brandToyota 5.979e-01 1.020e+00 0.586 0.55776
brandVolkswagen 6.254e-01 9.886e-01 0.633 0.52696
brandVolvo 6.436e-01 1.243e+00 0.518 0.60445
automatic_transmission1 1.487e+00 6.316e-01 2.355 0.01853 *
fuelElectric 1.472e+01 1.650e+03 0.009 0.99288
fuelGPL 7.998e-01 1.891e+03 0.000 0.99966
fuelHybrid 1.583e+01 1.650e+03 0.010 0.99234
fuelPetrol 1.555e+01 1.650e+03 0.009 0.99248
fuelUnknown 1.760e+01 1.650e+03 0.011 0.99149
drivetrainFour-wheel Drive -2.125e+01 2.400e+03 -0.009 0.99293
drivetrainFront-wheel Drive -2.083e+01 2.400e+03 -0.009 0.99307
drivetrainRear-wheel Drive -2.110e+01 2.400e+03 -0.009 0.99298
drivetrainUnknown -1.811e+01 2.400e+03 -0.008 0.99398
damaged1 3.135e-02 3.262e-01 0.096 0.92344
price 5.454e-05 2.654e-05 2.055 0.03988 *
navigation_system1 1.536e-01 3.752e-01 0.409 0.68219
bluetooth1 -9.633e-01 5.918e-01 -1.628 0.10355
third_row_seating1 7.971e-01 5.824e-01 1.369 0.17113
heated_seats1 -1.067e-02 3.093e-01 -0.035 0.97248
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 568.14 on 409 degrees of freedom
Residual deviance: 373.07 on 365 degrees of freedom
AIC: 463.07
Number of Fisher Scoring iterations: 15
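Standard errors in the thousands for the fuel and drivetrain terms, together with 15 Fisher scoring iterations, are a common symptom of sparse factor levels or (quasi-)separation; summary(mydf) shows, for example, a single 2WD car and two Diesel cars. A quick check is to cross-tabulate each factor against the outcome and look for empty cells. The sketch below does this on a small synthetic data frame (hypothetical values).

```r
# Synthetic stand-in (hypothetical): one drivetrain level has a single observation
toy <- data.frame(
  drivetrain  = factor(c("2WD", rep("Front", 6), rep("Rear", 5))),
  first_owner = factor(c(1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0))
)

# Cross-tabulate each level against the outcome
tab <- table(toy$drivetrain, toy$first_owner)
print(tab)

# A level whose cars all fall in one outcome class pushes the related
# coefficients toward +/- infinity, which shows up as huge standard errors
sparse_levels <- rownames(tab)[apply(tab == 0, 1, any)]
print(sparse_levels)
```

Merging such levels into a broader category, as the grouped variables brand_group, fuel_group and drivetrain_group do, is one way to avoid the problem.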
step(model_first_owner)
Start: AIC=463.07
first_owner ~ year + mileage + engine_size + avg_mpg + brand +
automatic_transmission + fuel + drivetrain + damaged + price +
navigation_system + bluetooth + third_row_seating + heated_seats
Df Deviance AIC
- brand 24 407.65 449.65
- fuel 5 377.28 457.28
- heated_seats 1 373.07 461.07
- damaged 1 373.08 461.08
- avg_mpg 1 373.12 461.12
- navigation_system 1 373.24 461.24
- drivetrain 4 379.47 461.47
- engine_size 1 373.49 461.49
- third_row_seating 1 375.04 463.04
<none> 373.07 463.07
- bluetooth 1 375.84 463.84
- year 1 377.12 465.12
- price 1 377.35 465.35
- automatic_transmission 1 379.27 467.27
- mileage 1 384.06 472.06
Step: AIC=449.65
first_owner ~ year + mileage + engine_size + avg_mpg + automatic_transmission +
fuel + drivetrain + damaged + price + navigation_system +
bluetooth + third_row_seating + heated_seats
Df Deviance AIC
- fuel 5 413.75 445.75
- heated_seats 1 407.65 447.65
- avg_mpg 1 407.66 447.66
- navigation_system 1 407.77 447.77
- damaged 1 407.79 447.79
- engine_size 1 407.94 447.94
- price 1 408.14 448.14
<none> 407.65 449.65
- automatic_transmission 1 412.85 452.85
- third_row_seating 1 413.38 453.38
- bluetooth 1 417.22 457.22
- drivetrain 4 426.41 460.41
- mileage 1 424.87 464.87
- year 1 426.37 466.37
Step: AIC=445.75
first_owner ~ year + mileage + engine_size + avg_mpg + automatic_transmission +
drivetrain + damaged + price + navigation_system + bluetooth +
third_row_seating + heated_seats
Df Deviance AIC
- heated_seats 1 413.76 443.76
- damaged 1 413.83 443.83
- navigation_system 1 413.85 443.85
- avg_mpg 1 413.87 443.87
- engine_size 1 414.01 444.01
- price 1 414.45 444.45
<none> 413.75 445.75
- automatic_transmission 1 418.76 448.76
- third_row_seating 1 419.58 449.58
- bluetooth 1 421.75 451.75
- drivetrain 4 432.81 456.81
- year 1 431.56 461.56
- mileage 1 431.72 461.72
Step: AIC=443.76
first_owner ~ year + mileage + engine_size + avg_mpg + automatic_transmission +
drivetrain + damaged + price + navigation_system + bluetooth +
third_row_seating
Df Deviance AIC
- damaged 1 413.84 441.84
- navigation_system 1 413.85 441.85
- avg_mpg 1 413.88 441.88
- engine_size 1 414.01 442.01
- price 1 414.46 442.46
<none> 413.76 443.76
- automatic_transmission 1 418.76 446.76
- third_row_seating 1 419.62 447.62
- bluetooth 1 421.75 449.75
- drivetrain 4 432.84 454.84
- year 1 431.67 459.67
- mileage 1 431.74 459.74
Step: AIC=441.84
first_owner ~ year + mileage + engine_size + avg_mpg + automatic_transmission +
drivetrain + price + navigation_system + bluetooth + third_row_seating
Df Deviance AIC
- navigation_system 1 413.92 439.92
- avg_mpg 1 413.94 439.94
- engine_size 1 414.08 440.08
- price 1 414.54 440.54
<none> 413.84 441.84
- automatic_transmission 1 418.76 444.76
- third_row_seating 1 419.71 445.71
- bluetooth 1 421.99 447.99
- drivetrain 4 432.89 452.89
- year 1 431.91 457.91
- mileage 1 432.25 458.25
Step: AIC=439.92
first_owner ~ year + mileage + engine_size + avg_mpg + automatic_transmission +
drivetrain + price + bluetooth + third_row_seating
Df Deviance AIC
- avg_mpg 1 414.04 438.04
- engine_size 1 414.17 438.17
- price 1 414.54 438.54
<none> 413.92 439.92
- automatic_transmission 1 418.84 442.84
- third_row_seating 1 419.71 443.71
- bluetooth 1 422.83 446.83
- drivetrain 4 433.52 451.52
- year 1 432.42 456.42
- mileage 1 432.71 456.71
Step: AIC=438.04
first_owner ~ year + mileage + engine_size + automatic_transmission +
drivetrain + price + bluetooth + third_row_seating
Df Deviance AIC
- engine_size 1 414.21 436.21
- price 1 414.60 436.60
<none> 414.04 438.04
- automatic_transmission 1 418.93 440.93
- third_row_seating 1 420.01 442.01
- bluetooth 1 422.89 444.89
- drivetrain 4 434.30 450.30
- year 1 432.78 454.78
- mileage 1 432.83 454.83
Step: AIC=436.21
first_owner ~ year + mileage + automatic_transmission + drivetrain +
price + bluetooth + third_row_seating
Df Deviance AIC
- price 1 415.33 435.33
<none> 414.21 436.21
- automatic_transmission 1 419.31 439.31
- third_row_seating 1 420.97 440.97
- bluetooth 1 422.95 442.95
- drivetrain 4 434.50 448.50
- year 1 433.41 453.41
- mileage 1 433.66 453.66
Step: AIC=435.33
first_owner ~ year + mileage + automatic_transmission + drivetrain +
bluetooth + third_row_seating
Df Deviance AIC
<none> 415.33 435.33
- automatic_transmission 1 420.50 438.50
- third_row_seating 1 423.70 441.70
- bluetooth 1 424.02 442.02
- drivetrain 4 435.05 447.05
- year 1 440.10 458.10
- mileage 1 442.10 460.10
Call: glm(formula = first_owner ~ year + mileage + automatic_transmission +
drivetrain + bluetooth + third_row_seating, family = binomial,
data = mydf)
Coefficients:
(Intercept) year mileage
-4.107e+02 2.165e-01 -2.678e-05
automatic_transmission1 drivetrainFour-wheel Drive drivetrainFront-wheel Drive
1.246e+00 -2.479e+01 -2.445e+01
drivetrainRear-wheel Drive drivetrainUnknown bluetooth1
-2.487e+01 -1.915e+01 -1.493e+00
third_row_seating1
1.306e+00
Degrees of Freedom: 409 Total (i.e. Null); 400 Residual
Null Deviance: 568.1
Residual Deviance: 415.3 AIC: 435.3
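For a binary (0/1) response the AIC column in the trace above is the residual deviance plus twice the number of estimated coefficients, so each row can be checked by hand. The sketch below redoes two rows and confirms the identity on synthetic data (hypothetical, not the car data).

```r
# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients)
aic_final <- 415.33 + 2 * 10      # final model: 10 coefficients
aic_no_brand <- 407.65 + 2 * 21   # first step: 45 coefficients minus 24 brand dummies
print(c(aic_final, aic_no_brand))

# Check on synthetic data (hypothetical, not the car data)
set.seed(3)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(0.5 * toy$x))
fit <- glm(y ~ x, data = toy, family = binomial)
aic_manual <- deviance(fit) + 2 * length(coef(fit))
```

This is why removing brand lowers the AIC even though the deviance rises by about 34.6: the 24 dummy variables cost 48 in penalty.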
plot(model_first_owner)
Warning: not plotting observations with leverage one:
94
#Suggested Model is:
first_owner_model <- glm(first_owner ~ year + mileage + automatic_transmission +
drivetrain + bluetooth + third_row_seating, family = binomial,
data = mydf)
summary(first_owner_model)
Call:
glm(formula = first_owner ~ year + mileage + automatic_transmission +
drivetrain + bluetooth + third_row_seating, family = binomial,
data = mydf)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.107e+02 8.885e+02 -0.462 0.64392
year 2.165e-01 5.123e-02 4.225 2.39e-05 ***
mileage -2.678e-05 5.483e-06 -4.884 1.04e-06 ***
automatic_transmission1 1.246e+00 5.771e-01 2.158 0.03090 *
drivetrainFour-wheel Drive -2.479e+01 8.827e+02 -0.028 0.97760
drivetrainFront-wheel Drive -2.445e+01 8.827e+02 -0.028 0.97791
drivetrainRear-wheel Drive -2.487e+01 8.827e+02 -0.028 0.97752
drivetrainUnknown -1.915e+01 8.828e+02 -0.022 0.98269
bluetooth1 -1.493e+00 5.192e-01 -2.875 0.00404 **
third_row_seating1 1.306e+00 4.808e-01 2.717 0.00659 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 568.14 on 409 degrees of freedom
Residual deviance: 415.33 on 400 degrees of freedom
AIC: 435.33
Number of Fisher Scoring iterations: 13
\[ \log\left(\frac{P(\text{first_owner} = 1)}{1 - P(\text{first_owner} = 1)}\right) = -410.7 + 0.2165 \times \text{year} - 0.00002678 \times \text{mileage} + 1.246 \times \text{automatic_transmission1} \\ -24.79 \times \text{drivetrain_Four-wheel Drive} - 24.45 \times \text{drivetrain_Front-wheel Drive} - 24.87 \times \text{drivetrain_Rear-wheel Drive} \\ -19.15 \times \text{drivetrain_Unknown} - 1.493 \times \text{bluetooth1} + 1.306 \times \text{third_row_seating1} \]
# Cross-validation with caret package (example with 10-fold cross-validation)
set.seed(123)
cv_results <- train(
first_owner ~ year + mileage + automatic_transmission + drivetrain + bluetooth + third_row_seating,
data = mydf,
method = "glm",
family = "binomial",
trControl = trainControl(method = "cv", number = 10)
)
Warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
print(cv_results)
Generalized Linear Model
410 samples
6 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 369, 369, 369, 369, 369, 369, ...
Resampling results:
Accuracy Kappa
0.7902439 0.5798615
# Accuracy: about 79.0% of cars in the held-out folds are classified correctly
# Kappa: 0.58, i.e. moderate agreement beyond what chance alone would give
# Random Forest
model_rf <- randomForest(first_owner ~ year + mileage + automatic_transmission + drivetrain + bluetooth + third_row_seating, data = mydf)
print(model_rf)
Call:
randomForest(formula = first_owner ~ year + mileage + automatic_transmission + drivetrain + bluetooth + third_row_seating, data = mydf)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 22.68%
Confusion matrix:
0 1 class.error
0 153 47 0.2350000
1 46 164 0.2190476
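The OOB confusion matrix above can be converted into the same Accuracy and Kappa metrics that caret reports, which makes the Random Forest and the cross-validated glm easier to compare. A minimal sketch using the printed counts (object names are illustrative):

```r
# OOB confusion matrix copied from the output above (rows = actual, columns = predicted)
cm <- matrix(c(153, 47,
               46, 164), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
n <- sum(cm)

accuracy <- sum(diag(cm)) / n                        # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)
print(round(c(accuracy = accuracy, oob_error = 1 - accuracy, kappa = cohen_kappa), 4))
```

The OOB error of 22.68% corresponds to an accuracy of about 0.773, slightly below the cross-validated glm accuracy of 0.790.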
# Feature importance plot for Random Forest
varImpPlot(model_rf)
# Predict probabilities of being a first owner
predictions <- predict( first_owner_model, type = "response")
# Visualize actual vs predicted probabilities
ggplot(mydf, aes(x = predictions, fill = as.factor(first_owner))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
labs(x = "Predicted Probability of Being First Owner", y = "Frequency") +
scale_fill_discrete(name = "First Owner", labels = c("Not First Owner", "First Owner")) +
theme_minimal()
The ungrouped model has too many predictor levels. Model 2 refits first_owner using the grouped categories (brand_group, fuel_group, drivetrain_group).
model2_first_owner <- glm(first_owner ~ year + mileage + engine_size + avg_mpg + brand_group + automatic_transmission + fuel_group + drivetrain_group + damaged + price + navigation_system + bluetooth + third_row_seating + heated_seats, data = mydf, family = binomial)
# Summary of the model
summary(model2_first_owner)
Call:
glm(formula = first_owner ~ year + mileage + engine_size + avg_mpg +
brand_group + automatic_transmission + fuel_group + drivetrain_group +
damaged + price + navigation_system + bluetooth + third_row_seating +
heated_seats, family = binomial, data = mydf)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.484e+02 8.342e+01 -1.779 0.07516 .
year 7.154e-02 4.160e-02 1.720 0.08546 .
mileage -3.163e-05 6.309e-06 -5.013 5.36e-07 ***
engine_size 5.585e-03 1.571e-01 0.036 0.97164
avg_mpg 1.819e-02 2.453e-02 0.741 0.45840
brand_groupLow_Price 2.198e+00 8.938e-01 2.459 0.01394 *
brand_groupMid_Price 8.178e-01 5.404e-01 1.513 0.13019
automatic_transmission1 1.402e+00 5.493e-01 2.552 0.01070 *
fuel_groupPetrol 3.909e-01 4.809e-01 0.813 0.41630
drivetrain_groupFour-wheel Drive 1.650e-01 3.415e-01 0.483 0.62891
drivetrain_groupFront-wheel Drive 7.308e-01 4.168e-01 1.753 0.07955 .
damaged1 -6.191e-02 2.951e-01 -0.210 0.83381
price 9.942e-05 3.320e-05 2.994 0.00275 **
navigation_system1 -1.345e-01 2.941e-01 -0.457 0.64742
bluetooth1 -1.083e+00 5.037e-01 -2.151 0.03149 *
third_row_seating1 1.449e+00 5.186e-01 2.794 0.00521 **
heated_seats1 1.188e-01 2.668e-01 0.445 0.65613
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 568.14 on 409 degrees of freedom
Residual deviance: 420.99 on 393 degrees of freedom
AIC: 454.99
Number of Fisher Scoring iterations: 5
step(model2_first_owner)
Start: AIC=454.99
first_owner ~ year + mileage + engine_size + avg_mpg + brand_group +
automatic_transmission + fuel_group + drivetrain_group +
damaged + price + navigation_system + bluetooth + third_row_seating +
heated_seats
Df Deviance AIC
- engine_size 1 420.99 452.99
- damaged 1 421.04 453.04
- heated_seats 1 421.19 453.19
- navigation_system 1 421.20 453.20
- avg_mpg 1 421.54 453.54
- fuel_group 1 421.65 453.65
- drivetrain_group 2 424.57 454.57
<none> 420.99 454.99
- year 1 425.43 457.43
- bluetooth 1 425.83 457.83
- brand_group 2 429.08 459.08
- automatic_transmission 1 428.14 460.14
- third_row_seating 1 429.72 461.72
- price 1 430.28 462.28
- mileage 1 449.43 481.43
Step: AIC=452.99
first_owner ~ year + mileage + avg_mpg + brand_group + automatic_transmission +
fuel_group + drivetrain_group + damaged + price + navigation_system +
bluetooth + third_row_seating + heated_seats
Df Deviance AIC
- damaged 1 421.04 451.04
- heated_seats 1 421.19 451.19
- navigation_system 1 421.20 451.20
- avg_mpg 1 421.57 451.57
- fuel_group 1 421.66 451.66
- drivetrain_group 2 424.58 452.58
<none> 420.99 452.99
- year 1 425.67 455.67
- bluetooth 1 425.84 455.84
- brand_group 2 429.08 457.08
- automatic_transmission 1 428.22 458.22
- third_row_seating 1 430.21 460.21
- price 1 430.97 460.97
- mileage 1 451.25 481.25
Step: AIC=451.04
first_owner ~ year + mileage + avg_mpg + brand_group + automatic_transmission +
fuel_group + drivetrain_group + price + navigation_system +
bluetooth + third_row_seating + heated_seats
Df Deviance AIC
- heated_seats 1 421.24 449.24
- navigation_system 1 421.24 449.24
- avg_mpg 1 421.59 449.59
- fuel_group 1 421.71 449.71
- drivetrain_group 2 424.58 450.58
<none> 421.04 451.04
- year 1 425.77 453.77
- bluetooth 1 425.94 453.94
- brand_group 2 429.24 455.24
- automatic_transmission 1 428.24 456.24
- third_row_seating 1 430.26 458.26
- price 1 431.01 459.01
- mileage 1 452.11 480.11
Step: AIC=449.24
first_owner ~ year + mileage + avg_mpg + brand_group + automatic_transmission +
fuel_group + drivetrain_group + price + navigation_system +
bluetooth + third_row_seating
Df Deviance AIC
- navigation_system 1 421.35 447.35
- avg_mpg 1 421.82 447.82
- fuel_group 1 421.93 447.93
- drivetrain_group 2 424.76 448.76
<none> 421.24 449.24
- bluetooth 1 426.06 452.06
- year 1 426.09 452.09
- brand_group 2 429.28 453.28
- automatic_transmission 1 428.33 454.33
- third_row_seating 1 430.50 456.50
- price 1 431.14 457.14
- mileage 1 452.45 478.45
Step: AIC=447.35
first_owner ~ year + mileage + avg_mpg + brand_group + automatic_transmission +
fuel_group + drivetrain_group + price + bluetooth + third_row_seating
Df Deviance AIC
- avg_mpg 1 421.98 445.98
- fuel_group 1 422.10 446.10
- drivetrain_group 2 425.09 447.09
<none> 421.35 447.35
- year 1 426.27 450.27
- bluetooth 1 426.73 450.73
- brand_group 2 429.56 451.56
- automatic_transmission 1 428.48 452.48
- third_row_seating 1 430.50 454.50
- price 1 431.14 455.14
- mileage 1 453.51 477.51
Step: AIC=445.98
first_owner ~ year + mileage + brand_group + automatic_transmission +
fuel_group + drivetrain_group + price + bluetooth + third_row_seating
Df Deviance AIC
- fuel_group 1 422.53 444.53
<none> 421.98 445.98
- drivetrain_group 2 426.50 446.50
- bluetooth 1 427.29 449.29
- year 1 427.33 449.33
- brand_group 2 430.12 450.12
- automatic_transmission 1 428.95 450.95
- third_row_seating 1 431.31 453.31
- price 1 431.31 453.31
- mileage 1 455.33 477.33
Step: AIC=444.53
first_owner ~ year + mileage + brand_group + automatic_transmission +
drivetrain_group + price + bluetooth + third_row_seating
Df Deviance AIC
<none> 422.53 444.53
- drivetrain_group 2 426.86 444.86
- bluetooth 1 427.79 447.79
- year 1 428.21 448.21
- brand_group 2 430.69 448.69
- automatic_transmission 1 429.23 449.23
- price 1 431.81 451.81
- third_row_seating 1 432.09 452.09
- mileage 1 456.37 476.37
Call: glm(formula = first_owner ~ year + mileage + brand_group + automatic_transmission +
drivetrain_group + price + bluetooth + third_row_seating,
family = binomial, data = mydf)
Coefficients:
(Intercept) year
-1.591e+02 7.731e-02
mileage brand_groupLow_Price
-3.262e-05 2.209e+00
brand_groupMid_Price automatic_transmission1
8.435e-01 1.348e+00
drivetrain_groupFour-wheel Drive drivetrain_groupFront-wheel Drive
1.658e-01 7.705e-01
price bluetooth1
9.460e-05 -1.098e+00
third_row_seating1
1.458e+00
Degrees of Freedom: 409 Total (i.e. Null); 399 Residual
Null Deviance: 568.1
Residual Deviance: 422.5 AIC: 444.5
plot(model2_first_owner)
first_owner_model <- glm(first_owner ~ year + mileage + brand_group + automatic_transmission +
drivetrain_group + price + bluetooth + third_row_seating,
family = binomial, data = mydf)
summary(first_owner_model)
Call:
glm(formula = first_owner ~ year + mileage + brand_group + automatic_transmission +
drivetrain_group + price + bluetooth + third_row_seating,
family = binomial, data = mydf)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.591e+02 8.090e+01 -1.966 0.04927 *
year 7.731e-02 4.035e-02 1.916 0.05537 .
mileage -3.262e-05 5.975e-06 -5.460 4.77e-08 ***
brand_groupLow_Price 2.209e+00 8.857e-01 2.494 0.01263 *
brand_groupMid_Price 8.435e-01 5.357e-01 1.574 0.11540
automatic_transmission1 1.348e+00 5.467e-01 2.465 0.01370 *
drivetrain_groupFour-wheel Drive 1.658e-01 3.306e-01 0.501 0.61612
drivetrain_groupFront-wheel Drive 7.705e-01 4.029e-01 1.913 0.05581 .
price 9.460e-05 3.162e-05 2.992 0.00277 **
bluetooth1 -1.098e+00 4.884e-01 -2.247 0.02461 *
third_row_seating1 1.458e+00 5.009e-01 2.911 0.00361 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 568.14 on 409 degrees of freedom
Residual deviance: 422.53 on 399 degrees of freedom
AIC: 444.53
Number of Fisher Scoring iterations: 5
Cross Validation:
# Cross-validation with caret package (example with 10-fold cross-validation)
set.seed(123)
cv_results <- train(
first_owner ~ year + mileage + brand_group + automatic_transmission +
drivetrain_group + price + bluetooth + third_row_seating,
data = mydf,
method = "glm",
family = "binomial",
trControl = trainControl(method = "cv", number = 10)
)
print(cv_results)
Generalized Linear Model
410 samples
8 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 369, 369, 369, 369, 369, 369, ...
Resampling results:
Accuracy Kappa
0.7804878 0.5599078
# Random Forest
model_rf <- randomForest(first_owner ~ year + mileage + brand_group + automatic_transmission +
drivetrain_group + price + bluetooth + third_row_seating, data = mydf)
print(model_rf)
Call:
randomForest(formula = first_owner ~ year + mileage + brand_group + automatic_transmission + drivetrain_group + price + bluetooth + third_row_seating, data = mydf)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 21.71%
Confusion matrix:
0 1 class.error
0 158 42 0.2100000
1 47 163 0.2238095
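The headline figures in this output can be re-derived from the confusion matrix alone. The sketch below is only a check, assuming the usual `randomForest` layout (rows are actual classes, columns are OOB predictions); the object names `conf`, `accuracy`, `oob_error`, `sensitivity`, and `specificity` are illustrative and not part of the original analysis.

```r
# OOB confusion matrix copied from print(model_rf)
# Assumption: rows = actual class, columns = predicted class
conf <- matrix(c(158,  42,
                  47, 163),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("0", "1")))

n <- sum(conf)                                   # 410 cars in total
accuracy    <- sum(diag(conf)) / n               # share of correct OOB predictions
oob_error   <- 1 - accuracy                      # should match the reported OOB error
sensitivity <- conf["1", "1"] / sum(conf["1", ]) # first-owner cars correctly identified
specificity <- conf["0", "0"] / sum(conf["0", ]) # non-first-owner cars correctly identified

print(round(c(accuracy = accuracy, oob_error = oob_error,
              sensitivity = sensitivity, specificity = specificity), 4))
```

The recomputed OOB error (89 misclassified out of 410) agrees with the 21.71% reported above, and the two classes are misclassified at similar rates.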
# Feature importance plot for Random Forest
varImpPlot(model_rf)
# year, mileage, and price are the most important features
\[
\begin{aligned}
\log\left(\frac{P(\text{first_owner} = 1)}{1 - P(\text{first_owner} = 1)}\right) ={}& -159.1 + 0.077 \times \text{year} - 3.26 \times 10^{-5} \times \text{mileage} + 2.21 \times \text{brand_groupLow_Price} \\
& + 0.84 \times \text{brand_groupMid_Price} + 1.35 \times \text{automatic_transmission1} + 0.17 \times \text{drivetrain_groupFour-wheel Drive} \\
& + 0.77 \times \text{drivetrain_groupFront-wheel Drive} + 9.46 \times 10^{-5} \times \text{price} - 1.10 \times \text{bluetooth1} + 1.46 \times \text{third_row_seating1}
\end{aligned}
\]
exp(coef(first_owner_model))
(Intercept) year mileage
8.300713e-70 1.080380e+00 9.999674e-01
brand_groupLow_Price brand_groupMid_Price automatic_transmission1
9.105458e+00 2.324399e+00 3.848569e+00
drivetrain_groupFour-wheel Drive drivetrain_groupFront-wheel Drive price
1.180305e+00 2.160883e+00 1.000095e+00
bluetooth1 third_row_seating1
3.336299e-01 4.296700e+00
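The exponentiated coefficients are odds ratios. Values above 1 (the low- and mid-price brand groups, automatic transmission, front-wheel drive, and third-row seating) indicate higher odds that the car is sold by its first owner relative to the reference level, holding the other predictors fixed. For example, third-row seating multiplies the odds by about 4.3. Not all of these effects are statistically significant: the mid-price brand group and four-wheel drive have p-values above 0.1.
Conversely, values below 1 indicate lower odds. Bluetooth has an odds ratio of about 0.33, so cars with Bluetooth have roughly 67% lower odds of being sold by the first owner. The odds ratios for mileage and price are close to 1 only because they are expressed per single mile and per single unit of price. Both effects are statistically significant (p < 0.01), and over 10,000 additional miles the odds fall by about 28%.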
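Per-unit odds ratios for mileage and price are hard to read because one mile or one unit of price is a very small change. A minimal sketch of rescaling them is given below, with the coefficients copied from the model summary rather than taken from the fitted object; the 10,000-unit step and the names `coefs`, `or_per_10k`, and `pct_change` are illustrative choices, not part of the original analysis.

```r
# Coefficients (log-odds scale) copied from summary(first_owner_model)
coefs <- c(mileage            = -3.262e-05,
           price              =  9.460e-05,
           bluetooth1         = -1.098,
           third_row_seating1 =  1.458)

# Odds ratio for a 10,000-unit increase in each continuous predictor:
# exp(beta * 10000) rather than exp(beta)
or_per_10k <- exp(coefs[c("mileage", "price")] * 1e4)

# Percent change in the odds when a binary feature is present
pct_change <- (exp(coefs[c("bluetooth1", "third_row_seating1")]) - 1) * 100

print(round(or_per_10k, 3))
print(round(pct_change, 1))
```

On this scale, 10,000 extra miles multiplies the odds of a first-owner sale by roughly 0.72, and a 10,000-unit higher price multiplies them by roughly 2.58, so neither effect is negligible.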
Justify and propose one model. Describe, explain, and critique it.
Model 1: Null deviance = 568.14, Residual deviance = 415.33, AIC = 435.33; Generalized Linear Model: Accuracy = 0.790, Kappa = 0.580; Random Forest: OOB error rate = 22.68%
Model 2: Null deviance = 568.14, Residual deviance = 422.53, AIC = 444.53; Generalized Linear Model: Accuracy = 0.780, Kappa = 0.560; Random Forest: OOB error rate = 21.71%
Two models were compared: Model 1 with all original variables and Model 2 with grouped categorical levels of brand, fuel, and drivetrain.
Model 1 had a lower AIC and residual deviance, indicating better statistical fit, but Model 2 was more interpretable due to reduced categorical levels.
In cross-validation, Model 1 slightly outperformed Model 2 (accuracy 0.790 vs 0.780, kappa 0.580 vs 0.560), whereas Model 2 had a slightly lower Random Forest OOB error rate (21.71% vs 22.68%). Both gaps are about one percentage point, so neither model is clearly better at prediction.
Visual inspection of residuals showed that Model 2 had better homoscedasticity and independence of errors. Therefore, Model 2 was chosen as the ideal model despite Model 1’s lower AIC and residual deviance.
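The size of the fit difference between the two models can be put in numbers using only the statistics quoted above. The sketch below assumes a 0/1 response, for which the residual deviance equals minus twice the log-likelihood, so that AIC = deviance + 2k and 1 - residual/null deviance coincides with McFadden's pseudo R-squared; the data frame `fit` and its column names are illustrative.

```r
# Fit statistics copied from the two model summaries
fit <- data.frame(
  model     = c("Model 1", "Model 2"),
  null_dev  = c(568.14, 568.14),
  resid_dev = c(415.33, 422.53),
  aic       = c(435.33, 444.53)
)

# Number of estimated coefficients implied by AIC = deviance + 2k
# (assumption: binary 0/1 response)
fit$k <- (fit$aic - fit$resid_dev) / 2

# Share of the null deviance explained (McFadden-style pseudo R-squared)
fit$pseudo_r2 <- 1 - fit$resid_dev / fit$null_dev

# AIC gap relative to the best-fitting model
fit$delta_aic <- fit$aic - min(fit$aic)

print(fit)
```

Under these assumptions Model 2 trails Model 1 by about 9.2 AIC points and explains about 26% of the null deviance against about 27% for Model 1, which is the cost accepted in exchange for the simpler grouped factors.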
Random Forest identified price, mileage, and year as the key predictors of whether a car is sold by its first owner.
Key Trends:
(OpenAI 2022) (Microsoft 2023)