Buying a used car the data science way: Part 1

webscraping cars pricing r
Bryan Whiting
2017-02-18

[Update 2021-11-16] This analysis was originally written on my old blog here. You can find the source code for it here. This code no longer works as TrueCar changed their CSS to make it more difficult to scrape. It’s still possible, but you’d need to build a custom scraper from scratch.

This is part 1 out of a two-part series on scraping used car data. Check out part 2 to learn how to analyze the data.

In another post, I describe how I use this data that I’ve scraped, but I wanted to provide a more in-depth tutorial for those interested in how I got the data. Note, this data belongs to Truecar, so all uses herein are for personal and academic reasons only.

Get the data

In order to do any good analaysis, you first need data. I prefer to have more data than less, where possible. In this case, I don’t have any data, so I use webscraping to get the data. There are much better tutorials on how to scrape data, so I’ll be light. I use R’s rvest package here, which does a decent job.1 Let’s look at Truecar’s Used Car postings2. First I use google to find the search query on Truecar that I like.

Load packages

library(rvest)
library(dplyr)
library(magrittr)
# Find the URL of the data you want to scrape
url <- 'https://www.truecar.com/used-cars-for-sale/listings/ford/edge/'
read_html(url)
## {xml_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n    <!-- Element target for any partner code meant to execut ...

You’ll see there’s a head and a body. Our data’s in the body, so let’s use html_nodes() and html_text() to parse out the data we want. I used Selectorgadget to know what HTML classes to search for.

read_html(url) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
## character(0)

So that’s how you get the data on a single page. If you look closer at the URL, you see a lot of helpful things. First, there’s the make, then the model, then the location-zip, then the year-range, and ultimately the trim. This is a very pretty and clean URL. If you click on a few additional pages, you’ll see the URL opens up with ?page=2.

https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/?page=2

This is our ‘in’ to scraping multiple pages. I won’t bore you with the details of how to get that data into a neat matrix for us to analyze, but suffice it to say that I’m able to do it. Just build a function to construct a URL, and build a loop to go through the different pages, then use lots of str_extract() from the stringr package and gsub to clean up the data.

library(stringr)

make = 'ford'
model = 'edge'
zip = '90210'
year = 2012
npages = 5

url <- paste('https://www.truecar.com/used-cars-for-sale/listings/', 
             make, '/', 
             model ,
             '/location-', zip,
             '/year-',year,'-max/?page=', sep = "")

urls <- paste(url, 1:npages, sep = "")

scrape <- function(pageno){
  try(
    read_html(urls[pageno]) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
  )
}

long_list = scrape(1)
for(i in 2:npages){
  print(i)
  new_list = try(scrape(i))
  
  error = ("try-error" %in% class(new_list))
  
  if( error == FALSE ){
    long_list = c(long_list, new_list) 
  } else {
    break
  }
}
## [1] 2
stats <- long_list
df <- as.data.frame(stats)
df$stats %<>% as.character()
df$price <- str_extract(df$stats, '\\$[0-9]*,[0-9]*') %>% 
  gsub('Price: |\\$|,', '', .) %>%
  as.numeric()
df$year <- str_extract(df$stats, '^[0-9]* ') %>% 
  as.numeric()
df$mileage <- str_extract(df$stats, 'Mileage: [0-9]*,[0-9]*') %>% 
  gsub('Mileage: |,', '', .) %>%
  as.numeric()

# a = df$stats[1]
df$trim <- str_extract(df$stats, '.*Mileage:') %>% 
  gsub('FWD|AWD|4x[24]|[24]WD|V6|4-cyl|^[0-9][0-9][0-9][0-9]|4dr|Automatic|Manual|Mileage:', '', ., ignore.case = T) %>% 
  gsub(make, '', ., ignore.case = T) %>% 
  gsub(model, '', ., ignore.case = T) %>% 
  trimws() 


df$awd <- grepl('AWD|4WD|4x4', df$stats, ignore.case = T) %>% as.numeric()
df$manual <- grepl('manual', df$stats) %>% as.numeric()
df$v6 <- grepl('V6', df$stats) %>% as.numeric()
df$location <- str_extract(df$stats, 'Location: .*Exterior:') %>% 
  gsub('Location: |Exterior:', '', .) %>% 
  trimws() 
df$ext <- str_extract(df$stats, 'Exterior: .*Interior:') %>% 
  gsub('Interior:|Exterior:', '', .) %>% 
  trimws() 
df$int <- str_extract(df$stats, 'Interior: .*VIN:') %>% 
  gsub('Interior: |VIN:', '', .) %>% 
  trimws() 
df$vin <- str_extract(df$stats, 'VIN: .*\\$') %>% 
  gsub('VIN: |\\$', '', .) %>% 
  substr(., 1, 17)
df$deal <- str_extract(df$stats, '\\$[0-9]*,[0-9]* below') %>% 
  gsub('below|\\$|,', '', .) %>% trimws() %>%
  as.numeric()

And here’s what the results look like. You’ve got the original scraped data in the stats column and then everything else that you can parse out.

# df was the dataframe object we needed
df %>% select(-stats) %>% head(10) %>% formattable::formattable()