In another post, I describe how I use this data that I’ve scraped, but I wanted to provide a more in-depth tutorial for those interested in how I got the data. Note, this data belongs to Truecar, so all uses herein are for personal and academic reasons only.

Get the data

In order to do any good analysis, you first need data. I prefer to have more data rather than less, where possible. In this case, I don’t have any data, so I use web scraping to get it. There are much better tutorials on how to scrape data, so I’ll keep this light. I use R’s rvest package here, which does a decent job.1 Let’s look at Truecar’s Used Car postings2. First, I use Google to find the search query on Truecar that I like.

# Load packages
library(rvest)
library(dplyr)
library(magrittr)
# Find the URL of the data you want to scrape
url <- 'https://www.truecar.com/used-cars-for-sale/listings/ford/edge/'
read_html(url)
## {xml_document}
## <html lang="en-US">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n    <div id="main">\n<div></div>\n<div data-qa="TransitionHo ...

You’ll see there’s a head and a body. Our data’s in the body, so let’s use html_nodes() and html_text() to parse out the data we want. I used SelectorGadget to figure out which HTML classes to search for.

read_html(url) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
## character(0)

So that’s how you get the data on a single page. If you look more closely at the URL, you’ll see a lot of helpful things. First there’s the make, then the model, then the location-zip, then the year range, and finally the trim. This is a very clean, readable URL. If you click through a few additional pages, you’ll see the URL picks up a ?page=2 parameter.

https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/?page=2
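
To make that structure concrete, here’s a small helper that assembles a listings URL from those pieces. This is just a sketch: the optional trim segment and its placement at the end are my assumptions based on the ordering above, so check it against a real URL before relying on it.

# Assemble a Truecar listings URL from its parts (the trim segment is an assumption)
build_url <- function(make, model, zip, year, trim = NULL, page = 1){
  base <- paste0('https://www.truecar.com/used-cars-for-sale/listings/',
                 make, '/', model,
                 '/location-', zip,
                 '/year-', year, '-max/')
  if(!is.null(trim)) base <- paste0(base, trim, '/')
  paste0(base, '?page=', page)
}

build_url('ford', 'edge', '90210', 2012, page = 2)
## [1] "https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/year-2012-max/?page=2"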

This is our ‘in’ to scraping multiple pages. I won’t bore you with the details of how to get that data into a neat matrix for us to analyze, but suffice it to say that I’m able to do it. Just build a function to construct the URL, loop through the different pages, and then use lots of str_extract() from the stringr package and gsub() to clean up the data.

library(stringr)

make = 'ford'
model = 'edge'
zip = '90210'
year = 2012
npages = 5

url <- paste('https://www.truecar.com/used-cars-for-sale/listings/', 
             make, '/', 
             model ,
             '/location-', zip,
             '/year-',year,'-max/?page=', sep = "")

urls <- paste(url, 1:npages, sep = "")

# Scrape one results page, returning the raw listing text for each car
scrape <- function(pageno){
  try(
    read_html(urls[pageno]) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
  )
}

# Scrape the first page, then loop over the remaining pages, stopping if one fails
long_list = scrape(1)
for(i in 2:npages){
  print(i)
  new_list = try(scrape(i))
  
  error = ("try-error" %in% class(new_list))
  
  if( error == FALSE ){
    long_list = c(long_list, new_list) 
  } else {
    break
  }
}
## [1] 2
# Put the raw scraped text into a data frame, then parse out each field
stats <- long_list
df <- as.data.frame(stats)
df$stats %<>% as.character()
df$price <- str_extract(df$stats, '\\$[0-9]*,[0-9]*') %>% 
  gsub('Price: |\\$|,', '', .) %>%
  as.numeric()
df$year <- str_extract(df$stats, '^[0-9]* ') %>% 
  as.numeric()
df$mileage <- str_extract(df$stats, 'Mileage: [0-9]*,[0-9]*') %>% 
  gsub('Mileage: |,', '', .) %>%
  as.numeric()

# Parse out the trim by stripping drivetrain, engine, year, make, and model tokens
df$trim <- str_extract(df$stats, '.*Mileage:') %>% 
  gsub('FWD|AWD|4x[24]|[24]WD|V6|4-cyl|^[0-9][0-9][0-9][0-9]|4dr|Automatic|Manual|Mileage:', '', ., ignore.case = T) %>% 
  gsub(make, '', ., ignore.case = T) %>% 
  gsub(model, '', ., ignore.case = T) %>% 
  trimws() 


df$awd <- grepl('AWD|4WD|4x4', df$stats, ignore.case = T) %>% as.numeric()
df$manual <- grepl('manual', df$stats) %>% as.numeric()
df$v6 <- grepl('V6', df$stats) %>% as.numeric()
df$location <- str_extract(df$stats, 'Location: .*Exterior:') %>% 
  gsub('Location: |Exterior:', '', .) %>% 
  trimws() 
df$ext <- str_extract(df$stats, 'Exterior: .*Interior:') %>% 
  gsub('Interior:|Exterior:', '', .) %>% 
  trimws() 
df$int <- str_extract(df$stats, 'Interior: .*VIN:') %>% 
  gsub('Interior: |VIN:', '', .) %>% 
  trimws() 
df$vin <- str_extract(df$stats, 'VIN: .*\\$') %>% 
  gsub('VIN: |\\$', '', .) %>% 
  substr(., 1, 17)
df$deal <- str_extract(df$stats, '\\$[0-9]*,[0-9]* below') %>% 
  gsub('below|\\$|,', '', .) %>% trimws() %>%
  as.numeric()
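
If those regexes look opaque, here’s a toy illustration on a made-up listing string. The string (and the VIN in it) is entirely hypothetical; it’s just shaped like the text the patterns above expect.

# A hypothetical listing string, shaped like the text the patterns expect
example <- '2013 Ford Edge SEL FWD Mileage: 54,321 Location: Los Angeles, CA Exterior: White Interior: Black VIN: 1FMHK7D8XDGA00000 Price: $15,995'

str_extract(example, '\\$[0-9]*,[0-9]*') %>% gsub('\\$|,', '', .) %>% as.numeric()
## [1] 15995
str_extract(example, 'Mileage: [0-9]*,[0-9]*') %>% gsub('Mileage: |,', '', .) %>% as.numeric()
## [1] 54321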

And here’s what the results look like. You’ve got the original scraped data in the stats column and then everything else that you can parse out. Just like that, you’ve got a tidy data frame of used-car listings.

# df was the dataframe object we needed
df %>% select(-stats) %>% head(10) %>% formattable::formattable()
(Table: the first 10 rows of df, with columns price, year, mileage, trim, awd, manual, v6, location, ext, int, vin, and deal.)
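
Once you’re happy with the parsed columns, it’s worth saving the result so you don’t have to re-scrape it for the analysis in the other post. One simple option is to write the data frame to a CSV; the filename here is just a placeholder.

# Save the parsed listings for later use; the filename is arbitrary
write.csv(df, 'truecar_ford_edge.csv', row.names = FALSE)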

  1. Python’s BeautifulSoup package could probably do a better job.

  2. I tried scraping CarGurus but wasn’t able to paginate, and I had difficulty with CarMax as well. Edmunds was also easy, but Truecar was the easiest.