In another post, I describe how I use this data that I’ve scraped, but I wanted to provide a more in-depth tutorial for those interested in how I got the data. Note, this data belongs to Truecar, so all uses herein are for personal and academic reasons only.

Get the data

In order to do any good analysis, you first need data. I prefer more data to less where possible. In this case I don’t have any, so I scrape it from the web. There are much better tutorials on web scraping, so I’ll keep this light. I use R’s rvest package here, which does a decent job.[1] Let’s look at Truecar’s used car postings.[2] First, I use Google to find a Truecar search query that I like.

# Load packages
library(rvest)
library(dplyr)
library(magrittr)
# Find the URL of the data you want to scrape
url <- 'https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/year-2015-max/?trimSlug=sel-awd'
read_html(url)
## {xml_document}
## <html lang="en-US" data-qa="Index">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body><div>\n<div><div id="main"><div><div data-qa="TransitionHooks" ...

You’ll see there’s a head and a body. Our data’s in the body, so let’s use html_nodes() and html_text() to parse out the data we want. I used SelectorGadget to find which CSS classes to search for.

read_html(url) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
##  [1] "2015 Ford EdgeSEMileage: 45,702 milesLocation: Santa Ana, CAExterior: Ingot Silver MetallicInterior: BlackVIN: 2FMTK3G84FBB96919$14,899View Details"                                       
##  [2] "2015 Ford EdgeSEMileage: 31,750 milesLocation: Los Angeles, CAExterior: SilverInterior: BlackVIN: 2FMTK3G96FBB52901 Rating: Great Price$17,980$6,458 below marketView Details"             
##  [3] "2015 Ford EdgeSEMileage: 62,844 milesLocation: Pacoima, CAExterior: BlackVIN: 2FMTK4G87FBB97410$15,482View Details"                                                                        
##  [4] "2015 Ford EdgeSEMileage: 36,399 milesLocation: Huntington Beach, CAExterior: Tuxedo BlackInterior: EbonyVIN: 2FMTK3G97FBB92565$17,890View Details"                                         
##  [5] "2015 Ford EdgeSEMileage: 35,874 milesLocation: Huntington Beach, CAExterior: Tuxedo Black MetallicInterior: EbonyVIN: 2FMTK3G98FBC34774$18,990View Details"                                
##  [6] "2015 Ford EdgeSELMileage: 21,829 milesLocation: Los Angeles, CAExterior: GrayInterior: BlackVIN: 2FMTK3J95FBB01835 Rating: Great Price$20,480$7,063 below marketView Details"              
##  [7] "2015 Ford EdgeSEMileage: 62,747 milesLocation: Montclair, CAExterior: GrayInterior: TanVIN: 2FMTK3G98FBB43553Discount Available  Rating: Great Price$15,400$8,373 below marketView Details"
##  [8] "2015 Ford EdgeSEMileage: 7,526 milesLocation: Glendale, CAExterior: WhiteInterior: DuneVIN: 2FMTK3G95FBC20895$22,500View Details"                                                          
##  [9] "2015 Ford EdgeSEMileage: 22,098 milesLocation: Downey, CAExterior: WhiteInterior: BlackVIN: 2FMTK4G88FBC08043$20,400View Details"                                                          
## [10] "2015 Ford EdgeSEMileage: 30,332 milesLocation: Downey, CAExterior: WhiteInterior: BlackVIN: 2FMTK4G92FBB40573$19,900View Details"                                                          
## [11] "2015 Ford EdgeSELMileage: 33,841 milesLocation: Downey, CAExterior: BlackInterior: BlackVIN: 2FMTK3J86FBB20045 Rating: Great Price$21,500$4,454 below marketView Details"                  
## [12] "2015 Ford EdgeSELMileage: 5,470 milesLocation: Glendora, CAExterior: Guard MetallicInterior: EbonyVIN: 2FMTK3J98FBC40065$23,965View Details"                                               
## [13] "2015 Ford EdgeTitaniumMileage: 30,703 milesLocation: Montebello, CAExterior: Tuxedo Black MetallicInterior: BlackVIN: 2FMPK3K8XFBB06293$22,499View Details"                                
## [14] "2015 Ford EdgeSELMileage: 41,010 milesLocation: GARDEN GROVE, CAExterior: GrayInterior: EbonyVIN: 2FMTK3J94FBB47723$22,495View Details"                                                    
## [15] "2015 Ford EdgeSELMileage: 77,865 milesLocation: Rancho Santa Margarita, CAExterior: Tuxedo Black MetallicInterior: DuneVIN: 2FMTK3J94FBB47740$18,700View Details"                          
## [16] "2015 Ford EdgeSELMileage: 18,208 milesLocation: Carson, CAExterior: Ingot Silver MetallicInterior: EbonyVIN: 2FMTK4J82FBC05227Discount Available $24,997View Details"                      
## [17] "2015 Ford EdgeSELMileage: 30,942 milesLocation: Huntington Beach, CAExterior: Ingot SilverInterior: EbonyVIN: 2FMTK3J98FBB45599$23,490View Details"                                        
## [18] "2015 Ford EdgeSELMileage: 38,035 milesLocation: Downey, CAExterior: WhiteInterior: BeigeVIN: 2FMTK4J93FBB59643$22,100View Details"                                                         
## [19] "2015 Ford EdgeSELMileage: 38,633 milesLocation: Downey, CAExterior: BlackInterior: BlackVIN: 2FMTK4J9XFBC10183$22,200View Details"                                                         
## [20] "2015 Ford EdgeSELMileage: 27,169 milesLocation: Alhambra, CAExterior: BlackInterior: EbonyVIN: 2FMTK3J94FBB45521$24,991View Details"                                                       
## [21] "2015 Ford EdgeSELMileage: 24,922 milesLocation: Long Beach, CAExterior: BlackInterior: EbonyVIN: 2FMTK3J8XFBB55896Discount Available $24,378View Details"                                  
## [22] "2015 Ford EdgeTitaniumMileage: 19,284 milesLocation: Huntington Beach, CAExterior: Tuxedo Black MetallicInterior: EbonyVIN: 2FMPK3K87FBB67701$25,590View Details"                          
## [23] "2015 Ford EdgeTitaniumMileage: 36,482 milesLocation: Downey, CAExterior: SilverInterior: BlackVIN: 2FMTK3K9XFBB06656 Rating: Great Price$23,400$6,371 below marketView Details"            
## [24] "2015 Ford EdgeSELMileage: 29,704 milesLocation: Huntington Beach, CAExterior: Tuxedo BlackInterior: EbonyVIN: 2FMTK3J99FBB47748$24,890View Details"                                        
## [25] "2015 Ford EdgeSELMileage: 19,405 milesLocation: Long Beach, CAExterior: BlackInterior: EbonyVIN: 2FMTK3J80FBB34135Discount Available $25,133View Details"
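
To see what html_nodes() and html_text() are doing without hitting the live site, here is a minimal sketch on a made-up HTML snippet. The markup and listing text below are assumptions for illustration only; just the CSS class names are taken from the selector above.

```r
library(rvest)

# Hypothetical markup that mimics two listing cards; this is NOT Truecar's real HTML.
html <- '
<html><body>
  <div class="col-xs-6 col-sm-8 no-left-padding">2015 Ford Edge SE Mileage: 45,702 miles $14,899</div>
  <div class="col-xs-6 col-sm-8 no-left-padding">2015 Ford Edge SEL Mileage: 21,829 miles $20,480</div>
  <div class="sidebar">Not a listing</div>
</body></html>'

# read_html() also accepts a literal HTML string, not just a URL
page <- read_html(html)

# A selector with chained classes only matches elements that carry all three classes
cards <- page %>%
  html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>%
  html_text()

print(cards)
```

The sidebar div is skipped because it has none of the three classes, so you get one string per listing card.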

So that’s how you get the data on a single page. If you look closer at the URL, you’ll see it is neatly structured: first the make, then the model, then the location zip code, then the year range, and finally the trim. If you click through a few more results pages, you’ll see the URL gains a ?page=2 query parameter.

https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/?page=2
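
Since the URL is just a handful of parts glued together, one way to build it is a small helper. This is only a sketch: build_url() and its argument names are hypothetical, and the URL pattern is the one observed above, which the site may change at any time.

```r
# Hypothetical helper: assembles a listings URL from its parts.
# The pattern mirrors the URLs shown above; it is not an official API.
build_url <- function(make, model, zip, min_year, page = 1) {
  sprintf(
    'https://www.truecar.com/used-cars-for-sale/listings/%s/%s/location-%s/year-%d-max/?page=%d',
    make, model, zip, as.integer(min_year), as.integer(page)
  )
}

# sprintf() is vectorized, so passing several page numbers returns several URLs
urls <- build_url('ford', 'edge', '90210', 2015, page = 1:3)
print(urls)
```

Keeping the zip code as a string preserves any leading zeros.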

This is our ‘in’ to scraping multiple pages. I won’t walk through every detail of getting the data into a neat data frame, but the outline is simple: build the URL for each page, loop through the pages, then use lots of str_extract() from the stringr package and gsub() to clean up the data.

library(stringr)

make = 'ford'
model = 'edge'
zip = '90210'
year = 2012
npages = 5

url <- paste('https://www.truecar.com/used-cars-for-sale/listings/', 
             make, '/', 
             model ,
             '/location-', zip,
             '/year-',year,'-max/?page=', sep = "")

urls <- paste(url, 1:npages, sep = "")

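# Scrape one results page; try() returns a "try-error" object instead of stopping on failure
scrape <- function(pageno){
  try(
    read_html(urls[pageno]) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
  )
}

long_list = scrape(1)
# seq_len(npages)[-1] is empty when npages is 1, whereas 2:npages would count down to 1
for(i in seq_len(npages)[-1]){
  print(i)
  new_list = scrape(i)
  
  # Stop at the first page that fails to load
  if( inherits(new_list, "try-error") ){
    break
  }
  long_list = c(long_list, new_list)
}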
## [1] 2
## [1] 3
## [1] 4
## [1] 5
stats <- long_list
df <- as.data.frame(stats)
df$stats %<>% as.character()
df$price <- str_extract(df$stats, '\\$[0-9]*,[0-9]*') %>% 
  gsub('Price: |\\$|,', '', .) %>%
  as.numeric()
df$year <- str_extract(df$stats, '^[0-9]* ') %>% 
  as.numeric()
df$mileage <- str_extract(df$stats, 'Mileage: [0-9]*,[0-9]*') %>% 
  gsub('Mileage: |,', '', .) %>%
  as.numeric()

# a = df$stats[1]
df$trim <- str_extract(df$stats, '.*Mileage:') %>% 
  gsub('FWD|AWD|4x[24]|[24]WD|V6|4-cyl|^[0-9][0-9][0-9][0-9]|4dr|Automatic|Manual|Mileage:', '', ., ignore.case = TRUE) %>% 
  gsub(make, '', ., ignore.case = TRUE) %>% 
  gsub(model, '', ., ignore.case = TRUE) %>% 
  trimws() 


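df$awd <- grepl('AWD|4WD|4x4', df$stats, ignore.case = TRUE) %>% as.numeric()
# Match case-insensitively so that "Manual" in a listing title is caught as well
df$manual <- grepl('manual', df$stats, ignore.case = TRUE) %>% as.numeric()
df$v6 <- grepl('V6', df$stats) %>% as.numeric()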
df$location <- str_extract(df$stats, 'Location: .*Exterior:') %>% 
  gsub('Location: |Exterior:', '', .) %>% 
  trimws() 
df$ext <- str_extract(df$stats, 'Exterior: .*Interior:') %>% 
  gsub('Interior:|Exterior:', '', .) %>% 
  trimws() 
df$int <- str_extract(df$stats, 'Interior: .*VIN:') %>% 
  gsub('Interior: |VIN:', '', .) %>% 
  trimws() 
df$vin <- str_extract(df$stats, 'VIN: .*\\$') %>% 
  gsub('VIN: |\\$', '', .) %>% 
  substr(., 1, 17)
df$deal <- str_extract(df$stats, '\\$[0-9]*,[0-9]* below') %>% 
  gsub('below|\\$|,', '', .) %>% trimws() %>%
  as.numeric()
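
To make the extraction steps concrete, here is a sketch that runs the same patterns on a single listing string copied from the output earlier in this post. The variable name x is just for illustration, and the patterns assume the listing text keeps this exact layout.

```r
library(stringr)

# One raw listing string, copied from the scraped output shown earlier
x <- "2015 Ford EdgeSEMileage: 31,750 milesLocation: Los Angeles, CAExterior: SilverInterior: BlackVIN: 2FMTK3G96FBB52901 Rating: Great Price$17,980$6,458 below marketView Details"

# str_extract() returns the FIRST match, so the asking price wins over the discount
price <- as.numeric(gsub('\\$|,', '', str_extract(x, '\\$[0-9]*,[0-9]*')))

# Pull the labelled mileage, then strip the label and the thousands separator
mileage <- as.numeric(gsub('Mileage: |,', '', str_extract(x, 'Mileage: [0-9]*,[0-9]*')))

# The discount is the dollar amount followed by " below"
deal <- as.numeric(trimws(gsub('below|\\$|,', '', str_extract(x, '\\$[0-9]*,[0-9]* below'))))

# Location is whatever sits between the "Location: " and "Exterior:" labels
location <- trimws(gsub('Location: |Exterior:', '', str_extract(x, 'Location: .*Exterior:')))

# The greedy match runs to the last "$", so keep only the first 17 characters (a VIN's length)
vin <- substr(gsub('VIN: |\\$', '', str_extract(x, 'VIN: .*\\$')), 1, 17)

print(list(price = price, mileage = mileage, deal = deal, location = location, vin = vin))
```

For this string the price comes out as 17980, the mileage as 31750, and the deal as 6458. Note that a listing with no "Interior:" label will give NA for any pattern that relies on it.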

And here’s what the results look like. The original scraped text is in the stats column, and everything else is parsed out of it. Just like that, you’ve got a tidy data frame:

# df was the dataframe object we needed
df %>% select(-stats) %>% head(10) %>% formattable::formattable()
| price | year | mileage | trim | awd | manual | v6 | location | ext | int | vin | deal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12967 | 2012 | 87179 | SEL | 0 | 0 | 0 | Long Beach, CA | Ingot Silver Metallic | Charcoal Black | 2FMDK3JCXCBA82243 | NA |
| 13333 | 2012 | 83415 | Limited | 0 | 0 | 0 | Thousand Oaks, CA | White | Tan | 2FMDK3KC6CBA95361 | NA |
| 18000 | 2012 | 41470 | SEL | 0 | 0 | 0 | Woodland Hills, CA | NA | NA | 2FMDK3J94CBA49744 | NA |
| 14992 | 2012 | 77883 | Limited | 0 | 0 | 0 | Mission Hills, CA | NA | NA | 2FMDK3KC2CBA41460 | NA |
| 13980 | 2012 | 83603 | Limited | 0 | 0 | 0 | Irvine, CA | Ingot Silver Metallic | Charcoal Black | 2FMDK3K96CBA46150 | NA |
| 17980 | 2012 | 58297 | Sport | 0 | 0 | 0 | Buena Park, CA | Tuxedo Black Metallic | Charcoal Black | 2FMDK3AK7CBA93272 | NA |
| 14777 | 2012 | 74332 | Limited | 0 | 0 | 0 | Ontario, CA | NA | NA | 2FMDK3K91CBA92484 | NA |
| 17399 | 2012 | 68191 | Sport | 0 | 0 | 0 | Santa Ana, CA | Black | Charcoal Black | 2FMDK3AKXCBA12748 | NA |
| 12967 | 2012 | 87179 | SEL | 0 | 0 | 0 | Long Beach, CA | Ingot Silver Metallic | Charcoal Black | 2FMDK3JCXCBA82243 | NA |
| 13333 | 2012 | 83415 | Limited | 0 | 0 | 0 | Thousand Oaks, CA | White | Tan | 2FMDK3KC6CBA95361 | NA |

  1. Python’s BeautifulSoup package could probably do a better job.

  2. I tried scraping CarGurus, but wasn’t able to paginate. I tried scraping CarMax, but had difficulty. Edmunds was also easy, but Truecar was easiest.