In another post, I describe how I use this data that I’ve scraped, but I wanted to provide a more in-depth tutorial for those interested in how I got the data. Note, this data belongs to Truecar, so all uses herein are for personal and academic reasons only.

Get the data

In order to do any good analysis, you first need data. I prefer to have more data rather than less, where possible. In this case, I don’t have any data, so I use web scraping to get it. There are much better tutorials on how to scrape data, so I’ll keep this light. I use R’s rvest package here, which does a decent job.[1] Let’s look at Truecar’s used car postings.[2] First I use Google to find the search query on Truecar that I like.

# Load packages
library(rvest)
library(dplyr)
library(magrittr)
# Find the URL of the data you want to scrape
url <- 'https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/year-2015-max/?trimSlug=sel-awd'
read_html(url)
## {xml_document}
## <html lang="en-US" data-qa="Index">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body><div>\n<div><div id="main"><div><div data-qa="TransitionHooks" ...

You’ll see there’s a head and a body. Our data’s in the body, so let’s use html_nodes() and html_text() to parse out the data we want. I used Selectorgadget to know what HTML classes to search for.

read_html(url) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
##  [1] "2015 Ford EdgeSE FWDMileage: 32,880 milesLocation: Alhambra, CAExterior: GrayInterior: EbonyVIN: 2FMTK3G89FBB59011$17,711View Details"                                                         
##  [2] "2015 Ford EdgeSE FWDMileage: 70,069 milesLocation: Whittier, CAExterior: GreyInterior: BlackVIN: 2FMTK3G84FBB39684Discount Available $13,995View Details"                                      
##  [3] "2015 Ford EdgeSEL FWDMileage: 54,453 milesLocation: Placentia, CAExterior: WhiteInterior: EbonyVIN: 2FMTK3J82FBC23818$15,499View Details"                                                      
##  [4] "2015 Ford EdgeSE FWDMileage: 18,969 milesLocation: Placentia, CAExterior: Tuxedo Black MetallicInterior: EbonyVIN: 2FMTK3G90FBB67457$18,499View Details"                                       
##  [5] "2015 Ford EdgeSE FWDMileage: 18,444 milesLocation: Downey, CAExterior: BlueInterior: BlackVIN: 2FMTK3G9XFBC34534$19,497View Details"                                                           
##  [6] "2015 Ford EdgeSE FWDMileage: 79,835 milesLocation: Santa Ana, CAExterior: Magnetic MetallicInterior: DuneVIN: 2FMTK3G88FBB43320$13,488View Details"                                            
##  [7] "2015 Ford EdgeSEL FWDMileage: 45,104 milesLocation: Downey, CAExterior: WhiteInterior: GoldVIN: 2FMTK3J92FBB73835$17,499View Details"                                                          
##  [8] "2015 Ford EdgeSE FWDMileage: 56,510 milesLocation: Downey, CAExterior: GrayInterior: BeigeVIN: 2FMTK3G85FBB43338$16,000View Details"                                                           
##  [9] "2015 Ford EdgeSE FWDMileage: 77,145 milesLocation: Santa Ana, CAExterior: Oxford WhiteInterior: EbonyVIN: 2FMTK3G90FBB43093$13,998View Details"                                                
## [10] "2015 Ford EdgeSE FWDMileage: 55,315 milesLocation: Alhambra, CAExterior: WhiteInterior: EbonyVIN: 2FMTK3G91FBB67564$17,711View Details"                                                        
## [11] "2015 Ford EdgeSE FWDMileage: 62,747 milesLocation: Montclair, CAExterior: GrayInterior: TanVIN: 2FMTK3G98FBB43553Discount Available  Rating: Great Price$14,995$8,778 below marketView Details"
## [12] "2015 Ford EdgeSE FWDMileage: 35,553 milesLocation: Costa Mesa, CAExterior: Oxford WhiteInterior: BlackVIN: 2FMTK3G88FBB38926$17,790View Details"                                               
## [13] "2015 Ford EdgeSEL FWDMileage: 62,015 milesLocation: Torrance, CAExterior: White Platinum Metallic Tri-CoatInterior: DuneVIN: 2FMTK3J95FBB34110$18,995View Details"                             
## [14] "2015 Ford EdgeSE FWDMileage: 28,875 milesLocation: Costa Mesa, CAExterior: Oxford WhiteInterior: GrayVIN: 2FMTK3G80FBB39150$18,490View Details"                                                
## [15] "2015 Ford EdgeSEL AWDMileage: 26,495 milesLocation: Carson, CAExterior: BlackVIN: 2FMTK4J88FBB21347Discount Available $20,991View Details"                                                     
## [16] "2015 Ford EdgeSEL FWDMileage: 83,984 milesLocation: Van Nuys, CAExterior: Guard MetallicInterior: DuneVIN: 2FMTK3J83FBB04109$15,900View Details"                                               
## [17] "2015 Ford EdgeSE AWDMileage: 35,779 milesLocation: Downey, CAExterior: Dk. GrayInterior: BlackVIN: 2FMTK4G94FBB63109$19,400View Details"                                                       
## [18] "2015 Ford EdgeSEL FWDMileage: 29,516 milesLocation: Torrance, CAExterior: BlackInterior: EbonyVIN: 2FMTK3J97FBB92848$22,897View Details"                                                       
## [19] "2015 Ford EdgeSEL AWDMileage: 18,208 milesLocation: Carson, CAExterior: Ingot Silver MetallicInterior: EbonyVIN: 2FMTK4J82FBC05227Discount Available $22,195View Details"                      
## [20] "2015 Ford EdgeSEL AWDMileage: 26,321 milesLocation: Torrance, CAExterior: Oxford WhiteInterior: EbonyVIN: 2FMTK4J93FBB60937$22,987View Details"                                                
## [21] "2015 Ford EdgeSEL FWDMileage: 34,004 milesLocation: Orange, CAExterior: GrayVIN: 2FMTK3J94FBB67678Discount Available $20,457View Details"                                                      
## [22] "2015 Ford EdgeSEL FWDMileage: 27,108 milesLocation: Montebello, CAExterior: GrayInterior: BlackVIN: 2FMTK3J86FBB60853$21,499View Details"                                                      
## [23] "2015 Ford EdgeSEL FWDMileage: 57,505 milesLocation: Garden Grove, CAExterior: Ingot Silver MetallicInterior: EbonyVIN: 2FMPK3J92FBB74727Discount Available $17,995View Details"                
## [24] "2015 Ford EdgeSEL FWDMileage: 30,445 milesLocation: Orange, CAExterior: WhiteVIN: 2FMTK3J95FBB45589Discount Available $20,997View Details"                                                     
## [25] "2015 Ford EdgeSEL FWDMileage: 17,376 milesLocation: Huntington Beach, CAExterior: White Platinum Metallic Tri-CoatInterior: EbonyVIN: 2FMTK3J91FBC26878$22,790View Details"                    
## [26] "2015 Ford EdgeSEL FWDMileage: 23,339 milesLocation: Placentia, CAExterior: GrayVIN: 2FMTK3J9XFBC39953Discount Available $21,999View Details"                                                   
## [27] "2015 Ford EdgeSEL FWDMileage: 19,157 milesLocation: Downey, CAExterior: SilverInterior: BlackVIN: 2FMTK3J98FBB46896$22,400View Details"                                                        
## [28] "2015 Ford EdgeSEL FWDMileage: 66,586 milesLocation: North Hollywood, CAExterior: Tuxedo Black MInterior: BlackVIN: 2FMTK3J80FBB06271Discount Available $20,995View Details"                    
## [29] "2015 Ford EdgeSEL AWDMileage: 76,547 milesLocation: Downey, CAExterior: Dk. GrayInterior: BeigeVIN: 2FMTK4J90FBB52746$17,700View Details"                                                      
## [30] "2015 Ford EdgeSEL FWDMileage: 32,301 milesLocation: Downey, CAExterior: SilverInterior: BlackVIN: 2FMTK3J82FBB47792$21,400View Details"
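
To see what html_nodes() and html_text() are doing without hitting the live site, here is a minimal sketch on a hand-written HTML snippet. The markup and the class name `listing` are made up for illustration; Truecar’s real markup uses the classes shown above and may change at any time.

```r
library(rvest)

# Hypothetical markup: two listing cards plus an unrelated element
page <- read_html('
  <html><body>
    <div class="listing"><h4>2015 Ford Edge</h4><span>Mileage: 32,880 miles</span></div>
    <div class="listing"><h4>2015 Ford Edge</h4><span>Mileage: 70,069 miles</span></div>
    <div class="footer">About us</div>
  </body></html>')

# html_nodes() keeps only the elements matching the CSS selector;
# html_text() returns all the text inside each matched element as one string
listings <- page %>% html_nodes('.listing') %>% html_text()
listings
```

Note that html_text() joins the text of nested elements with no separator, so the heading and the mileage run together here. That is probably why the real output contains strings like "EdgeSE FWDMileage:".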

So that’s how you get the data on a single page. If you look closer at the URL, you’ll see it’s built from helpful pieces: first the make, then the model, then the location ZIP code, then the year range, and finally the trim. This is a very pretty and clean URL. If you click through to additional pages, you’ll see the URL picks up a ?page=2 parameter.

https://www.truecar.com/used-cars-for-sale/listings/ford/edge/location-90210/?page=2

This is our ‘in’ to scraping multiple pages. I won’t bore you with the details of how to get that data into a neat matrix for us to analyze, but suffice it to say that I’m able to do it. Just build a function to construct a URL, and build a loop to go through the different pages, then use lots of str_extract from the stringr package and gsub to clean up the data.
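
As a sketch of just the URL-building step, the pieces described above can be wrapped in a small function. The name `build_urls` and its arguments are my own for illustration, and the URL pattern is only what Truecar used at the time of writing.

```r
# Hypothetical helper: build one URL per results page from the pieces of the query
build_urls <- function(make, model, zip, min_year, npages) {
  base <- sprintf(
    'https://www.truecar.com/used-cars-for-sale/listings/%s/%s/location-%s/year-%s-max/',
    make, model, zip, min_year
  )
  # Append the page number to get one URL per page
  paste0(base, '?page=', seq_len(npages))
}

urls <- build_urls('ford', 'edge', '90210', 2012, npages = 3)
urls
```

Keeping the URL logic in one place makes it easier to swap in a different make, model, or ZIP code without touching the scraping loop.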

library(stringr)

make = 'ford'
model = 'edge'
zip = '90210'
year = 2012
npages = 5

url <- paste('https://www.truecar.com/used-cars-for-sale/listings/', 
             make, '/', 
             model ,
             '/location-', zip,
             '/year-',year,'-max/?page=', sep = "")

urls <- paste(url, 1:npages, sep = "")

scrape <- function(pageno){
  try(
    read_html(urls[pageno]) %>% html_nodes('.col-xs-6.col-sm-8.no-left-padding') %>% html_text()
  )
}

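long_list = scrape(1)
for(i in seq_len(npages)[-1]){    # pages 2..npages; empty when npages is 1
  print(i)
  new_list = scrape(i)            # scrape() already wraps the request in try()
  
  # Stop at the first page that fails to load
  if( inherits(new_list, "try-error") ) break
  
  long_list = c(long_list, new_list)
}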
## [1] 2
## [1] 3
## [1] 4
## [1] 5
# One row per listing, with the raw scraped text in the stats column
df <- data.frame(stats = as.character(long_list), stringsAsFactors = FALSE)
df$price <- str_extract(df$stats, '\\$[0-9]*,[0-9]*') %>% 
  gsub('Price: |\\$|,', '', .) %>%
  as.numeric()
df$year <- str_extract(df$stats, '^[0-9]* ') %>% 
  as.numeric()
df$mileage <- str_extract(df$stats, 'Mileage: [0-9]*,[0-9]*') %>% 
  gsub('Mileage: |,', '', .) %>%
  as.numeric()

# Trim level: take everything before "Mileage:", then strip the year,
# drivetrain/engine/transmission keywords, and the make and model names
df$trim <- str_extract(df$stats, '.*Mileage:') %>% 
  gsub('FWD|AWD|4x[24]|[24]WD|V6|4-cyl|^[0-9][0-9][0-9][0-9]|4dr|Automatic|Manual|Mileage:', '', ., ignore.case = TRUE) %>% 
  gsub(make, '', ., ignore.case = TRUE) %>% 
  gsub(model, '', ., ignore.case = TRUE) %>% 
  trimws() 

# Indicator columns (1/0) for drivetrain, transmission, and engine
df$awd <- grepl('AWD|4WD|4x4', df$stats, ignore.case = TRUE) %>% as.numeric()
df$manual <- grepl('manual', df$stats, ignore.case = TRUE) %>% as.numeric()
df$v6 <- grepl('V6', df$stats) %>% as.numeric()
df$location <- str_extract(df$stats, 'Location: .*Exterior:') %>% 
  gsub('Location: |Exterior:', '', .) %>% 
  trimws() 
df$ext <- str_extract(df$stats, 'Exterior: .*Interior:') %>% 
  gsub('Interior:|Exterior:', '', .) %>% 
  trimws() 
df$int <- str_extract(df$stats, 'Interior: .*VIN:') %>% 
  gsub('Interior: |VIN:', '', .) %>% 
  trimws() 
df$vin <- str_extract(df$stats, 'VIN: .*\\$') %>% 
  gsub('VIN: |\\$', '', .) %>% 
  substr(., 1, 17)
df$deal <- str_extract(df$stats, '\\$[0-9]*,[0-9]* below') %>% 
  gsub('below|\\$|,', '', .) %>% trimws() %>%
  as.numeric()
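
One limitation of the patterns above: the ext pattern needs an "Interior:" label to follow, so listings that omit the interior color (such as [15] in the single-page output earlier) get NA for the exterior as well. Here is a hedged alternative sketch, not the method used in this post, that uses stringr’s str_match() with a lazy capture so the exterior stops at whichever label comes first. The two sample strings are copied from the scraped output above.

```r
library(stringr)

# Two listings from the scraped output; the second has no "Interior:" field
listings <- c(
  "2015 Ford EdgeSE FWDMileage: 32,880 milesLocation: Alhambra, CAExterior: GrayInterior: EbonyVIN: 2FMTK3G89FBB59011$17,711View Details",
  "2015 Ford EdgeSEL AWDMileage: 26,495 milesLocation: Carson, CAExterior: BlackVIN: 2FMTK4J88FBB21347Discount Available $20,991View Details"
)

# Lazy capture: stop at whichever label comes first, "Interior:" or "VIN:"
ext <- trimws(str_match(listings, 'Exterior: (.*?)(?:Interior:|VIN:)')[, 2])

# Interior stays NA when the label is missing
int <- trimws(str_match(listings, 'Interior: (.*?)VIN:')[, 2])

# A VIN is 17 characters, so capture exactly that many after the label
vin <- str_match(listings, 'VIN: ([A-Z0-9]{17})')[, 2]

data.frame(ext, int, vin)
```

str_match() returns a matrix whose second column holds the first capture group, which avoids the extra gsub() step for stripping the labels.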

And here’s what the results look like. You’ve got the original scraped data in the stats column and then everything else that you can parse out. Just like that, you’ve got a data frame ready to analyze.

# df was the dataframe object we needed
df %>% select(-stats) %>% head(10) %>% formattable::formattable()
| price | year | mileage | trim | awd | manual | v6 | location | ext | int | vin | deal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 12599 | 2012 | 87724 | SEL | 0 | 0 | 0 | La Habra, CA | Silver | Beige | 2FMDK3J9XCBA82165 | NA |
| 13711 | 2012 | 94112 | Limited | 0 | 0 | 0 | Alhambra, CA | White Platinum | Charcoal Black | 2FMDK3KC2CBA41538 | NA |
| 14986 | 2012 | 61006 | Limited | 0 | 0 | 0 | Corona, CA | Silver | Beige | 2FMDK3KCXCBA73167 | NA |
| 13900 | 2012 | 88030 | SEL | 0 | 0 | 0 | Downey, CA | Green | Black | 2FMDK4JC2CBA59786 | NA |
| 13475 | 2012 | 84597 | SEL | 0 | 0 | 0 | Ontario, CA | White Suede | Charcoal Black | 2FMDK3J98CBA10168 | NA |
| 17991 | 2012 | 33261 | Limited | 0 | 0 | 0 | Santa Monica, CA | NA | NA | 2FMDK3KC4CBA69535 | NA |
| 13500 | 2012 | 76689 | SEL | 0 | 0 | 0 | Lawndale, CA | Brown | Gharcl | 2FMDK3J92CBA70690 | NA |
| 12599 | 2012 | 87724 | SEL | 0 | 0 | 0 | La Habra, CA | Silver | Beige | 2FMDK3J9XCBA82165 | NA |
| 13711 | 2012 | 94112 | Limited | 0 | 0 | 0 | Alhambra, CA | White Platinum | Charcoal Black | 2FMDK3KC2CBA41538 | NA |
| 14986 | 2012 | 61006 | Limited | 0 | 0 | 0 | Corona, CA | Silver | Beige | 2FMDK3KCXCBA73167 | NA |
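
The last three rows of the preview repeat the first three (same VINs), so the scrape can return the same listing more than once. The cause isn’t clear from the output alone. Since a VIN identifies a single car, one way to drop the repeats is to keep only the first row for each VIN. This is a minimal sketch on a toy data frame (named `cars` here for illustration, with prices and VINs copied from the preview), not a step from the original script.

```r
# Toy data frame standing in for df; the last row repeats the first VIN
cars <- data.frame(
  price = c(12599, 13711, 14986, 13900, 12599),
  vin   = c('2FMDK3J9XCBA82165', '2FMDK3KC2CBA41538', '2FMDK3KCXCBA73167',
            '2FMDK4JC2CBA59786', '2FMDK3J9XCBA82165'),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each VIN,
# so negating it keeps the first row seen for every car
cars_unique <- cars[!duplicated(cars$vin), ]
nrow(cars_unique)
```

The same line works on the real data as `df[!duplicated(df$vin), ]`, assuming the vin column was parsed correctly.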

  1. Python’s BeautifulSoup package could probably do a better job.

  2. I tried scraping CarGurus, but wasn’t able to paginate. I tried scraping CarMax, but had difficulty. Edmunds was also easy, but Truecar was easiest.