tim laqua dot com Thoughts and Code from Tim Laqua


Aggregating Seller Feedback from Amazon.com

I received a request from a professor to aggregate all of the feedback for a given Amazon.com seller. The problem is that Amazon.com seller feedback is displayed 25 at a time on the feedback page and, of course, even if we could get all of them on one page, there is NO way we're going to manually move all the records to a database when there are 50,000+ feedback records.

In this post, I will discuss one approach to solving this problem

Plan A - Use the Amazon.com API
After some research, I came up with this query to pull feedback via their ECS api doodad:


So that's nice... but it only returns 5 records. On top of that, the valid range of FeedbackPage is 1-10, so it only allows a maximum of 50 (5 records per page by 10 pages) records. Not cool.

Plan B - Scraping HTML pages via XMLHTTP
Step 1: Figure out what pages we want to scrape
Lets start here:

There's a "Next Page" link at the bottom that goes here:

Notice that the distinguishing query parameter is pageNumber. So, for simplicity, lets see if pageNumber=0 yields the first page:

Ok, that worked as expected. Now, lets see what happens when we get to the last page (actually, the page after last) by thowing in a huge page number:

Perfect - no records. So now for a given seller, we can just start at pageNumber 0 and keep going up until the returned page doesn't have any feedback on it.

Step 2 - What's a record look like?
We need to define a regular expression pattern to identify a feedback record in the HTML pages. Now, this doesn't have to be perfect, it just has to work for now. These patterns will be dependant on the HTML formatting of the site (if they don't use IDs - Amazon.com doesn't) - so if they change formatting too much, we'll have to make a new pattern. No real big deal, but it is important to note that this isn't an exact science. Just make it work.

View the feedback page source and locate a record:

                  5 out of 5:
Date: 4/4/2008     Rated by Buyer: Jason M.

Now - I find it's easiest for this sort of thing to just ignore tabs(\t), newline characters(\n) and carridge return characters(\r) because we will just strip all those before we apply the pattern.

Locate the pieces of info you want to scrap in HTML and convert to a regular expression

Target Data
HTML RegEx Pattern
Feedback rating
5 out of 5
(\d) out of (\d)
Date: 4/4/2008
User name
Rated by Buyer: Jason M.
Rated by Buyer: (.*?)<

And then we just glue it all together with LAZY wildcard matches:

(\d) out of (\d).*?(.*?)
.*?Date:\s+(\d\d?/\d\d?/\d{4}).*?Rated by Buyer: (.*?)<

Step 3: Implement!

'* amazonSellerFeedback.vbs
'* Tim Laqua, 2008
'* Usage: amazonSellerFeedback.vbs SELLERID[,SELLERID[,SELLERID,[...]]]
Set http = createObject("Microsoft.XMLHTTP")
Set regEx = New RegExp
regEx.IgnoreCase = True
regEx.Global = True

' Get the Seller(s) from the command line
Set objArgs = WScript.Arguments
If objArgs.Count < 1 Then
  WScript.Echo "Please specify seller ID"
End If

' We allow multiple sellers separated by commas, so split the sellers on a comma
arrSellers = Split(objArgs.Item(0), ",")

' Loop for each seller
For Each strSeller in arrSellers
  WScript.Echo "Processing Seller: " & strSeller
  intPage = 1
  ' This pattern identifies a single feedback record
  strRecordPattern = "(\d) out of (\d).*?" & _
.*?" & _ "Date:\s+(\d\d?/\d\d?/\d{4}).*?" & _ "Rated by Buyer: (.*?)<" intTimerStart = Timer ' Create the output file strFileName = "amazonSellerFeedback_" & strSeller & ".csv" Set objFSO = CreateObject("Scripting.FileSystemObject") Set objTextFile = objFSO.CreateTextFile(strFileName, True) ' Write column headers objTextFile.WriteLine "Seller,Rating,Comment,Date,User" ' Populate strFeedbackText with first feedback page strFeedbackText = getFeedback(intPage) regEx.Pattern = strRecordPattern ' Keep going until the page doesn't have any feedback records on it Do While regEx.Test(strFeedbackText) Wscript.Echo "Processing records " & intPage*25 & "-" & (intPage+1)*25 Set colMatches = regEx.Execute(strFeedbackText) For Each objMatch in colMatches strRecord = strSeller & "," & _ objMatch.SubMatches(0) & "," & _ """" & Replace(Trim(objMatch.SubMatches(2)),"""","""""") & """," & _ Trim(objMatch.SubMatches(3)) & "," & _ """" & Replace(Trim(objMatch.SubMatches(4)),"""","""""") & """" 'Write record to file objTextFile.WriteLine(strRecord) Next intPage = intPage + 1 strFeedbackText = getFeedback(intPage) regEx.Pattern = strRecordPattern Loop WScript.Echo "Processing Finished..." & vbCrLf objTextFile.Close ' Stats and status messages WScript.Echo "Data Saved to " & strFileName intProcTime = Timer - intTimerStart WScript.Echo "Processing Time: " & intProcTime & " seconds" WScript.Echo "Per Record Time: " & Round(intProcTime/((intPage - 1)*25), 4) Next ' Function to retreive a a given feedback page for the current strSeller Function getFeedback(intFeedbackPage) strURL = "http://www.amazon.com/gp/help/seller/feedback.html?" & _ "ie=UTF8&" & _ "asin=0471789569&" & _ "pageNumber=" & intPage & "&" & _ "seller=" & strSeller http.open "GET", strURL, False http.send regEx.Pattern = "\n|\r|\t" strMungedResponse = regEx.Replace(http.responseText, "") getFeedback = strMungedResponse End Function

Screenshot of the script running in 7 different threads (I got impatient):

Comments (3) Trackbacks (0)
  1. Your blog is very informative, I have learned so much from it. It is like daily newspaper :). Added to fav’s.

  2. Could you maybe update this…. I’m trying but I don’t know .vbs.

    I only want to do 1 seller… so no looping
    i don’t need the pop up windows telling me what’s going on..

    I tried changing the section to:

    strRecordPattern = “(\d\d?/\d\d?/\d{4}).*?” & _
    <\d).*?” & _
    “(.*?).*?” & _
    “orderID=(.*?)””>(.*?).*?” & _

    intTimerStart = Timer

    ‘ Create the output file
    strFileName = “ASFBack_” & strSeller & “.csv”
    Set objFSO = CreateObject(“Scripting.FileSystemObject”)
    Set objTextFile = objFSO.CreateTextFile(strFileName, True)

    ‘ Write column headers
    objTextFile.WriteLine “Date,Rating,Comment,OderID,Buyer”

    and then the page section to:

    Function getFeedback(intFeedbackPage)
    strURL = “https://sellercentral.amazon.com/gp/feedback-manager/view-all-feedback.html?” & _
    “ie=UTF8&” & _
    “sortType=sortByDate&” & _
    “pageSize=50&” & _
    “dateRange=&” & _
    “currentPage=” & intPage & “&” & _

    I know it also needs to change out… objMatch.SubMatches

    I would appreciate any help……..

  3. oops … try again on the first section….

    strRecordPattern =""(\d\d?/\d\d?/\d{4}).*?"" & _
                                "<\d).*?" & _
                                "(.*?).*?" & _
    							""orderID=(.*?)"">(.*?).*?"" & _

    If I’m not smart enough to get the HTML to show correctly this time I’m sure you can….. here goes….

Leave a comment

No trackbacks yet.