tim laqua dot com Thoughts and Code from Tim Laqua

4Apr/083

Aggregating Seller Feedback from Amazon.com

I received a request from a professor to aggregate all of the feedback for a given Amazon.com seller. The problem is that Amazon.com seller feedback is displayed 25 at a time on the feedback page and, of course, even if we could get all of them on one page, there is NO way we're going to manually move all the records to a database when there are 50,000+ feedback records.

In this post, I will discuss one approach to solving this problem

Plan A - Use the Amazon.com API
After some research, I came up with this query to pull feedback via their ECS api doodad:

http://ecs.amazonaws.com/onca/xml?
  Service=AWSECommerceService&
  AWSAccessKeyId=[myKEYID]&
  Operation=SellerLookup&
  SellerId=[SELLERID]&
  ResponseGroup=Seller&
  FeedbackPage=1

So that's nice... but it only returns 5 records. On top of that, the valid range of FeedbackPage is 1-10, so it only allows a maximum of 50 (5 records per page by 10 pages) records. Not cool.

Plan B - Scraping HTML pages via XMLHTTP
Step 1: Figure out what pages we want to scrape
Lets start here:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&marketplaceSeller=1&seller=ADWEDOQFWH4RS

There's a "Next Page" link at the bottom that goes here:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&pageNumber=1&marketplaceSeller=1&seller=ADWEDOQFWH4RS

Notice that the distinguishing query parameter is pageNumber. So, for simplicity, lets see if pageNumber=0 yields the first page:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&pageNumber=0&marketplaceSeller=1&seller=ADWEDOQFWH4RS

Ok, that worked as expected. Now, lets see what happens when we get to the last page (actually, the page after last) by thowing in a huge page number:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&pageNumber=5000&marketplaceSeller=1&seller=ADWEDOQFWH4RS

Perfect - no records. So now for a given seller, we can just start at pageNumber 0 and keep going up until the returned page doesn't have any feedback on it.

Step 2 - What's a record look like?
We need to define a regular expression pattern to identify a feedback record in the HTML pages. Now, this doesn't have to be perfect, it just has to work for now. These patterns will be dependant on the HTML formatting of the site (if they don't use IDs - Amazon.com doesn't) - so if they change formatting too much, we'll have to make a new pattern. No real big deal, but it is important to note that this isn't an exact science. Just make it work.

View the feedback page source and locate a record:

            <tr>
              <td width="120" valign="top" bgcolor="#F6F6F6">
                <span class="small">
                  <font color="#009900"><b>5</b> out of <b>5</b></font>:
                </span>
              </td>
              <td bgcolor="#F6F6F6">
                <span class="small">
                  "none"
                  <br />
                  Date: 4/4/2008 &nbsp;&nbsp;&nbsp; Rated by Buyer: Jason M.
                </span>
              </td>
            </tr>

Now - I find it's easiest for this sort of thing to just ignore tabs(\t), newline characters(\n) and carridge return characters(\r) because we will just strip all those before we apply the pattern.

Locate the pieces of info you want to scrap in HTML and convert to a regular expression

Target Data
HTML RegEx Pattern
Feedback rating
<b>5</b> out of <b>5</b>
<b>(\d)</b> out of <b>(\d)</b>
Comment
                <span class="small">
                  "none"
                  <br />
<span class=""small"">(.*?)<br />
Date
Date: 4/4/2008
Date:\s+(\d\d?/\d\d?/\d{4})
User name
Rated by Buyer: Jason M.
                </span>
Rated by Buyer: (.*?)<

And then we just glue it all together with LAZY wildcard matches:

<b>(\d)</b> out of <b>(\d)</b>.*?<span class=""small"">(.*?)<br />.*?Date:\s+(\d\d?/\d\d?/\d{4}).*?Rated by Buyer: (.*?)<

Step 3: Implement!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
'********************************************************
'* amazonSellerFeedback.vbs
'* Tim Laqua, 2008
'* 
'* Usage: amazonSellerFeedback.vbs SELLERID[,SELLERID[,SELLERID,[...]]]
'********************************************************
Set http = createObject("Microsoft.XMLHTTP")
Set regEx = New RegExp
regEx.IgnoreCase = True
regEx.Global = True
 
' Get the Seller(s) from the command line
Set objArgs = WScript.Arguments
If objArgs.Count < 1 Then
  WScript.Echo "Please specify seller ID"
  WScript.Quit
End If
 
' We allow multiple sellers separated by commas, so split the sellers on a comma
arrSellers = Split(objArgs.Item(0), ",")
 
' Loop for each seller
For Each strSeller in arrSellers
  WScript.Echo "Processing Seller: " & strSeller
  intPage = 1
 
  ' This pattern identifies a single feedback record
  strRecordPattern = "<b>(\d)</b> out of <b>(\d)</b>.*?" & _
                            "<span class=""small"">(.*?)<br />.*?" & _
                            "Date:\s+(\d\d?/\d\d?/\d{4}).*?" & _
                            "Rated by Buyer: (.*?)<"
 
  intTimerStart = Timer
 
  ' Create the output file
  strFileName = "amazonSellerFeedback_" & strSeller & ".csv"
  Set objFSO = CreateObject("Scripting.FileSystemObject")
  Set objTextFile = objFSO.CreateTextFile(strFileName, True)
 
  ' Write column headers
  objTextFile.WriteLine "Seller,Rating,Comment,Date,User"
 
  ' Populate strFeedbackText with first feedback page
  strFeedbackText = getFeedback(intPage)
  regEx.Pattern = strRecordPattern
 
  ' Keep going until the page doesn't have any feedback records on it
  Do While regEx.Test(strFeedbackText)
    Wscript.Echo "Processing records " & intPage*25 & "-" & (intPage+1)*25
 
    Set colMatches = regEx.Execute(strFeedbackText)
    For Each objMatch in colMatches
      strRecord = strSeller & "," & _
            objMatch.SubMatches(0) & "," & _
            """" & Replace(Trim(objMatch.SubMatches(2)),"""","""""") & """," & _
            Trim(objMatch.SubMatches(3)) & "," & _
            """" & Replace(Trim(objMatch.SubMatches(4)),"""","""""") & """"
 
      'Write record to file
      objTextFile.WriteLine(strRecord)
 
    Next
    intPage = intPage + 1
    strFeedbackText = getFeedback(intPage)
    regEx.Pattern = strRecordPattern
  Loop 
 
  WScript.Echo "Processing Finished..." & vbCrLf
  objTextFile.Close
 
  ' Stats and status messages
  WScript.Echo "Data Saved to " & strFileName
  intProcTime = Timer - intTimerStart
  WScript.Echo "Processing Time: " & intProcTime & " seconds"
  WScript.Echo "Per Record Time: " & Round(intProcTime/((intPage - 1)*25), 4)
Next
 
' Function to retreive a a given feedback page for the current strSeller
Function getFeedback(intFeedbackPage)
  strURL =   "http://www.amazon.com/gp/help/seller/feedback.html?" & _
        "ie=UTF8&" & _
        "asin=0471789569&" & _
        "pageNumber=" & intPage & "&" & _
        "seller=" & strSeller
 
  http.open "GET", strURL, False
  http.send
 
  regEx.Pattern = "\n|\r|\t"
  strMungedResponse = regEx.Replace(http.responseText, "")
 
  getFeedback = strMungedResponse
End Function

Screenshot of the script running in 7 different threads (I got impatient):

Comments (3) Trackbacks (0)
  1. Your blog is very informative, I have learned so much from it. It is like daily newspaper :). Added to fav’s.

  2. Could you maybe update this…. I’m trying but I don’t know .vbs.

    I only want to do 1 seller… so no looping
    i don’t need the pop up windows telling me what’s going on..

    I tried changing the section to:

    strRecordPattern = “(\d\d?/\d\d?/\d{4}).*?” & _
    <\d).*?” & _
    “(.*?).*?” & _
    “orderID=(.*?)””>(.*?).*?” & _
    “buyerID=(.*?)””>(.*?)”

    intTimerStart = Timer

    ‘ Create the output file
    strFileName = “ASFBack_” & strSeller & “.csv”
    Set objFSO = CreateObject(“Scripting.FileSystemObject”)
    Set objTextFile = objFSO.CreateTextFile(strFileName, True)

    ‘ Write column headers
    objTextFile.WriteLine “Date,Rating,Comment,OderID,Buyer”

    and then the page section to:

    Function getFeedback(intFeedbackPage)
    strURL = “https://sellercentral.amazon.com/gp/feedback-manager/view-all-feedback.html?” & _
    “ie=UTF8&” & _
    “sortType=sortByDate&” & _
    “pageSize=50&” & _
    “dateRange=&” & _
    “currentPage=” & intPage & “&” & _
    “descendingOrder=1”

    I know it also needs to change out… objMatch.SubMatches

    I would appreciate any help……..

  3. oops … try again on the first section….

    strRecordPattern =""(\d\d?/\d\d?/\d{4}).*?"" & _
                                "<\d).*?" & _
                                "(.*?).*?" & _
    							""orderID=(.*?)"">(.*?).*?"" & _
                                "buyerID=(.*?)"">(.*?)"

    If I’m not smart enough to get the HTML to show correctly this time I’m sure you can….. here goes….


Leave a comment

No trackbacks yet.