Aggregating Seller Feedback from Amazon.com
I received a request from a professor to aggregate all of the feedback for a given Amazon.com seller. The problem is that Amazon.com seller feedback is displayed 25 at a time on the feedback page and, of course, even if we could get all of them on one page, there is NO way we're going to manually move all the records to a database when there are 50,000+ feedback records.
In this post, I will discuss one approach to solving this problem
Plan A - Use the Amazon.com API
After some research, I came up with this query to pull feedback via their ECS api doodad:
http://ecs.amazonaws.com/onca/xml? Service=AWSECommerceService& AWSAccessKeyId=[myKEYID]& Operation=SellerLookup& SellerId=[SELLERID]& ResponseGroup=Seller& FeedbackPage=1
So that's nice... but it only returns 5 records. On top of that, the valid range of FeedbackPage is 1-10, so it only allows a maximum of 50 (5 records per page by 10 pages) records. Not cool.
Plan B - Scraping HTML pages via XMLHTTP
Step 1: Figure out what pages we want to scrape
Lets start here:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&marketplaceSeller=1&seller=ADWEDOQFWH4RS
There's a "Next Page" link at the bottom that goes here:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&pageNumber=1&marketplaceSeller=1&seller=ADWEDOQFWH4RS
Notice that the distinguishing query parameter is pageNumber. So, for simplicity, lets see if pageNumber=0 yields the first page:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&pageNumber=0&marketplaceSeller=1&seller=ADWEDOQFWH4RS
Ok, that worked as expected. Now, lets see what happens when we get to the last page (actually, the page after last) by thowing in a huge page number:
http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&asin=0471789569&pageNumber=5000&marketplaceSeller=1&seller=ADWEDOQFWH4RS
Perfect - no records. So now for a given seller, we can just start at pageNumber 0 and keep going up until the returned page doesn't have any feedback on it.
Step 2 - What's a record look like?
We need to define a regular expression pattern to identify a feedback record in the HTML pages. Now, this doesn't have to be perfect, it just has to work for now. These patterns will be dependant on the HTML formatting of the site (if they don't use IDs - Amazon.com doesn't) - so if they change formatting too much, we'll have to make a new pattern. No real big deal, but it is important to note that this isn't an exact science. Just make it work.
View the feedback page source and locate a record:
<tr> <td width="120" valign="top" bgcolor="#F6F6F6"> <span class="small"> <font color="#009900"><b>5</b> out of <b>5</b></font>: </span> </td> <td bgcolor="#F6F6F6"> <span class="small"> "none" <br /> Date: 4/4/2008 Rated by Buyer: Jason M. </span> </td> </tr>
Now - I find it's easiest for this sort of thing to just ignore tabs(\t), newline characters(\n) and carridge return characters(\r) because we will just strip all those before we apply the pattern.
Locate the pieces of info you want to scrap in HTML and convert to a regular expression
| Target Data | HTML | RegEx Pattern |
|---|---|---|
| Feedback rating |
<b>5</b> out of <b>5</b> |
<b>(\d)</b> out of <b>(\d)</b> |
| Comment |
<span class="small"> "none" <br /> |
<span class=""small"">(.*?)<br /> |
| Date |
Date: 4/4/2008 |
Date:\s+(\d\d?/\d\d?/\d{4}) |
| User name |
Rated by Buyer: Jason M.
</span> |
Rated by Buyer: (.*?)< |
And then we just glue it all together with LAZY wildcard matches:
<b>(\d)</b> out of <b>(\d)</b>.*?<span class=""small"">(.*?)<br />.*?Date:\s+(\d\d?/\d\d?/\d{4}).*?Rated by Buyer: (.*?)<Step 3: Implement!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 | '******************************************************** '* amazonSellerFeedback.vbs '* Tim Laqua, 2008 '* '* Usage: amazonSellerFeedback.vbs SELLERID[,SELLERID[,SELLERID,[...]]] '******************************************************** Set http = createObject("Microsoft.XMLHTTP") Set regEx = New RegExp regEx.IgnoreCase = True regEx.Global = True ' Get the Seller(s) from the command line Set objArgs = WScript.Arguments If objArgs.Count < 1 Then WScript.Echo "Please specify seller ID" WScript.Quit End If ' We allow multiple sellers separated by commas, so split the sellers on a comma arrSellers = Split(objArgs.Item(0), ",") ' Loop for each seller For Each strSeller in arrSellers WScript.Echo "Processing Seller: " & strSeller intPage = 1 ' This pattern identifies a single feedback record strRecordPattern = "<b>(\d)</b> out of <b>(\d)</b>.*?" & _ "<span class=""small"">(.*?)<br />.*?" & _ "Date:\s+(\d\d?/\d\d?/\d{4}).*?" & _ "Rated by Buyer: (.*?)<" intTimerStart = Timer ' Create the output file strFileName = "amazonSellerFeedback_" & strSeller & ".csv" Set objFSO = CreateObject("Scripting.FileSystemObject") Set objTextFile = objFSO.CreateTextFile(strFileName, True) ' Write column headers objTextFile.WriteLine "Seller,Rating,Comment,Date,User" ' Populate strFeedbackText with first feedback page strFeedbackText = getFeedback(intPage) regEx.Pattern = strRecordPattern ' Keep going until the page doesn't have any feedback records on it Do While regEx.Test(strFeedbackText) Wscript.Echo "Processing records " & intPage*25 & "-" & (intPage+1)*25 Set colMatches = regEx.Execute(strFeedbackText) For Each objMatch in colMatches strRecord = strSeller & "," & _ objMatch.SubMatches(0) & "," & _ """" & Replace(Trim(objMatch.SubMatches(2)),"""","""""") & """," & _ Trim(objMatch.SubMatches(3)) & "," & _ """" & Replace(Trim(objMatch.SubMatches(4)),"""","""""") & """" 'Write record to file objTextFile.WriteLine(strRecord) Next intPage = intPage + 1 strFeedbackText = getFeedback(intPage) regEx.Pattern = strRecordPattern Loop WScript.Echo "Processing Finished..." & vbCrLf objTextFile.Close ' Stats and status messages WScript.Echo "Data Saved to " & strFileName intProcTime = Timer - intTimerStart WScript.Echo "Processing Time: " & intProcTime & " seconds" WScript.Echo "Per Record Time: " & Round(intProcTime/((intPage - 1)*25), 4) Next ' Function to retreive a a given feedback page for the current strSeller Function getFeedback(intFeedbackPage) strURL = "http://www.amazon.com/gp/help/seller/feedback.html?" & _ "ie=UTF8&" & _ "asin=0471789569&" & _ "pageNumber=" & intPage & "&" & _ "seller=" & strSeller http.open "GET", strURL, False http.send regEx.Pattern = "\n|\r|\t" strMungedResponse = regEx.Replace(http.responseText, "") getFeedback = strMungedResponse End Function |
Screenshot of the script running in 7 different threads (I got impatient):





April 11th, 2008 - 01:27
Your blog is very informative, I have learned so much from it. It is like daily newspaper
. Added to fav’s.