<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>tim laqua dot com &#187; ecommerce</title>
	<atom:link href="http://timlaqua.com/tag/ecommerce/feed/" rel="self" type="application/rss+xml" />
	<link>http://timlaqua.com</link>
	<description>Thoughts and Code from Tim Laqua</description>
	<lastBuildDate>Sun, 09 May 2010 15:25:58 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Aggregating Seller Feedback from Amazon.com</title>
		<link>http://timlaqua.com/2008/04/aggregating-seller-feedback-from-amazoncom/</link>
		<comments>http://timlaqua.com/2008/04/aggregating-seller-feedback-from-amazoncom/#comments</comments>
		<pubDate>Sat, 05 Apr 2008 05:35:12 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[Scripts & Code]]></category>
		<category><![CDATA[ecommerce]]></category>
		<category><![CDATA[vbscript]]></category>
		<category><![CDATA[xmlhttp]]></category>

		<guid isPermaLink="false">http://timlaqua.com/?p=4</guid>
		<description><![CDATA[I received a request from a professor to aggregate all of the feedback for a given Amazon.com seller. The problem is that Amazon.com seller feedback is displayed 25 at a time on the feedback page and, of course, even if we could get all of them on one page, there is NO way we're going [...]]]></description>
			<content:encoded><![CDATA[<p>I received a request from a professor to aggregate all of the feedback for a given Amazon.com seller.  The problem is that Amazon.com seller feedback is displayed 25 at a time on the feedback page and, of course, even if we could get all of them on one page, there is NO way we're going to manually move all the records to a database when there are 50,000+ feedback records.</p>
<p>In this post, I will discuss one approach to solving this problem.<br />
<span id="more-4"></span><br />
<u><strong>Plan A - Use the Amazon.com API</strong></u><br />
After some research, I came up with this query to pull feedback via their ECS API doodad:</p>
<pre>

http://ecs.amazonaws.com/onca/xml?

  Service=AWSECommerceService&#038;
  AWSAccessKeyId=[myKEYID]&#038;
  Operation=SellerLookup&#038;
  SellerId=[SELLERID]&#038;
  ResponseGroup=Seller&#038;
  FeedbackPage=1
</pre>
<p>So that's nice... but it only returns 5 records per page.  On top of that, the valid range of FeedbackPage is 1-10, so the API tops out at 50 records (5 records per page times 10 pages).  Not cool.</p>
<p><u><strong>Plan B - Scraping HTML pages via XMLHTTP</strong></u><br />
<strong>Step 1: Figure out what pages we want to scrape</strong><br />
Let's start here:<br />
<a href="http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS">http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS</a></p>
<p>There's a "Next Page" link at the bottom that goes here:<br />
<a href="http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;pageNumber=1&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS">http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;pageNumber=1&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS</a></p>
<p>Notice that the distinguishing query parameter is <strong>pageNumber</strong>.  So, for simplicity, let's see if pageNumber=0 yields the first page:<br />
<a href="http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;pageNumber=0&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS">http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;pageNumber=0&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS</a></p>
<p>Ok, that worked as expected.  Now, let's see what happens when we get to the last page (actually, the page after last) by throwing in a huge page number:<br />
<a href="http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;pageNumber=5000&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS">http://www.amazon.com/gp/help/seller/feedback.html?ie=UTF8&#038;asin=0471789569&#038;pageNumber=5000&#038;marketplaceSeller=1&#038;seller=ADWEDOQFWH4RS</a></p>
<p>Perfect - no records.  So now for a given seller, we can just start at pageNumber 0 and keep going up until the returned page doesn't have any feedback on it.</p>
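<p>As a rough sketch of that paging scheme, the page URLs could be built like this. <code>BuildFeedbackUrl</code> is a hypothetical helper (it is not part of the script further down), and the query parameters are simply the ones seen in the links above:</p>

<div class="wp_syntax"><div class="code"><pre class="vb" style="font-family:monospace;">' Sketch only: BuildFeedbackUrl is a hypothetical helper, not part of the original script
Function BuildFeedbackUrl(strSellerId, intPageNumber)
  BuildFeedbackUrl = &quot;http://www.amazon.com/gp/help/seller/feedback.html?&quot; &amp; _
      &quot;ie=UTF8&amp;&quot; &amp; _
      &quot;asin=0471789569&amp;&quot; &amp; _
      &quot;pageNumber=&quot; &amp; intPageNumber &amp; &quot;&amp;&quot; &amp; _
      &quot;marketplaceSeller=1&amp;&quot; &amp; _
      &quot;seller=&quot; &amp; strSellerId
End Function
&nbsp;
' pageNumber is zero-based: 0 is the first page, 1 the second, and so on
For i = 0 To 2
  WScript.Echo BuildFeedbackUrl(&quot;ADWEDOQFWH4RS&quot;, i)
Next</pre></div></div>

<p>Run it with cscript to print the URLs for the first three pages.</p>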
<p><strong>Step 2 - What's a record look like?</strong><br />
We need to define a regular expression pattern to identify a feedback record in the HTML pages.  Now, this doesn't have to be perfect, it just has to work for now.  These patterns depend on the site's HTML formatting (especially when the markup doesn't use IDs, and Amazon.com's doesn't), so if they change the formatting too much, we'll have to make a new pattern.  No real big deal, but it is important to note that this isn't an exact science.  Just make it work.</p>
<p>View the feedback page source and locate a record:</p>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;">            <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">tr</span>&gt;</span>
              <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">td</span> <span style="color: #000066;">width</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;120&quot;</span> <span style="color: #000066;">valign</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;top&quot;</span> <span style="color: #000066;">bgcolor</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;#F6F6F6&quot;</span>&gt;</span>
                <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">span</span> <span style="color: #000066;">class</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;small&quot;</span>&gt;</span>
                  <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">font</span> <span style="color: #000066;">color</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;#009900&quot;</span>&gt;&lt;<span style="color: #000000; font-weight: bold;">b</span>&gt;</span>5<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">b</span>&gt;</span> out of <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">b</span>&gt;</span>5<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">b</span>&gt;&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">font</span>&gt;</span>:
                <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">span</span>&gt;</span>
              <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">td</span>&gt;</span>
              <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">td</span> <span style="color: #000066;">bgcolor</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;#F6F6F6&quot;</span>&gt;</span>
                <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">span</span> <span style="color: #000066;">class</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;small&quot;</span>&gt;</span>
                  &quot;none&quot;
                  <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">br</span> <span style="color: #66cc66;">/</span>&gt;</span>
                  Date: 4/4/2008 <span style="color: #ddbb00;">&amp;nbsp;&amp;nbsp;&amp;nbsp;</span> Rated by Buyer: Jason M.
                <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">span</span>&gt;</span>
              <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">td</span>&gt;</span>
            <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">tr</span>&gt;</span></pre></div></div>

<p>Now - I find it's easiest for this sort of thing to just ignore tabs (\t), newline characters (\n) and carriage return characters (\r), because we will strip all of those before we apply the pattern.</p>
<p>Locate the pieces of info you want to scrape in the HTML and convert each one to a regular expression:</p>
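<p>A minimal sketch of that stripping step, assuming the page text is already in a string. Here <code>strSample</code> is a made-up stand-in for the real response text:</p>

<div class="wp_syntax"><div class="code"><pre class="vb" style="font-family:monospace;">' Sketch only: strSample stands in for the text of a downloaded page
Set regEx = New RegExp
regEx.Global = True          ' replace every match, not just the first
regEx.Pattern = &quot;[\t\n\r]&quot;   ' tab, newline, carriage return
&nbsp;
strSample = &quot;&lt;tr&gt;&quot; &amp; vbCrLf &amp; vbTab &amp; &quot;&lt;td&gt;5&lt;/td&gt;&quot; &amp; vbCrLf &amp; &quot;&lt;/tr&gt;&quot;
&nbsp;
' Everything ends up on one line, so a record never spans a line break
WScript.Echo regEx.Replace(strSample, &quot;&quot;)</pre></div></div>

<p>With the line breaks gone, the dot in the patterns can match across what used to be separate lines.</p>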
<table>
<tr>
<th>Target Data</th>
<th>HTML</th>
<th>RegEx Pattern</th>
</tr>
<tr>
<td>Feedback rating</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;"><span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">b</span>&gt;</span>5<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">b</span>&gt;</span> out of <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">b</span>&gt;</span>5<span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">b</span>&gt;</span></pre></div></div>

</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&lt;b&gt;(\d)&lt;/b&gt; out of &lt;b&gt;(\d)&lt;/b&gt;</pre></div></div>

</td>
</tr>
<tr>
<td>Comment</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;">                <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">span</span> <span style="color: #000066;">class</span><span style="color: #66cc66;">=</span><span style="color: #ff0000;">&quot;small&quot;</span>&gt;</span>
                  &quot;none&quot;
                  <span style="color: #009900;">&lt;<span style="color: #000000; font-weight: bold;">br</span> <span style="color: #66cc66;">/</span>&gt;</span></pre></div></div>

</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&lt;span class=&quot;small&quot;&gt;(.*?)&lt;br /&gt;</pre></div></div>

</td>
</tr>
<tr>
<td>Date</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;">Date: 4/4/2008</pre></div></div>

</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Date:\s+(\d\d?/\d\d?/\d{4})</pre></div></div>

</td>
</tr>
<tr>
<td>User name</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="html4strict" style="font-family:monospace;">Rated by Buyer: Jason M.
                <span style="color: #009900;">&lt;<span style="color: #66cc66;">/</span><span style="color: #000000; font-weight: bold;">span</span>&gt;</span></pre></div></div>

</td>
<td>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Rated by Buyer: (.*?)&lt;</pre></div></div>

</td>
</tr>
</table>
<p>And then we just glue it all together with LAZY wildcard matches:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&lt;b&gt;(\d)&lt;/b&gt; out of &lt;b&gt;(\d)&lt;/b&gt;.*?&lt;span class=&quot;small&quot;&gt;(.*?)&lt;br /&gt;.*?Date:\s+(\d\d?/\d\d?/\d{4}).*?Rated by Buyer: (.*?)&lt;</pre></div></div>
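<p>Before pointing the pattern at live pages, it can be tried against a single record. This is only a sketch: <code>strSample</code> is a hand-flattened copy of the record shown earlier, and inside a VBScript string literal every double quote has to be doubled:</p>

<div class="wp_syntax"><div class="code"><pre class="vb" style="font-family:monospace;">' Sketch only: strSample is a hand-flattened copy of the sample record
Set regEx = New RegExp
regEx.IgnoreCase = True
regEx.Global = True
regEx.Pattern = &quot;&lt;b&gt;(\d)&lt;/b&gt; out of &lt;b&gt;(\d)&lt;/b&gt;.*?&quot; &amp; _
                &quot;&lt;span class=&quot;&quot;small&quot;&quot;&gt;(.*?)&lt;br /&gt;.*?&quot; &amp; _
                &quot;Date:\s+(\d\d?/\d\d?/\d{4}).*?&quot; &amp; _
                &quot;Rated by Buyer: (.*?)&lt;&quot;
&nbsp;
strSample = &quot;&lt;font color=&quot;&quot;#009900&quot;&quot;&gt;&lt;b&gt;5&lt;/b&gt; out of &lt;b&gt;5&lt;/b&gt;&lt;/font&gt;:&lt;/span&gt;&lt;/td&gt;&quot; &amp; _
            &quot;&lt;td bgcolor=&quot;&quot;#F6F6F6&quot;&quot;&gt;&lt;span class=&quot;&quot;small&quot;&quot;&gt;&quot;&quot;none&quot;&quot;&lt;br /&gt;&quot; &amp; _
            &quot;Date: 4/4/2008 &amp;nbsp;&amp;nbsp;&amp;nbsp; Rated by Buyer: Jason M.&lt;/span&gt;&lt;/td&gt;&quot;
&nbsp;
' Each match is one feedback record; SubMatches holds the captured groups in order
For Each objMatch In regEx.Execute(strSample)
  WScript.Echo &quot;Rating:  &quot; &amp; objMatch.SubMatches(0) &amp; &quot; of &quot; &amp; objMatch.SubMatches(1)
  WScript.Echo &quot;Comment: &quot; &amp; Trim(objMatch.SubMatches(2))
  WScript.Echo &quot;Date:    &quot; &amp; objMatch.SubMatches(3)
  WScript.Echo &quot;Buyer:   &quot; &amp; Trim(objMatch.SubMatches(4))
Next</pre></div></div>

<p>The capture groups come back in the same order as the rows in the table: rating, maximum rating, comment, date and buyer.</p>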

<p><strong>Step 3: Implement!</strong></p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="vb" style="font-family:monospace;"><span style="color: #008000;">'********************************************************
</span><span style="color: #008000;">'* amazonSellerFeedback.vbs
</span><span style="color: #008000;">'* Tim Laqua, 2008
</span><span style="color: #008000;">'* 
</span><span style="color: #008000;">'* Usage: amazonSellerFeedback.vbs SELLERID[,SELLERID[,SELLERID,[...]]]
</span><span style="color: #008000;">'********************************************************
</span><span style="color: #000080;">Set</span> http = createObject(&quot;Microsoft.XMLHTTP&quot;)
<span style="color: #000080;">Set</span> regEx = <span style="color: #000080;">New</span> RegExp
regEx.IgnoreCase = <span style="color: #000080;">True</span>
regEx.Global = <span style="color: #000080;">True</span>
&nbsp;
<span style="color: #008000;">' Get the Seller(s) from the command line
</span><span style="color: #000080;">Set</span> objArgs = WScript.Arguments
<span style="color: #000080;">If</span> objArgs.Count &lt; 1 <span style="color: #000080;">Then</span>
  WScript.Echo &quot;Please specify seller ID&quot;
  WScript.Quit
<span style="color: #000080;">End</span> <span style="color: #000080;">If</span>
&nbsp;
<span style="color: #008000;">' We allow multiple sellers separated by commas, so split the sellers on a comma
</span>arrSellers = Split(objArgs.Item(0), &quot;,&quot;)
&nbsp;
<span style="color: #008000;">' Loop for each seller
</span><span style="color: #000080;">For</span> <span style="color: #000080;">Each</span> strSeller <span style="color: #000080;">in</span> arrSellers
  WScript.Echo &quot;Processing Seller: &quot; &amp; strSeller
  <span style="color: #008000;">' pageNumber is zero-based, so start at 0 to include the first page
</span>  intPage = 0
&nbsp;
  <span style="color: #008000;">' This pattern identifies a single feedback record
</span>  strRecordPattern = &quot;&lt;b&gt;(\d)&lt;/b&gt; out of &lt;b&gt;(\d)&lt;/b&gt;.*?&quot; &amp; _
                            &quot;&lt;span class=&quot;&quot;small&quot;&quot;&gt;(.*?)&lt;br /&gt;.*?&quot; &amp; _
                            &quot;Date:\s+(\d\d?/\d\d?/\d{4}).*?&quot; &amp; _
                            &quot;Rated by Buyer: (.*?)&lt;&quot;
&nbsp;
  intTimerStart = Timer
&nbsp;
  <span style="color: #008000;">' Create the output file
</span>  strFileName = &quot;amazonSellerFeedback_&quot; &amp; strSeller &amp; &quot;.csv&quot;
  <span style="color: #000080;">Set</span> objFSO = CreateObject(&quot;Scripting.FileSystemObject&quot;)
  <span style="color: #000080;">Set</span> objTextFile = objFSO.CreateTextFile(strFileName, <span style="color: #000080;">True</span>)
&nbsp;
  <span style="color: #008000;">' Write column headers
</span>  objTextFile.WriteLine &quot;Seller,Rating,Comment,<span style="color: #000080;">Date</span>,User&quot;
&nbsp;
  <span style="color: #008000;">' Populate strFeedbackText with first feedback page
</span>  strFeedbackText = getFeedback(intPage)
  <span style="color: #008000;">' getFeedback changes regEx.Pattern, so restore the record pattern
</span>  regEx.Pattern = strRecordPattern
&nbsp;
  <span style="color: #008000;">' Keep going until the page doesn't have any feedback records on it
</span>  <span style="color: #000080;">Do</span> <span style="color: #000080;">While</span> regEx.Test(strFeedbackText)
    Wscript.Echo &quot;Processing records &quot; &amp; (intPage*25 + 1) &amp; &quot;-&quot; &amp; (intPage+1)*25
&nbsp;
    <span style="color: #000080;">Set</span> colMatches = regEx.Execute(strFeedbackText)
    <span style="color: #000080;">For</span> <span style="color: #000080;">Each</span> objMatch <span style="color: #000080;">in</span> colMatches
      strRecord = strSeller &amp; &quot;,&quot; &amp; _
            objMatch.SubMatches(0) &amp; &quot;,&quot; &amp; _
            &quot;&quot;&quot;&quot; &amp; Replace(Trim(objMatch.SubMatches(2)),&quot;&quot;&quot;&quot;,&quot;&quot;&quot;&quot;&quot;&quot;) &amp; &quot;&quot;&quot;,&quot; &amp; _
            Trim(objMatch.SubMatches(3)) &amp; &quot;,&quot; &amp; _
            &quot;&quot;&quot;&quot; &amp; Replace(Trim(objMatch.SubMatches(4)),&quot;&quot;&quot;&quot;,&quot;&quot;&quot;&quot;&quot;&quot;) &amp; &quot;&quot;&quot;&quot;
&nbsp;
      <span style="color: #008000;">'Write record to file
</span>      objTextFile.WriteLine(strRecord)
&nbsp;
    <span style="color: #000080;">Next</span>
    intPage = intPage + 1
    strFeedbackText = getFeedback(intPage)
    regEx.Pattern = strRecordPattern
  <span style="color: #000080;">Loop</span> 
&nbsp;
  WScript.Echo &quot;Processing Finished...&quot; &amp; vbCrLf
  objTextFile.<span style="color: #000080;">Close</span>
&nbsp;
  <span style="color: #008000;">' Stats and status messages
</span>  WScript.Echo &quot;Data Saved <span style="color: #000080;">to</span> &quot; &amp; strFileName
  intProcTime = Timer - intTimerStart
  WScript.Echo &quot;Processing Time: &quot; &amp; intProcTime &amp; &quot; seconds&quot;
  <span style="color: #008000;">' intPage now equals the number of pages processed; skip the stat if there were none
</span>  <span style="color: #000080;">If</span> intPage &gt; 0 <span style="color: #000080;">Then</span>
    WScript.Echo &quot;Per Record Time: &quot; &amp; Round(intProcTime/(intPage*25), 4)
  <span style="color: #000080;">End</span> <span style="color: #000080;">If</span>
<span style="color: #000080;">Next</span>
&nbsp;
<span style="color: #008000;">' Function to retrieve a given feedback page for the current strSeller
</span><span style="color: #000080;">Function</span> getFeedback(intFeedbackPage)
  strURL =   &quot;http://www.amazon.com/gp/help/seller/feedback.html?&quot; &amp; _
        &quot;ie=UTF8&amp;&quot; &amp; _
        &quot;asin=0471789569&amp;&quot; &amp; _
        &quot;pageNumber=&quot; &amp; intFeedbackPage &amp; &quot;&amp;&quot; &amp; _
        &quot;seller=&quot; &amp; strSeller
&nbsp;
  http.<span style="color: #000080;">open</span> &quot;GET&quot;, strURL, <span style="color: #000080;">False</span>
  http.send
&nbsp;
  regEx.Pattern = &quot;\n|\r|\t&quot;
  strMungedResponse = regEx.Replace(http.responseText, &quot;&quot;)
&nbsp;
  getFeedback = strMungedResponse
<span style="color: #000080;">End</span> <span style="color: #000080;">Function</span></pre></td></tr></table></div>
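<p>The CSV quoting in the loop above (wrap the field in double quotes and double any quotes inside it) could be pulled out into a small function. <code>CsvQuote</code> is a hypothetical name used only for this sketch:</p>

<div class="wp_syntax"><div class="code"><pre class="vb" style="font-family:monospace;">' Sketch only: CsvQuote is a hypothetical helper, not part of the original script
Function CsvQuote(strValue)
  ' Trim the field, double any embedded quotes, then wrap the result in quotes
  CsvQuote = &quot;&quot;&quot;&quot; &amp; Replace(Trim(strValue), &quot;&quot;&quot;&quot;, &quot;&quot;&quot;&quot;&quot;&quot;) &amp; &quot;&quot;&quot;&quot;
End Function
&nbsp;
WScript.Echo CsvQuote(&quot; She said &quot;&quot;great seller&quot;&quot; &quot;)</pre></div></div>

<p>That should print <code>"She said ""great seller"""</code>, which a CSV reader turns back into the original comment, embedded quotes and commas included.</p>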

<p>Screenshot of the script running in 7 different threads (I got impatient):<br />
<a href="/wp-content/uploads/2008/04/amazonsellerfeedback_7threads_done.jpg" title="Enlarge"><img src="/wp-content/uploads/2008/04/amazonsellerfeedback_7threads_done.jpg" width="600px" /></a></p>
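<p>The post doesn't record how those seven runs were started. One way to do it, assuming each run is a separate cscript process working on its own seller, is a small launcher like the sketch below. The seller IDs are placeholders:</p>

<div class="wp_syntax"><div class="code"><pre class="vb" style="font-family:monospace;">' Sketch only: the seller IDs below are placeholders
arrSellers = Array(&quot;SELLERID1&quot;, &quot;SELLERID2&quot;, &quot;SELLERID3&quot;)
Set objShell = CreateObject(&quot;WScript.Shell&quot;)
&nbsp;
For Each strSeller In arrSellers
  ' Window style 1 = normal window; False = don't wait for the process to exit
  objShell.Run &quot;cscript //nologo amazonSellerFeedback.vbs &quot; &amp; strSeller, 1, False
Next</pre></div></div>

<p>Each process writes its own amazonSellerFeedback_SELLERID.csv, so the runs don't step on each other's output.</p>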
]]></content:encoded>
			<wfw:commentRss>http://timlaqua.com/2008/04/aggregating-seller-feedback-from-amazoncom/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
