A Method for Measuring “Thanks” Part 2: Scraping Query Results for Analysis in a Collaborative Project
A few weeks ago, I posted the first installment of this series on the methodology behind the “Measuring ‘Thanks’” project that Chris Bourg and I will be presenting at DLF in almost a month. In that post, I covered the query for identifying potential library acknowledgements that we used. Today, I will be covering how to unleash those possible data points from the confining query results page by using a web scraper designed for use with Google Chrome.
The first thing I would like to touch on is that the web scraper discussed in this blog post is not the only one. There are many options for web scraping (and many of them are described in detail here). Since I am writing this post with library folks in mind who may or may not have a programming background, I have decided to demonstrate web scraping using a Google Chrome add-on that not only has an easy-to-use UI, but also has a function to export the results as a spreadsheet to your Google Drive account, making it that much easier to share data for collaboration. If you are interested in learning how to do web scraping directly, via a programming language, I can direct you to two pretty good tutorials:
- If Ruby is your thing, you can check out my colleague Jason Heppler’s blog post on using Nokogiri for scraping, or
- There is an extremely thorough tutorial on Beautiful Soup, a Python-based web scraping suite, at Python for Beginners.
Now, if you haven’t already done so, go ahead and download Google Chrome (click here to download it for your operating system) and the Google Chrome Scraper from the Chrome Web Store (click here and follow the directions for getting it installed). Once you have this add-on installed, we can begin the really easy task of unleashing your data from the query results page.
Before we commence with scraping, let’s make this task that much easier by modifying the search results settings in Google so that you have fewer pages to scrape. On the right side of your results screen, you will see a cog. Click it, and then select “Search Settings.”
On the next screen, you will see numerous options. Let’s focus on the third one, labeled “Results per Page.” If you haven’t already changed this, you will want to move the slider all the way over to the right so that you are displaying 100 results per page. Although modifying this setting will cause Google to display your results more slowly, it really speeds up the scraping process. Once you are finished scraping, if you so desire, come back and change this back to whatever setting you wish.
Now that we have optimized your results display for web scraping, let’s get started scraping! What you will want to do first is to right-click anywhere on the first page of your query results and select “Scrape similar….” What this will do is launch the Google Chrome Scraper.
Once you have clicked “Scrape similar…”, you should see the Scraper interface.
The first thing you will want to do is make sure that you have selected XPath as the Selector. All of the web scraping in this tutorial will be done using XPath expressions; we will not be doing any of our scraping with jQuery today. However, you have the jQuery option if you ever want to harvest data from a website using this tool in the future.
Once you have confirmed that the Selector is set to XPath, you should copy/paste the following reference into the XPath Reference box:
What this expression does is tell the scraper exactly where to go on the query results page. I won’t go into every single aspect of this declaration, but what I will say is that it points directly to each and every one of your Google book results.
Now, we need to identify each piece of information that we want to harvest from the query results page. For the purposes of our study, we obviously wanted the title of each book. Since I have already pointed the scraper to the div container that holds the metadata for each book entry, it is really easy for me to extract the individual pieces of information for each book as an array (which the Google Chrome Scraper does automagically and will be really important when we go to export the data). For our study, we wanted to harvest the title, author(s), year, description (HTML results snippet), and the link for preview in Google Books.
And here are those specific XPath expressions in a format that you can copy/paste into your Scraper:
- Title: ./h3
- Author: ./div/div/div/a
- Year: ./div/div/div
- Description: ./div/div/span
- Link: ./h3/a/@href
The first thing I would like to point out about these XPath expressions, for those of you not already familiar with XPath, is that they all begin with a period [.]. The Title, Author, Year, Description, and Link XPath expressions are all what we call relative paths. What the period does is tell the scraper that each element matched by the expression we put in the XPath Reference box at the top is the current element, and that each relative path should be evaluated starting from it. Basically, the period [.] saves us more typing. Secondly, you will notice that the Author and Year XPath expressions are extremely similar. The reason for this is that Google renders the Year as part of the contents of the div that contains a sub-element “a” whose contents are the Author’s name. Basically, Google has made the div holding the book’s Year the parent of the author’s name.
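For readers who would like to see how a leading period behaves outside of the Scraper, here is a minimal sketch in Python using only the standard library. Everything about the sample markup is an assumption made for illustration: the ol/li wrapper, the ./li reference expression, the example.org links, and the book data are all hypothetical, and Google's real results page is more complicated. Python's built-in ElementTree also does not understand the @href step, so the sketch reads the attribute with get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a results page (not Google's real markup).
SAMPLE = """
<ol>
  <li>
    <h3><a href="http://books.example.org/1">First Book</a></h3>
    <div><div>
      <div><a>Ann Author</a> - 2009 - Preview</div>
      <span>thanks to the library staff</span>
    </div></div>
  </li>
  <li>
    <h3><a href="http://books.example.org/2">Second Book</a></h3>
    <div><div>
      <div><a>Bob Writer</a> - 1998 - Snippet view</div>
      <span>grateful to the librarians</span>
    </div></div>
  </li>
</ol>
"""


def text_of(element):
    """Return all text inside an element, including text of child elements."""
    return "".join(element.itertext()).strip()


def scrape(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    # "./li" plays the role of the reference expression (hypothetical here).
    for result in root.findall("./li"):
        # Each path below starts with a period, so it is resolved
        # relative to the current result element.
        rows.append({
            "title": text_of(result.find("./h3")),
            "author": text_of(result.find("./div/div/div/a")),
            "year": text_of(result.find("./div/div/div")),
            "description": text_of(result.find("./div/div/span")),
            # ElementTree lacks "@href" support; read the attribute directly.
            "link": result.find("./h3/a").get("href"),
        })
    return rows


rows = scrape(SAMPLE)
for row in rows:
    print(row)
```

Notice that the year value in each row still carries the author's name and the trailing text, because the Year path selects the whole div that contains the author link.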
Once you have copied/pasted all of the XPath expressions into the Scraper for the data you would like to harvest from the query page, go ahead and click the Scrape button to view your data harvest in the preview window.
One thing you will notice in your preview window is that the Year contains the author’s name and some other extraneous text. Remember earlier when I pointed out that the div containing the Year data was the parent of the a element that contained the author’s name? Well, because the Year expression selects that whole div, getting the year means getting the rest of that information as well. No fear though, you can get rid of that extra information using basic Find/Replace, more advanced Regular Expressions Find/Replace, Google Refine, or any other tool you can think of.
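If you would rather do that cleanup in a script, a regular expression can pull the year out of the harvested text. The following is only a sketch in Python: the sample strings are hypothetical, and the pattern assumes the year is a four-digit number between 1500 and 2099, so an author name or other text that happens to contain such a number would trip it up.

```python
import re

# Assumption: a publication year is a standalone four-digit number, 1500-2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def extract_year(raw):
    """Return the first plausible year found in the raw text, or None."""
    match = YEAR_PATTERN.search(raw)
    return match.group(1) if match else None


# Hypothetical examples of what the Year column might contain.
print(extract_year("Ann Author - 2009 - Preview"))
print(extract_year("Bob Writer - Snippet view"))
```

Returning None when nothing matches makes it easy to spot the rows that still need a manual look.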
After verifying that you are harvesting the data you actually want, make sure to save this series of expressions as a preset for you to use later. This way, you don’t have to come back and copy/paste everything into the Scraper again. You can do this by clicking the Presets button, assigning a name to this setting (#libthanks Google Books Harvest is our personal favorite), and then hitting Save.
Finally, go ahead and click the Export to Google Docs… button on the bottom-right side of the Scraper window to see your query data transformed into a beautiful spreadsheet in Google Drive. From here, you can begin processing your data: confirming if each book actually contains an acknowledgement for your library, or any other transformation of the data that you desire.
By using the Google Chrome Scraper with its built-in function of exporting to Google Docs, we have the ability to work with others on this dataset in a collaborative manner, as well as transform our results into dynamic visualizations and maps using Google Fusion Tables and Google Maps.
Oh and if you are interested in learning more about the actual results from our analysis of Stanford acknowledgements, come check out our panel at DLF (or watch for a blog post version of that presentation that will likely make its way either to this site or the Feral Librarian).
Until next time….
Update September 25, 2013 –
If you’re interested in seeing how this same process is used with results from a tool other than Google, check out my complementary blog post on the Stanford Digital Humanities website. In that post, I demonstrate the use of Google Chrome Scraper on results from Opening Night! Opera and Oratorio Premieres. This is a Ruby on Rails web application that performs dynamic queries to a SOLR index. If you’re interested in seeing how I unleash metadata about operas and oratorios that were inspired by the works of Shakespeare so that we all can explore the proliferation of his literary influence, check out “Shakespeare Goes to the Opera.”