A Method for Measuring “Thanks” Part 1: A Search for Thankful Candidates
Recently, Chris Bourg shared on her blog that we will be presenting the findings from our “proof of concept” inquiry into assessing library impact by text mining acknowledgements at DLF in Austin, Texas, November 4-6, 2013. We have been given the challenge of presenting this paper in only 7 minutes. That leaves no room for the wonderful, nitty-gritty details of our text analysis/mining methodology, so I thought it might be nice to outline some of those details here before we share all of our really fascinating results at DLF: an appetizer before the main course, if you will.
Before we were able to analyze who was being thanked in our library and the nature of those acknowledgements, we had to literally search for acknowledgements. Again, keeping in mind that this was a proof-of-concept endeavor, we decided to limit our inquiry to books and to find them using Google. Through a series of experiments, we discovered that the most robust and effective syntax for identifying candidates for our corpus was the following query in Google Books:
I will walk you through this syntax step-by-step and explain why we constructed this the way we did. First, let’s take a look at the individual operators.
- & can be used interchangeably with AND, and joins words or expressions that must all be found in the search
- | can be used interchangeably with OR to include more than one term to be found in your search
- “” are used to identify words or phrases that must be found exactly as they are typed
- ~ indicates a word that can be matched flexibly (e.g. librarian in addition to library)
- () require that the terms or expressions found within them be evaluated first; this is also known as nesting
Thus, our expression can be read as the following narrative:
We decided to include Special Collections because, in our initial experimental queries, we noticed that authors most frequently referred to that part of our library, as well as to specific archives. The same logic led us to also include Green and Cecil H. Green along with various combinations of Stanford (University Library/ies).
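For readers who would rather assemble this kind of expression programmatically, here is a minimal sketch in Python of how the operators described above fit together. It is not the query we actually ran: the helper names (`quote`, `any_of`, `all_of`) and both term lists are hypothetical, loosely based on the names mentioned above, and are only meant to show how quoting, OR, AND, and nesting combine.

```python
def quote(phrase):
    """Wrap a phrase in double quotes so it must match exactly as typed."""
    return '"' + phrase + '"'


def any_of(terms):
    """Join terms with | (OR) and nest them in parentheses."""
    return "(" + " | ".join(terms) + ")"


def all_of(parts):
    """Join parts with & (AND) so that every part must be found."""
    return " & ".join(parts)


# Hypothetical term lists, for illustration only.
thanks_terms = ["~acknowledgement", "~thanks"]
library_terms = [
    quote("Stanford University Libraries"),
    quote("Special Collections"),
    quote("Cecil H. Green"),
]

# Nesting: each OR group is evaluated first, then the groups are ANDed.
query = all_of([any_of(thanks_terms), any_of(library_terms)])
print(query)
```

Keeping each OR group in its own parentheses matters here: without the nesting, the AND and OR operators could be grouped differently than intended.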
When I submitted this query to Google Books just now, I received “About 981,000 hits.” For the purposes of our specific inquiry, we decided to limit the results to the last 10 years, from 2003 to 2013. To limit the Google query results to a specific year range, first select “Search Tools.” Next, click “Any Time,” and then “Custom Range.”
You will notice that a form with a calendar view will appear. This is where you need to input the date range in which you are interested. This tool is extremely useful in that you can specify date ranges down to specific days. Keep in mind that if you enter just a year, as in the picture below, it will perform a search from January 1st of the start year to December 31st of the end year.
Now, you will find that your search has been limited to the range you have specified.
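The year-only behavior described above can be written down as a small Python sketch. This only illustrates the expansion rule (start year to January 1st, end year to December 31st); the function name `expand_year_range` is hypothetical, and nothing here talks to Google.

```python
from datetime import date
from typing import Tuple


def expand_year_range(start_year: int, end_year: int) -> Tuple[date, date]:
    """Expand a year-only range to full dates.

    The start year becomes January 1st of that year and the end year
    becomes December 31st of that year.
    """
    if start_year > end_year:
        raise ValueError("start_year must not be later than end_year")
    return date(start_year, 1, 1), date(end_year, 12, 31)


start, end = expand_year_range(2003, 2013)
print(start.isoformat(), end.isoformat())  # 2003-01-01 2013-12-31
```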
These results are automatically sorted by relevance. However, you may find it useful to change the sorting to Sort by Date.
This completes Part 1 of “A Method for Measuring Thanks.” The next step in this methodology will cover the automated harvesting of the results of our Google Books query for verification before data/text analysis can commence.
I hope that you have found this little tutorial useful. Be on the lookout for Part 2 of this methodological series in the coming weeks.