A Linguist in the Library
Hi everyone! Unfortunately, I will not be at DLF today for the presentation Chris Bourg and I are giving on our acknowledgement mining project–I’m hanging out with some amazing DH’ers at #codespeak in the UVA Scholar’s Lab. Thus, my virtual avatar will have the task of sharing the methodology and results from this project. Fortunately for all of you not at DLF, you can watch my pre-recorded portion of the presentation….
We also wanted to make available the graphs that I talk about in the presentation, so here they are as well.
Frequency of Specific Libraries Acknowledged in the Data
Frequency of Departments Acknowledged in the Data
LC Subject Area Frequency
#1 Philology & Linguistics Subject Breakdown
#2 History Subject Breakdown
#3 Social Science Subject Breakdown
Comparison of Acknowledgement Author Institutions (2003 vs. 2013)
Difference in Acknowledgement Author Institutions (2003 vs. 2013)
2003 Author Institution Geographic Heat Map
2013 Author Institution Geographic Heat Map
As a Bonus Feature, here are a few graphs that I did not have time to share and explain during the DLF presentation. This first graph is a comparison of Public vs. Private Author Educational Institutions (2003 vs. 2013):
Next, we were also able to mine the sentences where Stanford University Libraries was acknowledged for other institutions/libraries that were also thanked to uncover our peer network:
Thanks for reading, and I hope that you have found this data/the presentation quite useful.
In this week’s post, I want to address a question that I have been asked quite often since coming to Stanford University Libraries, and one that made its way into a comment on Michael Widner’s blog post recounting a recent discussion in the DH reading group here at Stanford on Matt Jockers’s new book, Macroanalysis: “I’m intrigued by the division within linguistics that Hettel describes; if there were a link to a blog post or something like that explaining it, I’d be interested in following up and learning more.” I searched for an existing blog post on this topic and came up empty. Bill Kretzschmar does discuss this dichotomy in linguistics in Chapter 2 of Linguistics of Speech, titled “Saussure,” so it is from this chapter that I will draw the foundation for a portrait of computational language analysis. However, if this is something that you are interested in, I strongly suggest you read this chapter for yourself–especially if you want more after reading this post.
There are many ideologies at play within the field of linguistics as it exists today, but for the most part we can trace all of them back to the person I like to call the grandfather of modern linguistics: Ferdinand de Saussure. Saussure was a Swiss linguist and semiotician during the late 19th century and early 20th century and is considered by most scholars to be one of the founders of modern linguistics. Even more interestingly, his book on general linguistics from which we learn about his theories for investigating language, Cours de linguistique générale (Course in General Linguistics), was published after his death from the notes of his former students Charles Bally and Albert Sechehaye. In the Course, Saussure is recounted as proposing that there are two faces to linguistic inquiry: langue and parole.
Langue is the linguistics of language structure, for example grammar or language rules and systems. Saussure is credited with saying that two of the aims of langue are “to describe all known languages and record their history, and to determine the forces operating permanently and universally in all languages, and to formulate general laws which account for all particular linguistic phenomena historically attested” (Kretzschmar 40). It should also be mentioned that Saussure is noted as proclaiming the importance of langue over parole. However, “the choice of linguistic structure is not inevitable, not ‘natural’ in the sense that it corresponds to an inborn faculty or property of a species; it is the nucleus of an argument to create a science of linguistics, one based on a model with particular premises and with a definite arrangement of its variables” (Kretzschmar 44). Moreover, Saussure himself is even noted as saying that langue is not necessarily more important than parole, but that during his time it was not possible to truly investigate parole (something that technology has since helped to remedy).
Parole is what people actually say: speech, or the individual combinations of words, which can be influenced by the will of the speakers, as well as voluntary physical factors (like phonation). From this perspective, parole (or speech) can be extended to language behavior, how people actually use language: both written and spoken. Ultimately, parole entails the voluntary language acts that result from an individual speaker/writer’s will. So, the linguistics of speech, or the study of language behavior, views linguistic acts as individual and ephemeral: they are always context dependent, and we can better understand the behaviors of groups of these individuals by the patterns that emerge from the aggregate of their spoken or written language use.
The proliferation of Saussure’s ideas posthumously led to some very different approaches to linguistic inquiry in the 20th century. Today, I want to focus on two men who are most often considered the fathers of the two very different methodological and ideological perspectives that are present in computational language analysis (which does include computational approaches to understanding and interpreting literature): J.R. Firth and Noam Chomsky. Chomsky fathered a movement to better understand language by focusing on structure, while Firth aspired to understand language as a behavior. Firth is most frequently known for his unique perspective on language use being context-dependent (or external to a person), while Chomsky has argued for the notion of Universal Grammar, or grammar being hard-wired in the brain (internal to a person). This division also extends to the ways in which these two camps approach meaning. Firth is most often quoted as having said that “You shall know a word by the company it keeps.” In other words, linguistic inquiry from what eventually became known as the “London School” or Neofirthian perspective (linguists like M.A.K. Halliday and John Sinclair) advocates for the investigation of connotative meaning through the analysis of the words that co-occur around a node: its collocates. Thus, a word can have many different meanings. A generative linguistic (or Chomskyan) inquiry, by contrast, would most often focus on finding denotative meaning, the one essential meaning of a word that could be traced back to the notion of a person’s hard-wiring for language. We can reduce these two perspectives to a focus on structure (Chomsky/Generativist) versus function (Firth/Neofirthian). These two very different manifestations of langue and parole can now be found in the work of modern computational linguists like Bill Kretzschmar and Christopher Manning.
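To make the idea of collocates a little more concrete, here is a minimal sketch in Python of counting the words that occur within a small window around a node word. The function name, the window size, and the three sample sentences are all made up for illustration; real collocate analysis would use a proper corpus and usually a statistical measure of association rather than raw counts.

```python
from collections import Counter
import re

def collocates(text, node, window=2):
    """Count the words found within `window` tokens on either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented sample text, not drawn from any real corpus.
sample = ("The dirty mouse ran under the stove. "
          "Mickey Mouse waved at the children. "
          "A dirty mouse chewed the wires.")

mouse_collocates = collocates(sample, "mouse", window=2)
print(mouse_collocates.most_common(3))
```

Even in this toy sample, the company the node keeps ("dirty" versus "mickey") hints at two quite different senses of the same word.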
Kretzschmar’s work has been centered on descriptions of language as it is used by individuals: always dependent on context. On the other hand, Manning’s scholarship is centered on establishing the rules and structure of language through which machines can analyze and interpret it. In other words, Manning does computational langue while Kretzschmar utilizes methods based on computational parole. Logically, we call these approaches by different names. Kretzschmar has developed a model for linguistic analysis called the linguistics of speech (or a method for investigating parole) where language is documented and believed to behave as a complex system. Thus, the behaviors of the aggregate of language samples are always kept in their context by the use of corpus linguistics methodologies. Think of it as maintaining the connection between Close Reading and Distant Reading and traversing the continuum between the two in order to understand language behavior: i.e., concordance views, collocate analysis, semantic sets, cluster analysis that is linked back to concordances, etc.
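A concordance view is the simplest way to see what "kept in context" means. Below is a small keyword-in-context (KWIC) sketch in Python; the function name, the context width, and the sample sentences are my own inventions for illustration, and dedicated concordance software does much more than this.

```python
import re

def kwic(text, node, width=3):
    """Return (left context, node, right context) for every hit of `node`."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text.
sample = ("I thank the library staff for their help. "
          "The library at Stanford was invaluable.")

for left, node, right in kwic(sample, "library"):
    print(f"{left:>20} [{node}] {right}")
```

Every aggregate count can be traced back to lines like these, which is what lets an analysis move between Distant Reading and Close Reading.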
Manning is considered one of the foremost scholars in natural language processing. This perspective operates under the assumption that language behaves systematically in a normal or Gaussian way with regard to its structure. The structure of language is analyzed and interpreted using automated models and procedures. Referring back to a point just made about the relationship between computational approaches and Close Reading/Distant Reading, NLP is primarily focused on a Distant Reading type of experience and does not necessarily afford scholars the ability to connect each higher-level computational datapoint back to where it actually occurs in the context from whence it came. This is the case when researchers employ “bag of words” methods, where the words within a text are sampled/represented as an unordered collection. When this “bag of words” is analyzed, any results are not linked back to their original, contextual behavior. Again, this approach focuses on structure rather than function.
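A tiny sketch shows what is lost in a "bag of words." In the Python below (the function name and the two sentences are invented for illustration), two sentences with opposite meanings produce exactly the same bag, because word order and context are thrown away.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Represent a text as unordered word counts, discarding order and context."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bag_a = bag_of_words("the dog bit the man")
bag_b = bag_of_words("the man bit the dog")

print(bag_a)
print(bag_a == bag_b)
```

Nothing in the resulting counts points back to where each word occurred, which is the limitation described above.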
It should also be noted that both of these perspectives utilize machine learning. Neural Network Algorithms and other logarithmically based models are the preference for many scholars who believe language behaves as a complex system (“Neural Networks and the Linguistics of Speech”), while the NLP community frequently uses parsers and taggers like the Stanford Named Entity Recognizer.
I know that this is all very reductive, is an extreme oversimplification, and that there are many flavors, combinations, and shades of gray between these two ideological perspectives. This is in no way meant to be a complete analysis of the divisions within the linguistic community but rather a Cliff’s Notes style version for those individuals who have found themselves bewildered by the notion that there are some very different points of view as to how to analyze language. As you can probably already tell, I’m pretty passionate about helping people to understand that there are many tools and approaches out there for you to use, and I’m even more eager to help people pick the right tools for what they are trying to do. This purpose is at the heart of what I was sharing with my friends in the Stanford DH reading group, as well as what started the discussion that ultimately led to this post. Thus, I would like to end by issuing a call to action for all of us, myself included, who are employing computational language approaches to literature–personally, I do it for the “literature” of the nuclear industry/government-regulated business domain. Those of us who do computer-based text analyses of written or spoken language (whether they are linguistic interviews or literature) have an obligation to understand, acknowledge, and explore the implications of these differences. We need to work to better understand not only the methodologies we use, but also those methodologies we might not choose to use. Moreover, we need to increase our transparency regarding the relationships between the methodologies we use and the ideologies to which those approaches subscribe. The reason for this is that any time we use a particular methodology, there are always limitations and implications of its use. 
Having more dialogue about why we choose the approaches we choose, sharing the steps of our methodologies, and sharing our raw data are some of the ways in which we can do this–and many of us are doing this already: people using structural approaches like Matt Jockers and Ted Underwood, and those of us focused on function like myself and Heather Froehlich, just to name a few (and many more that it would be amazing to see listed as comments to this post). This type of public accountability–or as my boss the Feral Librarian recently called it, “the ethos of sharing”–will help us make sure that our choices are a result of what is best for our data and the types of analyses we want to perform, and will ultimately result in significantly more profound insights.
I’d like to thank Heather Froehlich and Chris Bourg for their insights and comments while I was drafting this post.
A Method for Measuring “Thanks” Part 2: Scraping Query Results for Analysis in a Collaborative Project
A few weeks ago, I posted the first installment of this series on the methodology behind the “Measuring ‘Thanks’” project that Chris Bourg and I will be presenting at DLF in almost a month. In that post, I covered the query for identifying potential library acknowledgements that we used. Today, I will be covering how to unleash those possible data points from the confining query results page by using a web scraper designed for use with Google Chrome.
The first thing I would like to touch on is that the web scraper discussed in this blog post is not the only one. There are many options for web scraping (and many of them are described in detail here). Since I am writing this post with library folks in mind who may or may not have a programming background, I have decided to demonstrate web scraping using a Google Chrome add-on that not only has an easy-to-use UI, but also has a function to export the results as a spreadsheet to your Google Drive account–making it that much easier to share data for collaboration. If you are interested in learning how to do web scraping directly, via a programming language, I can direct you to two pretty good tutorials:
- If Ruby is your thing, you can check out my colleague Jason Heppler’s blog post on using Nokogiri for scraping, or
- There is also an extremely thorough tutorial on Beautiful Soup, a Python-based web scraping suite, at Python for Beginners.
Now, if you haven’t already done so, go ahead and download Google Chrome (click here to download it for your operating system) and the Google Chrome Scraper from the Google Store (click here and follow the directions for getting it installed). Once you have this add-on installed, we can begin the really easy task of unleashing your data from the query results page.
Before we commence with scraping, let’s make this task that much easier by modifying the search results settings in Google so that you have fewer pages to scrape. On the right side of your results screen, you will see a cog. Click it, and then select “Search Settings.”
On the next screen, you will see numerous options. Let’s focus on the third one, labeled “Results per Page.” If you haven’t already changed this, you will want to move the slider all the way over to the right so that you are displaying 100 results per page. Although modifying this setting will cause Google to display your results more slowly, it really speeds up the scraping process. Once you are finished scraping, if you so desire, come back and change this back to whatever setting you wish.
Now that we have optimized your results display for web scraping, let’s get started scraping! What you will want to do first is to right-click anywhere on the first page of your query results and select “Scrape similar….” What this will do is launch the Google Chrome Scraper.
Once you have clicked “Scrape similar…”, you should see the Scraper interface.
The first thing you will want to do is make sure that you have selected XPath. All of the web scraping in this tutorial will be done using XPath expressions, so we will not be doing any of our scraping with jQuery today. However, you have that option if you ever want to harvest data from a website using this tool in the future.
Once you have confirmed that the Selector is set to XPath, you should copy/paste the following reference into the XPath Reference box:
What this expression does is tell the scraper exactly where to go on the query results page. I won’t go into every single aspect of this declaration, but what I will say is that it points directly to each and every one of your Google book results.
Now, we need to identify each piece of information that we want to harvest from the query results page. For the purposes of our study, we obviously wanted the title of each book. Since I have already pointed the scraper to the div container that holds the metadata for each book entry, it is really easy for me to extract the individual pieces of information for each book as an array (which the Google Chrome Scraper does automagically and will be really important when we go to export the data). For our study, we wanted to harvest the title, author(s), year, description (HTML results snippet), and the link for preview in Google Books.
And here are those specific XPath expressions in a format that you can copy/paste into your Scraper:
- Title: ./h3
- Author: ./div/div/div/a
- Year: ./div/div/div
- Description: ./div/div/span
- Link: ./h3/a/@href
The first thing I would like to point out about these XPath expressions for those of you not already familiar with XPath is that they all begin with a period [.]. The Title, Author, Year, Description, and Link XPath expressions are all what we call relative paths. What the period does is tell the scraper to start from the current element, that is, each element matched by the expression we put in the XPath Reference Expression box at the top, and to follow the rest of the path from there. Basically, the period [.] saves us from typing the full path every time. Secondly, you will notice that the Author and Year XPath expressions are extremely similar. The reason for this is that Google renders the Year as the contents of the div that contains a sub-element “a” whose contents are the Author’s name. Basically, Google has made the book’s Year the parent of the author’s name.
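For those who would like to see how relative paths behave outside of the Scraper, here is a minimal sketch using Python's built-in xml.etree.ElementTree module. The markup is a hypothetical, heavily simplified stand-in that I made up to mirror the nesting described above; it is not Google's actual HTML, and the book, author, and URL are invented. ElementTree supports only a subset of XPath and cannot end a path with an attribute step like @href, so the link is read with .get("href") instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mirrors the nesting described in the post.
page = """
<results>
  <div class="result">
    <h3><a href="https://example.org/books?id=1">A Sample Book</a></h3>
    <div><div>
      <div><a>Jane Doe</a> - 2009 - Preview</div>
      <span>I am grateful to the library staff ...</span>
    </div></div>
  </div>
</results>
"""

root = ET.fromstring(page)
rows = []
# "./div" stands in for the XPath Reference: it selects each result container.
for result in root.findall("./div"):
    rows.append({
        # Each path below is relative to the current result element.
        "title": "".join(result.find("./h3").itertext()).strip(),
        "author": result.find("./div/div/div/a").text,
        # The Year div also contains the author link, so its text includes both.
        "year": "".join(result.find("./div/div/div").itertext()).strip(),
        "description": result.find("./div/div/span").text,
        "link": result.find("./h3/a").get("href"),
    })

print(rows[0])
```

Note how the "year" value carries the author's name along with it, for the same parent/child reason described above.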
Once you have copied/pasted all of the XPath expressions into the Scraper for the data you would like to harvest from the query page, go ahead and click the Scrape button to view your data harvest in the preview window.
One thing you will notice in your preview window is that the Year contains the author’s name and some other extraneous text. Remember earlier when I pointed out that the div containing the Year data was the parent of the a that contained the author’s name? Well, in order to get the year we have to get the rest of that information as well: there is no other way around it. No fear though, you can get rid of that extra information using basic Find/Replace, more advanced Regular Expressions Find/Replace, Google Refine, or any other tool you can think of.
After verifying that you are harvesting the data you actually want, make sure to save this series of expressions as a preset for you to use later. This way, you don’t have to come back and re-copy/paste everything back into the Scraper. You can do this by clicking the Presets button, assigning a name to this setting (#libthanks Google Books Harvest is our personal favorite), and then hitting Save.
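If you prefer to do that cleanup with a script, a regular expression along these lines can pull a four-digit year out of the harvested cell. This is only a sketch: the sample strings are invented, the exact layout of the Year cell in your results may differ, and the pattern assumes years between 1500 and 2099.

```python
import re

# Matches a standalone four-digit year from 1500 to 2099 (an assumed range).
YEAR = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")

def extract_year(cell):
    """Return the first plausible year found in a scraped cell, or None."""
    match = YEAR.search(cell)
    return int(match.group()) if match else None

# Invented examples of what a harvested Year cell might look like.
print(extract_year("Jane Doe - 2009 - Preview"))
print(extract_year("No date available"))
```

Returning None for cells without a year makes it easy to spot rows that need a manual look.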
Finally, go ahead and click the Export to Google Docs… button on the bottom-right side of the Scraper window to see your query data transformed into a beautiful spreadsheet in Google Drive. From here, you can begin processing your data: confirming if each book actually contains an acknowledgement for your library, or any other transformation of the data that you desire.
By using the Google Chrome Scraper with its built-in function of exporting to Google Docs, we have the ability to work with others on this dataset in a collaborative manner, as well as transform our results into dynamic visualizations and maps using Google Fusion Tables and Google Maps.
Oh and if you are interested in learning more about the actual results from our analysis of Stanford acknowledgements, come check out our panel at DLF (or watch for a blog post version of that presentation that will likely make its way either to this site or the Feral Librarian).
Until next time….
Update September 25, 2013 –
If you’re interested in seeing how this same process is used with results from a tool other than Google, check out my complementary blog post on the Stanford Digital Humanities website. In that post, I demonstrate the use of Google Chrome Scraper on results from Opening Night! Opera and Oratorio Premieres. This is a Ruby on Rails web application that performs dynamic queries to a SOLR index. If you’re interested in seeing how I unleash metadata about operas and oratorios that were inspired by the works of Shakespeare so that we all can explore the proliferation of his literary influence, check out “Shakespeare Goes to the Opera.”
Recently, Chris Bourg shared in her blog how we will be presenting the findings from our “proof of concept” inquiry into assessing library impact by text mining acknowledgements at DLF in Austin, Texas, November 4-6, 2013. We have been given the challenge to present this paper in only 7 minutes. Needless to say, rather than sharing all of the wonderful, nitty-gritty details about our text analysis/mining methodology during the presentation, I thought it might be nice to outline some of the details of our methodology before we share all of our really fascinating results at DLF: an appetizer before the main course, if you will.
Before we were able to analyze who was being thanked in our library and the nature of those acknowledgements, we had to literally search for acknowledgements. Again, keeping in mind that this was a proof-of-concept endeavor, we decided to limit our inquiry to books and to find them using Google. Through a series of experiments, we discovered that the most robust and effective syntax for identifying candidates for our corpus was the following query in Google Books:
I will walk you through this syntax step-by-step and explain why we constructed this the way we did. First, let’s take a look at the individual operators.
- & can be used interchangeably with AND, and signifies words or expressions that all must be found in the search
- | can be used interchangeably with OR to include more than one term to be found in your search
- “” are used to identify words or phrases that must be found exactly as they are typed
- ~ indicates a word that can be found with flexibility (i.e. librarian in addition to library)
- () require that the terms or expressions found within them be evaluated first; this is also known as nesting
Thus, our expression can be read as the following narrative:
The reason we decided to include Special Collections is because we were noticing in our initial, experimental queries that authors would most frequently refer to that part of our library, as well as specific archives. It was this same line of logic that caused us to also include Green and Cecil H. Green along with various combinations of Stanford (University Library/ies).
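If you find yourself experimenting with many variations of a query, it can help to compose the string programmatically. The Python below is only a sketch of how the operators described above nest; the helper names are hypothetical, and the example terms are illustrative and are not the query we actually used.

```python
def exact(phrase):
    """Wrap a phrase in quotes so it must be matched exactly."""
    return f'"{phrase}"'

def flexible(word):
    """Prefix a word with ~ so it can be matched with flexibility."""
    return "~" + word

def any_of(*terms):
    """Join alternatives with | inside parentheses (nesting)."""
    return "(" + " | ".join(terms) + ")"

def all_of(*parts):
    """Join required parts with &."""
    return " & ".join(parts)

# Illustrative terms only; this is not the query from our study.
query = all_of(
    flexible("library"),
    any_of(exact("Special Collections"), exact("Green Library")),
)
print(query)
```

Building the string from small pieces makes it easier to add or remove an alternative without mismatching parentheses or quotes.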
When I submitted this query to Google Books just now, I received “About 981,000 hits.” Now, for the purposes of our specific inquiry, we decided to limit to the last 10 years, so from 2003 to 2013. In order to limit the Google query results to a specific year range, we need to select “Search Tools.” Next, click “Any Time,” and “Custom Range.”
You will notice that a form with calendar view will appear. This is where you need to input the date range in which you are interested. This tool is extremely useful in that you can specify date ranges down to specific days. Keep in mind that if you just put in a year, like in the picture below, it will perform a search from January 1st of the start year to December 31st of the ending year.
Now, you will find that your search has been limited to the range you have specified.
These results are automatically sorted by relevance. However, you may find it useful to change the sorting to Sort by Date.
This completes Part 1 of “A Method for Measuring Thanks.” The next step in this methodology will cover the automated harvesting of the results of our Google Books query for verification before data/text analysis can commence.
I hope that you have found this little tutorial useful, and be on the lookout for Part 2 of this methodological series in the coming weeks.
One of the questions I am commonly asked during my escapades as a linguist working in a research library is, “what kind of linguist are you?” When I answer that I’m a corpus linguist, I generally encounter the follow-up question, “what is that?” After answering this question too many times to count, I thought it might make sense to write a post explaining what corpus linguistics is for those people who are not linguists: in particular for all of my librarian friends.
For many of us, in order to understand an abstract concept like language, we need a visual representation to see, touch, and even manipulate: a model. It is through modeling that we “make the best and most productive sense through what we observe” (McCarty 1). In situations where the object of study is abstract, the best method for making explicit the implicit intuition we may have about a particular subject is the use of models.
In the field of language study, corpus linguistics is one modeling methodology that allows the use of “real life language” sampled from the world in which it is used. McEnery and Wilson define corpus linguistics as “the study of language based on examples of real life language use” (McEnery and Wilson 1). However, others like John Sinclair argue that corpus linguistics is more: it is a systematic collection of naturally occurring texts, or “a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (2004). Why is it important then to create a model that is composed of samples of real language, texts if you will?
Language, social action, and knowledge all coexist. In fact, the way in which words are used “can reveal relations between language and culture: not only relations between language and the world, but also between language and speakers with their beliefs, expectations and evaluations” (Stubbs 6). Whether or not we are conscious of it, we have these expectations for the language we use every day.
Our expectations for language are dependent on our non-linguistic knowledge from the everyday world: “meanings are not always explicit, but implicit. Speakers can mean more than they say” (Stubbs 20). The use of corpus linguistics as a model for learning more about language is rooted in the desire to “develop a theory of meaning” (Teubert 1999a, 1999b). If we look for recurring patterns of words as they are used in different contexts in large collections of textual data, then we can have evidence and quantifiable support for our intuitions as to how meaning is constructed through language. The way in which we are able to accomplish this task with corpus linguistics is by evaluating the most basic units of meaning in language: words and phrases and how they occur together.
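As a rough illustration of what looking for recurring patterns of words can mean in practice, here is a short Python sketch that counts two-word sequences (bigrams) and keeps the ones that repeat. The function name and the sample sentence are invented; a real study would run something like this over a large, carefully sampled corpus.

```python
from collections import Counter
import re

def ngrams(text, n=2):
    """Count every run of n consecutive words in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Invented sample text.
sample = "thanks to the staff of the library and thanks to the archivists"

bigrams = ngrams(sample, 2)
recurring = {gram: count for gram, count in bigrams.items() if count > 1}
print(recurring)
```

Counts like these are the quantifiable evidence; deciding what the recurring phrases mean is still up to the researcher.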
When it comes to meaning, we often associate the notion of being fixed with denotation, the “cognitive, conceptual, logical, ideational and propositional meaning….the ‘literal meaning’ of a word.” However, there is another type of meaning words possess called connotation, “which is also called affective, associative, attitudinal and emotive meaning” (Stubbs 34). These two types are often contrasted such that denotation is usually assumed to be the meaning associated with a word that is stylistically neutral and not dependent on the relationship a speaker or hearer has with the word; the latter association is relegated to connotation. The difference between these two types of meaning is not always distinct, especially when it comes to which one is primary or secondary in the context of how the word in question is being used.
The meaning we each associate with words and phrases in our language does not refer directly to the world around us. Instead, it indirectly points to our notions of what those words and phrases mean, based on our past experiences. For example, when you read the word mouse, its meaning does not come from the combination of the letters m-o-u-s-e, but rather your cognitive representation of mouse emerges from your past experiences, your reality, where this word was used. The meaning you associate with this word might be most strongly connected with something small and furry that you may only want living in your house if it is in a cage, or it may be a peripheral device for your computer. It is both, and you probably have still more meanings, like a name for a ‘black eye’ or for a timid person. However, we also associate connotative meanings with a word like mouse. We may have negative feelings for our cognitive representation of this word through our past associations of it with disease or filth. On the other hand, we may have a positive connotation for mouse when we think about a certain cartoon character from our childhood. All of these meanings are individual, based on our expectations and past experience in the use of these words, and it is the co-occurrence of a word like mouse with dirty versus Mickey that helps us to understand its meaning.
The way in which we are able to make observations about words is through the use of corpora. A corpus is “a collection of pieces of language text [most often] in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair 2004). Corpus data DOES NOT interpret itself: “it is up to the researcher to make sense of the patterns of language which are found within a corpus, postulating reasons for their existence or looking for further evidence to support hypotheses. Our findings from corpus linguistics are interpretations…” (Baker 18). The reason for this is that corpora cannot explain why. They can only demonstrate what is happening in a language. A corpus cannot tell you why certain patterns occur in language—as we have already discussed, this is where intuition comes into the picture. A corpus can only tell you what happens within it, and with statistics it can help you understand the propensity for those things to happen: “how we know what we know.” We need to marry an explicit model with intuition because we can know more than we can tell (McCarty 2).
So, that’s a pretty brief explanation of what corpus linguistics is to me. Stay tuned for future discussions on how corpus linguistics can be leveraged to do really amazing things in the library, as well as some best practices for creating corpora and analyzing them. For a bit of a teaser on this topic, go check out The Feral Librarian’s blog post “Beyond Measure: Valuing Libraries.” Corpus linguistic approaches were used to generate the data she shares on mining acknowledgements as a way to assess and measure impact.
Until next time…
My name is Jacqueline Hettel (or Jacque–pronounced “Jackie”), and I am so excited to write this inaugural post for A Linguist in the Library. I am a corpus linguist by training: my PhD from the University of Georgia is in English Language Studies where I specialized in Digital Humanities. Currently, I am a Digital Humanities Developer for Stanford University Libraries where I get to do all manner of cool things like create web-based applications for interacting with library resources, consult with faculty on their amazing projects, help train students in DH best practices that they can use beyond graduation, and even *gasp* corpus linguistics (aka text mining and analysis).
This blog is an attempt to share my escapades as a word nerd working in a research library (more details on the journey that led me to pursue an alt-ac career to follow). I love my job, and I feel as though my experiences and insight as a corpus linguist in the library could be useful to many other people. So, if you are looking for tutorials and ideas for how to include more text mining, text analysis, or lexical profiling into your library work, you have found the right place. Looking for commentary and reviews on Digital Humanities resources, tools, literature, and current research? Again, this is the place. You can also often expect to find my own thoughts and opinions about the state of Digital Humanities at large, as well as life as an alt-ac professional, on this site.
I would also like to invite y’all (yes, I grew up in Arkansas) to take part in my adventures as A Linguist in the Library. Please, leave comments! I love them. As a writer and humanist, I love feedback (even if you don’t agree with me). One thing I do ask, however, is that no matter how critical your feedback is, you keep it on the constructive side of things. If you don’t feel comfortable posting comments to this site, you are also welcome to send me an e-mail at firstname.lastname@example.org. Do you tweet? Follow me @jacquehettel as well. I look forward to getting to know all of you and am excited to share my journey with you.