A Family Portrait of Computational Language Analysis
In this week’s post, I want to address a question that I have been asked quite often since coming to Stanford University Libraries, and one that made its way into a comment on Michael Widner’s blog post recounting a recent discussion in the DH reading group here at Stanford on Matt Jockers’s new book, Macroanalysis: “I’m intrigued by the division within linguistics that Hettel describes; if there were a link to a blog post or something like that explaining it, I’d be interested in following up and learning more.” I searched for an existing blog post on this topic and came up empty. Bill Kretzschmar does discuss this dichotomy in Chapter 2 of The Linguistics of Speech, titled “Saussure,” so it is from this chapter that I will draw the foundation for a portrait of computational language analysis. However, if this is something that interests you, I strongly suggest you read the chapter for yourself–especially if you want more after reading this post.
There are many ideologies at play within the field of linguistics as it exists today, but for the most part we can trace all of them back to the person I like to call the grandfather of modern linguistics: Ferdinand de Saussure. Saussure was a Swiss linguist and semiotician during the late 19th century and early 20th century and is considered by most scholars to be one of the founders of modern linguistics. Even more interestingly, his book on general linguistics from which we learn about his theories for investigating language, Cours de linguistique générale (Course in General Linguistics), was published after his death from the notes of his former students Charles Bally and Albert Sechehaye. In the Course, Saussure is recounted as proposing that there are two faces to linguistic inquiry: langue and parole.
Langue is the linguistics of language structure, for example grammar or language rules and systems. Saussure is credited with saying that the aims of langue are “to describe all known languages and record their history, and to determine the forces operating permanently and universally in all languages, and to formulate general laws which account for all particular linguistic phenomena historically attested” (Kretzschmar 40). It should also be mentioned that Saussure is noted as proclaiming the importance of langue over parole. However, “the choice of linguistic structure is not inevitable, not ‘natural’ in the sense that it corresponds to an inborn faculty or property of a species; it is the nucleus of an argument to create a science of linguistics, one based on a model with particular premises and with a definite arrangement of its variables” (Kretzschmar 44). Moreover, Saussure himself is even noted as saying that langue is not necessarily more important than parole, but that during his time it was not possible to truly investigate parole (something that technology has since helped to remedy).
Parole is what people actually say: speech, or the individual combinations of words, which can be influenced by the will of the speakers, as well as voluntary physical factors (like phonation). From this perspective, parole (or speech) can be extended to language behavior, how people actually use language: both written and spoken. Ultimately, parole entails the voluntary language acts that result from an individual speaker/writer’s will. So, the linguistics of speech, or the study of language behavior, views linguistic acts as individual and ephemeral: they are always context dependent, and we can better understand the behaviors of groups of these individuals by the patterns that emerge from the aggregate of their spoken or written language use.
The posthumous proliferation of Saussure’s ideas led to some very different approaches to linguistic inquiry in the 20th century. Today, I want to focus on the two men most often considered the fathers of the two very different methodological and ideological perspectives present in computational language analysis (which does include computational approaches to understanding and interpreting literature): J.R. Firth and Noam Chomsky. Chomsky fathered a movement to better understand language by focusing on structure, while Firth aspired to understand language as a behavior. Firth is best known for his perspective that language use is context-dependent (or external to a person), while Chomsky has argued for the notion of Universal Grammar, or grammar being hard-wired in the brain (internal to a person). This division also extends to the ways in which these two camps approach meaning. Firth is most often quoted as having said that “You shall know a word by the company it keeps.” In other words, linguistic inquiry from what eventually became known as the “London School” or Neofirthian perspective (linguists like M.A.K. Halliday and John Sinclair) advocates investigating connotative meaning through the analysis of the words that co-occur around a node word (its collocates); thus a word can have many different meanings. A generative (or Chomskyan) inquiry, by contrast, would most often focus on finding denotative meaning, the one essential meaning of a word, which can be traced back to the notion of a person’s hard-wiring for language. We can reduce these two perspectives to a focus on structure (Chomsky/Generativist) versus function (Firth/Neofirthian). These two very different manifestations of langue and parole can now be found in the work of modern computational linguists like Bill Kretzschmar and Christopher Manning.
Kretzschmar’s work has centered on descriptions of language as it is used by individuals: always dependent on context. On the other hand, Manning’s scholarship centers on establishing the rules and structure of language through which machines can analyze and interpret it. In other words, Manning does computational langue while Kretzschmar utilizes methods based on computational parole. Naturally, we call these approaches by different names. Kretzschmar has developed a model for linguistic analysis called the linguistics of speech (a method for investigating parole) in which language is documented and believed to behave as a complex system. Thus, the behaviors of the aggregate of language samples are always kept in their context by the use of corpus linguistics methodologies. Think of it as maintaining the connection between Close Reading and Distant Reading and traversing the continuum between the two in order to understand language behavior: i.e. concordance views, collocate analysis, semantic sets, cluster analysis that is linked back to concordances, etc.
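To make the parole-side methods a little more concrete, here is a minimal sketch (in Python, with illustrative function names of my own–not Kretzschmar’s actual tooling) of two of the workhorse corpus-linguistics views just mentioned: a collocate count, which tallies the “company a word keeps,” and a KWIC (keyword-in-context) concordance, which keeps every hit tied to its original context:

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words co-occurring within `window` tokens of each
    occurrence of `node` -- Firth's 'company it keeps'."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

def concordance(tokens, node, window=4):
    """Return KWIC lines, so each occurrence of `node`
    stays linked to the context it came from."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{node}] {right}")
    return lines
```

The point of the sketch is the second function: unlike an aggregate statistic, a concordance view lets you traverse back from any pattern to the exact passage that produced it.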
Manning is considered one of the foremost scholars in natural language processing. This perspective operates under the assumption that language behaves systematically in a normal (Gaussian) way with regard to its structure. The structure of language is analyzed and interpreted using automated models and procedures. Referring back to the point just made about the relationship between computational approaches and Close Reading/Distant Reading, NLP is primarily focused on a Distant Reading type of experience and does not necessarily afford scholars the ability to connect each higher-level computational datapoint back to where it actually occurs in its original context. This is the case when researchers employ “bag of words” methods, where the words within a text are sampled/represented as an unordered collection. When this “bag of words” is analyzed, any results are not linked back to their original, contextual behavior. Again, this approach focuses on structure rather than function.
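By contrast, a “bag of words” representation can be sketched in a few lines (the function name here is illustrative, not drawn from any particular NLP library). Notice how the word counts survive while all ordering–and thus all context–is discarded:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as an unordered multiset of word counts.
    Word order, and therefore original context, is thrown away."""
    return Counter(text.lower().split())

bow = bag_of_words("To be or not to be")
# bow records that 'be' occurs twice, but there is no way to
# recover which 'be' occurred where, or what words surrounded it.
```

This is exactly the trade-off described above: the representation is convenient for automated, structure-oriented analysis, but the path back to Close Reading is cut.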
It should also be noted that both of these perspectives utilize machine learning. Neural network algorithms and other logarithmically based models are the preference of many scholars who believe language behaves as a complex system (“Neural Networks and the Linguistics of Speech”), while the NLP community frequently uses parsers and taggers like the Stanford Named Entity Recognizer.
I know that this is all very reductive and an extreme oversimplification, and that there are many flavors, combinations, and shades of gray between these two ideological perspectives. This is in no way meant to be a complete analysis of the divisions within the linguistic community but rather a CliffsNotes-style version for those who have found themselves bewildered by the notion that there are some very different points of view on how to analyze language. As you can probably already tell, I’m pretty passionate about helping people understand that there are many tools and approaches out there for you to use, and I’m even more eager to help people pick the right tools for what they are trying to do. This purpose is at the heart of what I was sharing with my friends in the Stanford DH reading group, as well as what started the discussion that ultimately led to this post. Thus, I would like to end by issuing a call to action for all of us, myself included, who are applying computational language approaches to literature–personally, I do it for the “literature” of the nuclear industry/government-regulated business domain. Those of us who do computer-based text analyses of written or spoken language (whether linguistic interviews or literature) have an obligation to understand, acknowledge, and explore the implications of these differences. We need to work to better understand not only the methodologies we use, but also those we might not choose to use. Moreover, we need to increase our transparency regarding the relationships between the methodologies we use and the ideologies to which those approaches subscribe. Any time we use a particular methodology, there are limitations and implications to its use.
Having more dialogue about why we choose the approaches we choose, sharing the steps of our methodologies, and sharing our raw data are some of the ways in which we can do this–and many of us are doing this already: people using structural approaches like Matt Jockers and Ted Underwood, and those of us focused on function like myself and Heather Froehlich, just to name a few (and many more whom it would be amazing to see listed in the comments to this post). This type of public accountability–or, as my boss the Feral Librarian recently called it, “the ethos of sharing”–will help us make sure that our choices reflect what is best for our data and the types of analyses we want to perform, and ultimately result in significantly more profound insights.
I’d like to thank Heather Froehlich and Chris Bourg for their insights and comments while I was drafting this post.