A Family Portrait of Computational Language Analysis

In this week’s post, I want to address a question that I have been asked quite often since coming to Stanford University Libraries, one that also made its way into a comment on Michael Widner’s blog post recounting a recent discussion in the DH reading group here at Stanford on Matt Jockers’s new book, Macroanalysis: “I’m intrigued by the division within linguistics that Hettel describes; if there were a link to a blog post or something like that explaining it, I’d be interested in following up and learning more.” I searched for an existing blog post on this topic and came up empty. Bill Kretzschmar does discuss this dichotomy in Chapter 2 of The Linguistics of Speech, titled “Saussure,” so it is from this chapter that I will draw the foundation for a portrait of computational language analysis. If this topic interests you, I strongly suggest you read that chapter for yourself, especially if you want more after reading this post.


There are many ideologies at play within the field of linguistics as it exists today, but for the most part we can trace all of them back to the person I like to call the grandfather of modern linguistics: Ferdinand de Saussure. Saussure was a Swiss linguist and semiotician working in the late 19th and early 20th centuries, and he is considered by most scholars to be one of the founders of modern linguistics. Even more interestingly, the book from which we learn about his theories for investigating language, Cours de linguistique générale (Course in General Linguistics), was published after his death from the notes of his former students Charles Bally and Albert Sechehaye. In the Course, Saussure is recounted as proposing that there are two faces to linguistic inquiry: langue and parole.

Langue is the linguistics of language structure, for example grammar or language rules and systems. Saussure is attributed as saying that two of the aims of langue are “to describe all known languages and record their history, and to determine the forces operating permanently and universally in all languages, and to formulate general laws which account for all particular linguistic phenomena historically attested” (Kretzschmar 40). It should also be mentioned that Saussure is noted as proclaiming the importance of langue over parole. However, “the choice of linguistic structure is not inevitable, not ‘natural’ in the sense that it corresponds to an inborn faculty or property of a species; it is the nucleus of an argument to create a science of linguistics, one based on a model with particular premises and with a definite arrangement of its variables” (Kretzschmar 44). Moreover, Saussure himself is even noted as saying that langue is not necessarily more important than parole, but that during his time it was not possible to truly investigate parole (something that technology has since helped to remedy).

Parole is what people actually say: speech, or the individual combinations of words, which can be influenced by the will of the speakers as well as voluntary physical factors (like phonation). From this perspective, parole (or speech) can be extended to language behavior, or how people actually use language, both written and spoken. Ultimately, parole entails the voluntary language acts that result from an individual speaker’s or writer’s will. So the linguistics of speech, or the study of language behavior, views linguistic acts as individual and ephemeral: they are always context-dependent, and we can better understand the behaviors of groups of individuals through the patterns that emerge from the aggregate of their spoken or written language use.

The posthumous proliferation of Saussure’s ideas led to some very different approaches to linguistic inquiry in the 20th century. Today, I want to focus on the two men most often considered the fathers of the two very different methodological and ideological perspectives present in computational language analysis (which does include computational approaches to understanding and interpreting literature): J.R. Firth and Noam Chomsky. Chomsky fathered a movement to better understand language by focusing on structure, while Firth aspired to understand language as a behavior. Firth is best known for his perspective that language use is context-dependent (or external to a person), while Chomsky has argued for the notion of Universal Grammar, or grammar being hard-wired in the brain (internal to a person). This division also extends to the ways in which these two camps approach meaning. Firth is most often quoted as having said that “You shall know a word by the company it keeps.” In other words, linguistic inquiry from what eventually became known as the “London School” or Neofirthian perspective (linguists like M.A.K. Halliday and John Sinclair) advocates for investigating connotative meaning through analysis of the words that co-occur around a node word, its collocates; thus a word can have many different meanings. A generative (or Chomskyan) inquiry, by contrast, would most often focus on finding denotative meaning, the one essential meaning of a word, which can be traced back to the notion of a person’s hard-wiring for language. We can reduce these two perspectives to a focus on structure (Chomsky/Generativist) versus function (Firth/Neofirthian). These two very different manifestations of langue and parole can now be found in the work of modern computational linguists like Bill Kretzschmar and Christopher Manning.
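To make the Neofirthian idea of collocation a bit more concrete, here is a minimal Python sketch of collocate counting; the toy sentences, the node word, and the window size are all invented for illustration, and real corpus tools are far more sophisticated than this.

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count the words that co-occur within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(w.lower() for w in left + right)
    return counts

# Invented toy corpus for illustration only.
text = ("the operators issued a report after the outage "
        "the report described a faulty valve "
        "inspectors reviewed the report before restart")
tokens = text.split()

# The collocates of the node word "report" hint at its connotative, in-context meaning.
print(collocates(tokens, "report").most_common(5))
```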

Kretzschmar’s work has been centered on descriptions of language as it is used by individuals: always dependent on context. On the other hand, Manning’s scholarship is centered on establishing the rules and structure of language through which machines can analyze and interpret it. In other words, Manning does computational langue while Kretzschmar utilizes methods based on computational parole. Logically, we call these approaches by different names. Kretzschmar has developed a model for linguistic analysis called the linguistics of speech (a method for investigating parole) in which language is documented and understood to behave as a complex system. Thus, the behaviors of the aggregate of language samples are always kept in their context through the use of corpus linguistics methodologies: concordance views, collocate analysis, semantic sets, cluster analysis that is linked back to concordances, and so on. Think of it as maintaining the connection between Close Reading and Distant Reading and traversing the continuum between the two in order to understand language behavior.
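For readers who have not used these tools, here is a rough keyword-in-context (KWIC) sketch in Python of what keeping the aggregate in its context can look like; the sample text and the search term are invented, and this is only a toy stand-in for a real concordancer.

```python
def kwic(tokens, node, width=4):
    """Return keyword-in-context lines: every hit plus `width` tokens of left and right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Keeping the token offset lets each line be traced back to its original context.
            lines.append((i, f"{left:>30} [{tok}] {right}"))
    return lines

# Invented sample text for illustration only.
tokens = ("the auxiliary feedwater valve shall be tested before startup and "
          "the valve position shall be logged in the control room").split()

for offset, line in kwic(tokens, "valve"):
    print(offset, line)
```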

Manning is considered one of the foremost scholars in natural language processing. This perspective operates under the assumption that language behaves systematically, in a normal or Gaussian way, with regard to its structure. The structure of language is analyzed and interpreted using automated models and procedures. Referring back to the point just made about the relationship between computational approaches and Close Reading/Distant Reading, NLP is primarily focused on a Distant Reading type of experience and does not necessarily afford scholars the ability to connect each higher-level computational data point back to where it actually occurs in its original context. This is the case when researchers employ “bag of words” methods, in which the words within a text are represented as an unordered collection. When this bag of words is analyzed, the results are not linked back to their original, contextual behavior. Again, this approach focuses on structure rather than function.
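As a minimal illustration of what gets discarded, here is a short Python sketch of a bag-of-words representation; the two sentences are invented, and real NLP pipelines add tokenization, normalization, and weighting on top of this.

```python
from collections import Counter

# Two invented sentences containing the same words in a different order.
doc_a = "the operator closed the valve".split()
doc_b = "the valve closed the operator".split()

# A bag of words records only which words occur and how often,
# discarding word order and therefore the original context.
bag_a, bag_b = Counter(doc_a), Counter(doc_b)

print(bag_a == bag_b)  # True: as bags of words the two documents are indistinguishable
```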

It should also be noted that both of these perspectives utilize machine learning. Neural network algorithms and other logarithmically based models are preferred by many scholars who believe language behaves as a complex system (“Neural Networks and the Linguistics of Speech”), while the NLP community frequently uses parsers and taggers like the Stanford Named Entity Recognizer.
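To give a flavor of the latter, the Stanford Named Entity Recognizer can be called from Python through NLTK’s wrapper; the sketch below is just that, a sketch: it assumes you have separately downloaded the Stanford NER distribution and have Java installed, and the model and jar paths are placeholders for wherever those files live on your machine.

```python
from nltk.tag import StanfordNERTagger

# Placeholder paths: point these at your local copy of the Stanford NER
# distribution (the tagger also requires Java to be installed).
ner = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",
    "stanford-ner/stanford-ner.jar",
)

tokens = "Christopher Manning teaches at Stanford University in California".split()
print(ner.tag(tokens))
# Expected shape: [('Christopher', 'PERSON'), ('Manning', 'PERSON'), ..., ('Stanford', 'ORGANIZATION'), ...]
```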

I know that this is all very reductive, an extreme oversimplification, and that there are many flavors, combinations, and shades of gray between these two ideological perspectives. This is in no way meant to be a complete analysis of the divisions within the linguistic community, but rather a CliffsNotes-style version for those who have found themselves bewildered by the notion that there are some very different points of view on how to analyze language. As you can probably already tell, I’m pretty passionate about helping people understand that there are many tools and approaches out there for you to use, and I’m even more eager to help people pick the right tools for what they are trying to do. This purpose is at the heart of what I was sharing with my friends in the Stanford DH reading group, as well as what started the discussion that ultimately led to this post.

Thus, I would like to end by issuing a call to action for all of us, myself included, who are employing computational language approaches to literature (personally, I do it for the “literature” of the nuclear industry/government-regulated business domain). Those of us who do computer-based text analyses of written or spoken language, whether they are linguistic interviews or literature, have an obligation to understand, acknowledge, and explore the implications of these differences. We need to work to better understand not only the methodologies we use, but also those methodologies we might not choose to use. Moreover, we need to increase our transparency regarding the relationships between the methodologies we use and the ideologies to which those approaches subscribe, because any time we use a particular methodology, there are always limitations and implications of its use. Having more dialogue about why we choose the approaches we choose, sharing the steps of our methodologies, and sharing our raw data are some of the ways in which we can do this, and many of us are doing this already: people using structural approaches like Matt Jockers and Ted Underwood, and those of us focused on function like myself and Heather Froehlich, just to name a few (and many more that it would be amazing to see listed in the comments to this post). This type of public accountability, or as my boss the Feral Librarian recently called it, “the ethos of sharing,” will help us make sure that our choices are a result of what is best for our data and the types of analyses we want to perform, and will ultimately result in significantly more profound insights.


I’d like to thank Heather Froehlich and Chris Bourg for their insights and comments while I was drafting this post.

3 Comments

  1. tedunderwood
    Oct 8, 2013

    Fascinating, Jacqueline. Thanks for answering my question in Mike Widner’s post. I’ll have to follow up in the Kretzschmar chapter and learn more about this genealogy.

    At first blush, I don’t entirely recognize myself here. I think of myself as primarily interested in the way language is inflected by social context (e.g. by genre and historical change). I haven’t spent much time thinking about Chomsky’s sort of grammar. I thought I was inheriting bag-of-words methods mainly from the information-retrieval tradition in CS; to the extent that those methods had a theoretical justification in linguistics, I thought it was “the distributional hypothesis,” more associated with Firth than with Chomsky. But I may not fully understand the history of the disciplines involved here.

    The part of this I more easily recognize is a disciplinary tension between different sizes of contextual window. In my experience linguists tend to be interested in smaller windows (e.g. collocations or sentence/paragraph-level associations), and people working in IR or KDD tend to be more interested in larger windows (e.g. document-level bags of words). (Not that one has to make a firm either-or choice between those options.)

    In any case, I also understand that this is a family portrait rather than an individual sketch, and the parts of this I don’t yet understand may reveal historical connections between CS and linguistics that I should explore further. Many thanks!

    • Jacqueline Hettel
      Oct 8, 2013

      Ted, thanks for the great comments. One thing I want to address is your comment about the disciplinary tension between different sizes of contextual window. While that may be the case, window size is definitely an implicating factor in methodology choice for linguists: there are language researchers who prefer looking at small windows, there are those who want to look at the bigger picture, and most importantly there are those of us who want to do both!

      For example, my dissertation research dealt with what you describe as a larger window. I was interested in looking at differences in meaning as they manifested in different contexts across documents in the nuclear power industry. I looked at how Observed Meaning varied with regard to geography and industry group (social/industry factors), and guess what, it does! The great thing about using the computational stylistic/corpus linguistic approach for my study was that I was able to travel seamlessly from the large window (document/regional/industry group scale) to the small window (word clusters within an individual sentence inside an individual document) to really understand how this was happening.

      Anyway, all of this writing/discussion has me thinking that maybe it would make sense to write something up that builds on this family portrait and provides more of the historicity and specificity. Thank you so much for asking the question that inspired this post. It is creating a lot of really great discourse re: the methodologies we use for text analysis.

    • Jacqueline Hettel
      Oct 8, 2013

      Also, IR is based on NLP theory. This is made extremely explicit in Introduction to Information Retrieval (co-authored by Manning). IR treats language/information from a structural perspective, as a matter of a term being present or not, through the use of Boolean logic. I believe term weighting and the computation of relevancy scores/relevancy algorithms are built on top of this approach. We can link this directly to Generativist approaches, where it is assumed that language is hard-wired in the brain and that meaning and use occur systematically in “correct” patterns. Thus, the bag of words approach for analyzing documents is totally fine from that perspective, because you’re just looking for presence or absence and don’t necessarily need to get back to the individual instances.
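      (To make the presence-or-absence point concrete, here is a tiny, invented Python sketch; the documents are made up, and this is only a caricature of what real IR systems like the ones Manning describes actually do.)

      ```python
      import math

      # Invented toy documents for illustration only.
      docs = {
          "d1": "the pump seal failed during the outage".split(),
          "d2": "the valve was repaired during the outage".split(),
          "d3": "routine inspection of the pump".split(),
      }

      # Boolean retrieval: a document either contains the term or it does not.
      def boolean_hits(term):
          return {doc_id for doc_id, words in docs.items() if term in words}

      print(boolean_hits("pump"))  # {'d1', 'd3'}

      # A simple tf-idf style weight: terms frequent in a document but rare across
      # the collection score high, yet the hits are still not tied back to context.
      def tf_idf(term, doc_id):
          tf = docs[doc_id].count(term)
          df = len(boolean_hits(term))
          return tf * math.log(len(docs) / df) if df else 0.0

      print(round(tf_idf("pump", "d1"), 3))
      ```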

      Corpus linguistic methodologies are useful for those of us who know language behaves as a complex system (most likely ~20% or less of your samples/population will have the “on” presence needed for Boolean logic), and thus we need to preserve context to better understand the majority of the samples/population. We are able to do this by preserving the relationship, the original context, of the language. Biber’s work on genre is one of the first and primary examples of corpus approaches to this kind of analysis. However, this is something a lot of us schooled in Neofirthian approaches do even as graduate students: I’ve done it myself looking at differences in narrative structure, Heather Froehlich does it with Early Modern London plays, and the list goes on.

      So, while your analytical perspective is context-based (genre, historical change), your methodological approaches do not respect the context of the texts you analyze: in fact, NLP/IR-based methods completely divorce the language of the literature from its original context. I would be interested in seeing what a computational stylistic/context-respecting methodology would bring out of your data.

      You’re right that this does deserve/need more investigation, but I wanted to just put that out there. Thanks again.

