
What Is Corpus Linguistics?

One of the questions I am commonly asked during my escapades as a linguist working in a research library is, “What kind of linguist are you?” When I answer that I’m a corpus linguist, I generally encounter the follow-up question, “What is that?” After answering this question too many times to count, I thought it might make sense to write a post explaining what corpus linguistics is for people who are not linguists, and in particular for all of my librarian friends.


For many of us, in order to understand an abstract concept like language, we need a visual representation to see, touch, and even manipulate: a model. It is through modeling that we “make the best and most productive sense through what we observe” (McCarty 1). When the object of study is abstract, a model is the best way to make explicit the implicit intuitions we may have about it.


In the field of language study, corpus linguistics is one modeling methodology that allows the use of “real life language” sampled from the world in which it is used. McEnery and Wilson define corpus linguistics as “the study of language based on examples of real life language use” (McEnery and Wilson 1). However, others like John Sinclair argue that corpus linguistics involves more: it rests on a systematic collection of naturally occurring texts, or “a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair 2004). Why is it important then to create a model that is composed of samples of real language, texts if you will?


Language, social action, and knowledge coexist. In fact, the way in which words are used “can reveal relations between language and culture: not only relations between language and the world, but also between language and speakers with their beliefs, expectations and evaluations” (Stubbs 6). Whether or not we are conscious of it, we have these expectations for the language we use every day.


Our expectations for language depend on our non-linguistic knowledge of the everyday world: “meanings are not always explicit, but implicit. Speakers can mean more than they say” (Stubbs 20). The use of corpus linguistics as a model for learning more about language is rooted in the desire to “develop a theory of meaning” (Teubert 1999a, 1999b). If we look for recurring patterns of words as they are used in different contexts in large collections of textual data, then we can have evidence and quantifiable support for our intuitions as to how meaning is constructed through language. We accomplish this with corpus linguistics by evaluating the most basic units of meaning in language: words and phrases, and how they occur together.
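For the curious, here is a minimal sketch of what “looking for recurring patterns of words in context” can look like in practice. It builds a tiny keyword-in-context (KWIC) listing in Python. Everything in it is an assumption made for illustration: the three-sentence toy corpus, the deliberately naive tokenizer, and the function names `tokenize` and `concordance` are mine, not part of any particular corpus tool.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, window=3):
    """Return a (left context, node, right context) tuple for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy corpus; a real corpus would hold thousands or millions of words.
toy_corpus = (
    "A dirty mouse ran across the kitchen floor. "
    "She clicked the mouse to open the file. "
    "The children love Mickey Mouse cartoons."
)

# Print each hit with its left context right-aligned so the node words line up.
for left, node, right in concordance(tokenize(toy_corpus), "mouse"):
    print(f"{left:>25} | {node} | {right}")
```

Lining up every occurrence of a word with its neighbors like this is what lets a researcher see patterns of use at a glance; interpreting those patterns is still up to the human reading them.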


When it comes to meaning, we often associate the notion of being fixed with denotation, the “cognitive, conceptual, logical, ideational and propositional meaning….the ‘literal meaning’ of a word.” However, words possess another type of meaning called connotation, “which is also called affective, associative, attitudinal and emotive meaning” (Stubbs 34). The two types are often contrasted: denotation is usually assumed to be the stylistically neutral meaning of a word, independent of the relationship a speaker or hearer has with that word, while that relationship is relegated to connotation. The difference between the two is not always distinct, especially when it comes to which one is primary or secondary in the context in which the word is being used.


The meaning we each associate with words and phrases in our language does not refer directly to the world around us. Instead, it indirectly points to our notions of what those words and phrases mean, based on our past experiences. For example, when you read the word mouse, its meaning does not come from the combination of the letters m-o-u-s-e; rather, your cognitive representation of mouse emerges from your past experiences, your reality, where this word was used. The meaning you associate with this word might be most strongly connected with something small and furry that you may only want living in your house if it is in a cage, or it may be a peripheral device for your computer. It is both, and you probably have still more meanings, like a name for a “black eye” or for a timid person. However, we also associate connotative meanings with a word like mouse. We may have negative feelings about our cognitive representation of this word through our past associations of it with disease or filth. On the other hand, we may have a positive connotation for mouse if it brings to mind a certain cartoon character from our childhood. All of these meanings are individual, based on our expectations and past experience in the use of these words, and it is the co-occurrence of a word like mouse with dirty versus Mickey that helps us to understand its meaning.
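The mouse example can be sketched in a few lines of Python by counting which words turn up near mouse in two different sets of texts. This is only an illustration under stated assumptions: the two mini “corpora” are invented sentences, the two-word window is an arbitrary choice, and `collocates` is a hypothetical helper name rather than a function from any existing library.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count the words found within `window` tokens on either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())  # naive lowercase tokenizer
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented example texts, standing in for two much larger collections.
pest_texts = "The dirty mouse chewed the wires. A dirty mouse spreads disease."
cartoon_texts = "Mickey Mouse waved at the crowd. We watched Mickey Mouse cartoons."

pest = collocates(pest_texts, "mouse")
cartoon = collocates(cartoon_texts, "mouse")

# Compare how often each candidate collocate appears near "mouse" in each set.
for word in ("dirty", "mickey"):
    print(word, pest[word], cartoon[word])
```

In the first set of texts mouse keeps company with dirty, and in the second with Mickey. Real studies use far larger corpora and association statistics rather than raw counts, but the principle of reading meaning off a word's neighbors is the same.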


The way in which we are able to make observations about words is through the use of corpora. A corpus is “a collection of pieces of language text [most often] in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research” (Sinclair 2004). Corpus data DOES NOT interpret itself: “it is up to the researcher to make sense of the patterns of language which are found within a corpus, postulating reasons for their existence or looking for further evidence to support hypotheses. Our findings from corpus linguistics are interpretations…” (Baker 18). The reason for this is that corpora cannot explain why. They can only demonstrate what is happening in a language. A corpus cannot tell you why certain patterns occur in language; as we have already discussed, this is where intuition comes into the picture. A corpus can only tell you what happens within it, and with statistics it can help you understand the propensity for those things to happen: “how we know what we know.” We need to marry an explicit model with intuition because we can know more than we can tell (McCarty 2).


So, that’s a pretty brief explanation of what corpus linguistics is to me. Stay tuned for future discussions on how corpus linguistics can be leveraged to do really amazing things in the library, as well as some best practices for creating corpora and analyzing them. For a bit of a teaser on this topic, go check out The Feral Librarian’s blog post “Beyond Measure: Valuing Libraries.” Corpus linguistic approaches were used to generate the data she shares on mining acknowledgements as a way to measure impact.


Until next time…


