McEnery, Tony & Andrew Wilson, 1996:

Corpus Linguistics.

Edinburgh: Edinburgh University Press. Pp. x + 209. (Edinburgh Textbooks in Empirical Linguistics.)

ISBN 0 7486 0482 0 (paperback) £14.95 / 0 7486 0808 7 (hardback) £40.

reviewed by

Zoë Boughton

Department of French Studies, University of Newcastle Upon Tyne, NE1 7RU.

Corpus Linguistics is presented on its back cover as 'the first undergraduate course-book for the teaching of a corpus-based approach to language and linguistics'. As one text in a series of Edinburgh Textbooks in Empirical Linguistics, it purports to be an accessible and clearly written introduction for students approaching this field of empirical linguistics for the first time. I read the book not as a specialist in this area, but as someone comparable to an intended reader, having only a vague idea of what corpus linguistics is. Hence I shall attempt to evaluate the text according to the criteria which it sets out to meet: accessibility, clarity and practicality.

In the first chapter, corpus linguistics is introduced as a methodology and situated within the overall programme of linguistic research. It is suggested that all linguistics before the 1950s (that is, before the advent of Chomsky) was essentially corpus-based. There follows a clear and reasonably detailed discussion of Chomsky's main criticisms of corpus-based methodologies, which led to this type of empirical linguistics being all but discredited for many years. The authors go on to present the case for using corpora and show how modern corpus linguistics has sought to overcome some of the rationalist objections associated with the Chomskyan revolution. Thus the basic conflict between empiricist and rationalist methodologies is introduced in a clear and fairly balanced way; there is a natural bias towards the empirical, as the book is concerned with corpora, but the need for both introspection and observation is stressed.

In the second, long and rather technical chapter, the reader is introduced to what a corpus actually is and what it consists of. The corpus is defined here as a body of text which is sampled to be maximally representative of the variety being studied, usually of finite size (some are 'open-ended') and in machine-readable form so that it can be stored and manipulated by a computer. A further distinction is then drawn between unannotated and annotated corpora, the latter being enhanced by the addition of encoded linguistic and extra-linguistic information. Several different types of text annotation and encoding are then presented, illustrated and discussed: extra-textual / 'encyclopaedic' information, orthographic representation, part-of-speech tagging, lemmatisation, parsing, semantic tagging, anaphoric annotation, prosodic transcription, and problem-oriented tagging. The chapter ends with a short discussion of multilingual corpora and their main applications.

The third chapter addresses quantitative issues in corpus linguistics. The relationship between quantitative and qualitative approaches to corpus analysis is briefly examined, followed by a discussion of the importance of representativeness of samples. Some of the main quantitative methods used in working with corpora are then introduced: frequency counts, proportions, significance testing (chi-square), testing for significant collocations (mutual information, Z-score) and multivariate techniques (factor analysis, mapping techniques, cluster analysis, loglinear models). The details of the statistical procedures are not dealt with, as this is not the authors' aim. They do, however, provide a detailed guide to further reading for those who wish to investigate.
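
For readers who, like me, find such measures easier to grasp from a worked example, two of the methods named above might be sketched as follows. This is only an illustration using the standard textbook formulas for the Pearson chi-square statistic on a 2x2 table and for pointwise mutual information of adjacent words; it is not taken from the book, and the function names and the toy sentence are my own assumptions.

```python
import math
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    No continuity correction is applied. The cells might hold, say,
    counts of a word and of all other words in two corpora.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator


def pointwise_mutual_information(tokens, w1, w2):
    """Mutual information (log base 2) for w1 immediately followed by w2.

    Probabilities are simple relative frequencies in the token list.
    Returns negative infinity if the pair never occurs.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(w1, w2)]
    if joint == 0:
        return float("-inf")
    p_joint = joint / (n - 1)  # there are n - 1 adjacent pairs
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_joint / (p_w1 * p_w2))


# Toy data invented for illustration only.
tokens = "strong tea and strong coffee and weak tea".split()
score = pointwise_mutual_information(tokens, "strong", "tea")
statistic = chi_square_2x2(10, 20, 30, 40)
```

A positive mutual information score suggests that the two words co-occur more often than their individual frequencies would predict; the chi-square statistic would still need to be compared against a critical value for one degree of freedom before any claim of significance is made.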

Chapter 4 illustrates the importance of corpora as sources of empirical data in many areas of language study, including speech research, lexical studies, grammatical theory, semantics, pragmatics, sociolinguistics, stylistics, language teaching, historical linguistics, dialectology and psycholinguistics. In each case the authors briefly explain why corpus data are important and how their use may aid progress in that particular field; they also provide concrete examples of this from research in each area.

Chapter 5 is devoted to corpora and computational linguistics, an area not covered in the previous chapter as it requires a more in-depth discussion. Hence we are given a clearly written general overview of the use of corpora in four main areas of natural language processing (NLP), namely part-of-speech analysis, automated lexicography, parsing and machine translation. It emerges that the most common use of corpora in computational linguistics is in disambiguation of various kinds.

The sixth chapter, a case study, is a practical demonstration of how corpora can be used to investigate a particular linguistic hypothesis. The authors present a clear, step-by-step account of the method used to examine the nature of a sublanguage, represented in this case by a corpus of IBM computer manuals. First of all, terms are defined and a hypothesis formulated; the corpora to be used are then presented and their annotation and manipulation are described. Consideration is given to the validity and implications of any results of the study, and then the process of analysis of the corpora is shown. Test results are presented and interpreted in a clear and systematic way, and conclusions are drawn and summed up. Though the case study is interesting in itself, the chapter is more concerned with illustrating the method of investigation using corpora than with the actual subject and findings of the study.

In the short final chapter, the authors summarise the past and present state of corpus linguistics, predict an increasing synthesis of rationalist and empiricist approaches to language study and speculate as to how corpora will continue to develop and adapt with respect to size, international concerns, scope and evolving computer technology.

Chapters 1-6 each end with a number of study questions (usually 3-4) and a fairly detailed guide to further reading. Some of the questions are comprehension-based and some are practical corpus-based exercises; all reinforce ideas raised in the text and could usefully form the basis of seminar work. Appendix C provides suggested solutions to the exercises on chapters 1, 3, 5 and 6.

The glossary, at just over two pages long, could perhaps be expanded; why, for example, is dependency grammar (p.46) glossed when functional grammar, in the same sentence, is not? This ties in with another problem, namely that the authors do not state what prior knowledge they assume on the part of the reader. While some statistical terms are explained in the text, others are not, nor are they glossed (e.g. p.64 sample, population; p.66 standard deviation, tolerable error). It seems obvious that the reader is assumed to have some knowledge of linguistics, statistics and IT, but it would be helpful to have pre-/co-requisites stated explicitly.

Appendices A and B give details of corpora mentioned in the text and of important corpus manipulation software respectively. There is a kind of virtual appendix located at the Web site associated with the book, where electronic resources available include samples of parsed corpora, samples of part-of-speech tagged texts and various corpus-related links; however, the concordancer mentioned on p.153 did not appear to be included.

To get the most out of Corpus Linguistics as a textbook, access to corpora, computing facilities and appropriate software would be crucial. Yet on the whole, I feel the book does meet its own criteria: it is an accessible introduction, and where details are lacking we are referred to further reading and other texts in the series; it is clearly written and well-structured, with summaries and good sign-posting; and it is practical in that at every turn readers are encouraged to get stuck into corpora for themselves.

