The Relevance of Corpora to German Studies


Bill Dodd

University of Birmingham



(received October 1999)

  1. Introduction
  2. Some key terms
  3. Adopting and adapting work on English
  4. Semantic prosody
  5. Descriptive language studies: lexicography and grammar
  6. Language teaching and learning
  7. Translation studies
  8. Critical language studies
  9. Literary studies
  10. The impact of language corpora on our thinking about language
  11. Critique of Saussure and Chomsky
  12. The essays in The relevance of corpora to German studies
  13. Bibliography

1. Introduction

The volume featured in this report (Dodd 2000) presents examples of recent work in German studies by English-speaking scholars working on computerized text corpora of German. To my knowledge, this is the first volume of essays in English devoted to corpus work in German studies. By and large, monographs and collections of essays on language corpora today are dominated by work on English.1 The essays collected here would not normally be found between the same covers. In traditional academic terms, they make rather unusual bed-fellows. However, the traditional compartments into which we are accustomed to put the different aspects of what we collectively do as Germanists have been set aside in this book in order to focus on the rapidly expanding applications of computerized German-language corpora across the spectrum of the discipline as a whole. The common ground for these essays lies in their exploitation of machine-readable text and their commitment to a set of methods and principles which have come to be associated with ‘corpus linguistics’. All the essays in this book are concerned with empirically examining authentic texts or collections of texts, including literary prose, medieval texts, newspaper articles, and texts belonging to a particular register (such as legal documents) or realm of discourse (such as the language of business and management). Some are specifically concerned with language-learning applications, whilst others have a more traditional research orientation. The majority, perhaps inevitably, deal with written language; one, however, reports on a corpus of spoken German. Together, they illustrate the wide range of corpus-related work now being done across the spectrum of German studies, and the growing importance of text corpora to teaching and research. Ten years ago a book such as this would have been unthinkable. 
Today, no one can seriously doubt that corpora of German will play an increasingly influential role as computer-readable texts of all kinds become widely available.

Constructing large corpora is still beyond the means of most individuals and indeed most institutions. The main sources for several of the contributions in this volume are the large text corpora of German held at the Institut für deutsche Sprache (IdS) in Mannheim, which currently run to more than two hundred million words. The intensive work for most of the studies in this book which use the IdS corpora has been done by scholars visiting the excellent research facilities in Mannheim. However, a selection of these corpora can be browsed free of charge via the World Wide Web or, by arrangement with the IdS, the full set can be investigated via Telnet, in both cases using the IdS in-house software COSMAS.2 Moreover, doing corpus-based work does not necessarily mean that one is restricted to corpora created by large research institutions. Four of the studies in this book are based on relatively small corpora specially constructed by academics and/or postgraduate students at universities in the United Kingdom and the United States (Aston, Birmingham, Central Lancashire, and Brigham Young). These have been created by scanning in text, entering transcribed spoken text, or transferring text which was already electronically stored. Major developments in corpus construction, however, are often joint enterprises, for example the collaboration between Collins and the University of Birmingham to create the COBUILD project,3 and the parallel corpus of European languages constructed by an EU-funded consortium under the LINGUA initiative (reported on in this volume), which runs to several hundred thousand words.4

‘Corpus linguistics’, not surprisingly, has been taken up mainly by colleagues with a background in linguistics, and because of this there may be a perception that it has little in common with, or, worse, is somehow inimical to, the critical research traditions in literary and cultural studies. In English Studies, where much of the pioneering work in text processing has been done, such attitudes can still be found, so it would be surprising if they were not also common amongst Germanists. And yet some of the early applications of this technology were in the field of literary studies. ‘Paper-form’ concordances of German literary works began to appear in the late 1960s (e.g. Wisbey 1968), and cover, for example, the Luther Bible (‘Große Konkordanz’, 1979), Wittgenstein’s Philosophische Untersuchungen (McKinnon 1972), Trakl’s poetry (Wetzel 1971), and Kafka’s Der Prozeß (Speidel 1978). Few literary scholars would dispute that these early concordances provided a valuable research tool. A new generation of interactive editions of literary ‘classics’ on CD-ROM is now extending the possibilities offered by these early concordances: for example, Goethe’s Die Leiden des jungen Werther (Goethe 1995) and Kafka’s Die Verwandlung (Kafka 1997). Texts on CD-ROM usually have some kind of keyword search facility for finding the next occurrence of a particular word, and a text-export facility enabling marked text to be exported to a text file which can be investigated by a concordancer.5 The increasing availability of literary works in electronic form provides a research tool much more versatile and therefore more powerful than the early paper-form concordances.

2. Some key terms

In this Introduction I will outline some of the main implications of language corpora, and in particular the importance of work already done on English. There are now several helpful introductions to the field, and I will focus here on the work of John Sinclair and Michael Stubbs, the latter of whom has also done some work on German.6 (For a more detailed set of definitions see, for example, Sinclair (1991: 169–76).)

A corpus is a ‘body’ of naturally produced language, selected according to some design and stored in machine-readable form. It can be investigated by software programs such as concordancers, which typically produce a KWIC (key word in context) file or concordance in which the key word (or node) appears in the centre of the line, as Figure 1 shows. The stretch of language preceding the node is its left co-text; the stretch following the node is its right co-text. These co-texts contain the immediate and less immediate collocates of the node, enabling the study of collocation, ‘the occurrence of two or more words within a short space of each other within a text’ (Sinclair 1991: 170). Sinclair places collocation, in terms of rank, between ‘independent’ word-meaning and ‘dependent’ phrase-meaning: ‘In between these two fixed points is collocation, where we see a tendency for words to occur together though they remain largely independent choices’ (Sinclair 1991: 71). The KWIC file in Figure 1, taken, like all the KWIC files used in this Introduction, from the IdS Bonner Zeitungskorpus (BZK),7 has been sorted alphabetically by the first word to the right of the node. In this particular file, a distinctive patterning also appears one position to the left of the node. With a single exception (line 12), the distinction between Vergleich mit and Vergleich zu correlates with the class of word preceding the noun. The fact that the phrase im Vergleich zu is found thirty-four times in three million words, and im Vergleich mit only once, tells us that both forms are attested but that their distribution is very different. We cannot ignore the existence of the marginal pattern, but we can quantify the frequency of its occurrence relative to the more frequent, ‘normal’ pattern. We might be tempted to say that we have discovered a general feature of the language; however, we would need to look in other, larger corpora before we could be reasonably confident of such a statement.
At the very least, the concordancer has enabled us, or obliged us, to consider empirical evidence.
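
The way a concordancer builds and sorts a KWIC file can be illustrated with a minimal sketch in Python. It is not the COSMAS software and makes several simplifying assumptions: the function names (kwic, sort_by_right, format_line) are hypothetical, the tokenizer is a bare regular expression, and the sample sentences are invented for illustration rather than taken from the BZK.

```python
import re

def kwic(text, node, span=4):
    """Collect (left co-text, node, right co-text) for every hit of `node`."""
    tokens = re.findall(r"\w+", text)  # crude tokenizer: runs of letters/digits
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            hits.append((left, tok, right))
    return hits

def sort_by_right(hits, position=1):
    """Sort alphabetically by the word `position` places right of the node."""
    def key(hit):
        right = hit[2]
        return right[position - 1].lower() if len(right) >= position else ""
    return sorted(hits, key=key)

def format_line(hit, width=40):
    """Right-align the left co-text so that the node lines up in one column."""
    left, node, right = hit
    return f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"

# Invented sample text, not corpus data.
sample = ("Im Vergleich zu 1980 stieg der Umsatz. Ein Vergleich mit dem Vorjahr "
          "zeigt das. Im Vergleich zu Europa ist das wenig.")
lines = sort_by_right(kwic(sample, "Vergleich"))
for hit in lines:
    print(format_line(hit))
```

Sorting on the first word to the right groups the mit and zu lines together, which is what makes the patterning to the left of the node visible at a glance.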

The computer can sort the file in various ways, for example alphabetically by the first or second word to the right. In this way, recurring patterns of collocation can be captured and made visible. The computer can also record the frequency of occurrence of a given item as a node or as a collocate of a node, as well as the relative frequency of two collocating items with respect to one another. The number of words to the left or right which are considered to contain significant collocations is known as the span. A span of about four words either side of the node is commonly used for English, though there is no reason for this orthodoxy to be taken over in work on German, and indeed a larger span is used on occasion for English. Although corpora are commonly described as consisting of so many ‘words’, corpus linguists distinguish between types and tokens. For example, fifty instances of und in a text are counted as fifty tokens of the same type. Modern software can calculate type-token ratios in a given text or corpus (for the above example, the type-token ratio is 1:50, or 2 per cent). This information can provide an insight into the characteristics of a particular text or set of texts, particularly when we compare the findings with those from another text or set of texts. Some corpora, especially those built for grammatical analysis, are tagged, that is to say some or all items are assigned to a computer-readable category, most typically according to a part-of-speech classification. This exercise, now increasingly automated, makes possible the study of colligation, the patterns in which grammatical categories combine. Software for personal computers is constantly being developed, and a modern corpus tool like Mike Scott’s WordSmith Tools can perform many sophisticated tasks such as generating word frequency lists and collocation frequency lists. An example of a word frequency list can be found in Peter Roe’s contribution to the volume.8
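
The notions of span, collocate frequency, and type-token ratio can be made concrete in a short Python sketch. It does not reproduce how any particular corpus tool performs these counts; the function names and the sample phrase are assumptions introduced for illustration only.

```python
from collections import Counter
import re

def tokenize(text):
    """Lower-cased word tokens; a deliberately crude tokenizer."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def collocates(tokens, node, span=4):
    """Count every word found within `span` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Fifty tokens of one type, as in the und example: ratio 1:50.
ttr_und = type_token_ratio(["und"] * 50)

# Invented phrase, used only to show the effect of a span of one word.
tokens = tokenize("im Vergleich zu Europa und im Vergleich zu Asien")
near = collocates(tokens, "vergleich", span=1)
print(ttr_und, near.most_common())
```

Widening the span argument is all that is needed to test whether a larger window captures more of the relevant collocates in German.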

A corpus is not a random collection of texts.9 Its construction is planned according to some design to produce a body of texts which are in some way representative of, for example, a particular field and/or time. A corpus which aims to reflect the range of usage in English, or German, must not only be very large but be designed to reflect, for example, different kinds of spoken and written language, and regional varieties, in a controlled proportion. Corpora can be historical or contemporary. Having more than one corpus of a language makes it possible to examine the frequency and distribution of particular words, collocations, or other features across different corpora as well as within the same corpus. Comparing a large, general control corpus, for example, with a corpus drawn from a particular register of the language, will help to highlight the specific features of that register (as well as the extent of shared patterning). Parallel corpora contain, in separate compartments, or sub-corpora, original texts and their translations. By aligning the source text and its translation, it is possible to study translation techniques (see below). Comparable corpora, on the other hand, contain texts in different languages which are related in subject matter or genre, for example, but are not translations (see Teubert 1996).10 Much work has gone into developing specialized or domain-specific corpora, which are used for investigating the language of particular defined discourse areas (such as microbiology, European legislation, or learners’ output in a foreign language). This kind of corpus work, focused on language for specific purposes (LSP), is perhaps the most important area not to be represented in this volume.
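
One simple way of comparing a register-specific corpus with a larger control corpus is to normalize word frequencies by corpus size and then look for words that are markedly over-represented in the smaller corpus. The Python sketch below illustrates this under stated assumptions: the function names, the threshold of 2.0, and the two miniature token lists are all invented for the example, and real studies would normally use a proper statistical measure rather than a raw ratio.

```python
from collections import Counter

def relative_freqs(tokens):
    """Frequency of each word per million tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}

def overrepresented(register_tokens, control_tokens, min_ratio=2.0):
    """Words whose relative frequency in the register corpus is at least
    `min_ratio` times that in the control corpus. Words absent from the
    control corpus are reported with an infinite ratio."""
    reg = relative_freqs(register_tokens)
    ctl = relative_freqs(control_tokens)
    result = {}
    for word, freq in reg.items():
        base = ctl.get(word)
        if base is None:
            result[word] = float("inf")
        elif freq / base >= min_ratio:
            result[word] = freq / base
    return result

# Invented miniature 'corpora', far too small for real conclusions.
register = ["vertrag", "der", "vertrag", "gilt"]
control = ["der", "vertrag", "der", "hund", "der", "baum", "der", "tag"]
keywords = overrepresented(register, control)
print(keywords)
```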

3. Adopting and adapting work on English

As I have already noted, the major theoretical and practical advances in harnessing the text-processing power of the computer have been made by scholars and teachers working on English, and in particular on English as a Second or Foreign Language — though the pioneering work of the Germanist Roy Wisbey deserves special mention here (e.g. Wisbey 1971). Today, Germanists need to ask whether the features and positions which have been elaborated by the international ‘English language corpus community’ can be taken over ready-made for work on German. I would suggest that the current state of knowledge regarding the ‘large’ issues of a methodological and theoretical nature can largely be adopted when transferring from English to German, though clearly, differences in grammatical structure need to be acknowledged and practical solutions sought for the specific difficulties these raise for analysing a corpus of German. Here, new procedures need to be devised by corpus analysts and, in particular, software designers. The most immediate problems are posed by the fact that German has a more complex morpho-syntactic system than English. Searching for all occurrences (singular and plural) of a noun or all the grammatical forms of a verb is a relatively straightforward matter in English (where a verb can have as few as three grammatical forms: hit, hits, hitting), but a much more complicated task in German. Designing lemmatization software, which will, for example, group the forms Haus, Hause, Häuser and Häusern, or schlaf, schlafe, schlafen, schlaft, schläfst, schläft, schlief, schliefst, schlieft, schliefen, schliefe, and geschlafen as different grammatical forms of the same lexeme, is a complex but necessary task for a language like German. (For an example of such software in use, see Pik Gupta’s contribution to this volume.) 
The discontinuous realization of some important grammatical constituents, evident in word forms such as ge+schlaf+en, be+gnad+ig+en, also poses problems at clause and sentence level, most obviously in the distance which frequently separates the constituents of the verbal group in longer clauses and sentences. A span of four words will rarely be enough to capture these important syntactic relationships, and the same is true of complex structures such as the extended adjectival attribute, where important adjectival collocates may be several words removed from the noun they qualify. In such cases, it may be necessary to cast the net wider when looking for collocational evidence in German. And there are other problems. Average word length and sentence length in German texts are reputedly greater than for the equivalent text-type in English. (This is actually an impressionistic statement, which could be tested for different types of text.) Such differences could present a problem if we want to align a text in one language with a translation of this text into the other language, especially if the translation uses more, or fewer, sentences than the original. The use of the definite article to mark case/gender relations in German means that we have a problem if we want to examine the use of definiteness/indefiniteness in German, since abstract, non-count nouns such as Zeit, Geld and Liebe, unlike their English equivalents time, money and love, are typically accompanied by a definite article even when they are semantically abstract. Where a concordance of English time would quickly isolate uses of time from those of the time, this time, the times and so on, one would not expect an equivalent file for Zeit in German to reflect these sense distinctions so clearly. Seemingly minor differences in grammatical structure can have large implications. 
For example, the fact that English it has no direct equivalent in German (which generally insists on grammatical (gender) rather than semantic agreement, using er, sie, and es) means that while it is relatively easy to get an impression of how often an English text or corpus contains pronominal reference to things rather than to people, this is a daunting task for German. Yet such information can be important for the study of text-types and registers (Biber 1998: 73—5).11 Clearly, for some purposes, language-specific strategies (and software) need to be devised for German. However, important though the differences between the languages are, they should perhaps not be exaggerated. ‘Wildcard’ searches (typically using the ‘asterisk’ or ‘ampersand’ operator) will find morphemes and other strings at sub-word level just as easily in German as in English, and a great deal of important collocational evidence in German can be found using the kind of collocational spans commonly used for English. On the whole, then, and without wishing to understate the importance of these differences, the news for the late arrivals from languages other than English is generally positive: much important practical and theoretical work has already been done and much of the time Germanists will not need to invent their own wheel. Today, the debates within the corpus community are a sophisticated and many-faceted reflection of modern thinking about the nature of language, literature, and society, which will be readily recognized by Germanists interested in these same broad questions.
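
The difference between grouping word forms under a lexeme and searching with a wildcard can be shown in a few lines of Python. This is only a sketch: the lookup table is hand-made from the forms listed above and stands in for real lemmatization software, which would need morphological rules or a full lexicon, and the function names and sample tokens are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Hand-made table built from the forms cited in the text.
LEMMA_TABLE = {
    "HAUS": ["Haus", "Hause", "Häuser", "Häusern"],
    "SCHLAFEN": ["schlaf", "schlafe", "schlafen", "schlaft", "schläfst",
                 "schläft", "schlief", "schliefst", "schlieft", "schliefen",
                 "schliefe", "geschlafen"],
}
FORM_TO_LEMMA = {form.lower(): lemma
                 for lemma, forms in LEMMA_TABLE.items()
                 for form in forms}

def group_by_lemma(tokens):
    """Group the tokens that the table recognizes under their lexeme."""
    groups = defaultdict(list)
    for tok in tokens:
        lemma = FORM_TO_LEMMA.get(tok.lower())
        if lemma is not None:
            groups[lemma].append(tok)
    return dict(groups)

def wildcard_search(tokens, pattern):
    """Case-insensitive search with '*' standing for any string."""
    return [t for t in tokens if fnmatchcase(t.lower(), pattern.lower())]

# Invented token list.
tokens = ["Er", "schlief", "im", "Haus", "und", "hat", "gut", "geschlafen",
          "verursacht", "Verursacher"]
print(group_by_lemma(tokens))
print(wildcard_search(tokens, "verursach*"))
```

The wildcard finds any string beginning with verursach, including the noun Verursacher, whereas the table relates schlief and geschlafen, which share no common prefix that a wildcard could exploit.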

4. Semantic prosody

The study of collocation has led to new insights into the existence of a particular kind of collocational behaviour characteristic of some words. The concept of semantic prosody goes back to observations by Sinclair, for example on the tendency of the verb HAPPEN to be associated with ‘unpleasant things’ (1991: 112). Stubbs demonstrates that the English lemma CAUSE (verb and noun) has ‘a strongly negative prosody’: ‘The most characteristic [collocates] include accident, concern, damage, death, trouble’. He continues:

It only rarely occurs with ‘positive’ collocates: cause for concern is very much more common than cause for confidence. Although many words seem to have such negative prosodies, some words, such as PROVIDE, have positive prosodies. For example, causing work usually means bad news, whereas providing work is usually a good thing. Typical collocates of PROVIDE are from the semantic fields of care, food, help, money and shelter. The most frequent object nouns are aid, assistance, care, employment, facilities, food, funds, housing, jobs, money, opportunities, protection, relief, security, services, support, training. (Stubbs 1996: 173—4.)

Other English expressions discovered to have a similar profile include the phrasal verb set in (Sinclair 1991: 73–5) and utterly (Louw 1993). It is not unknown for sceptical native speakers of English to object that they knew this already. They are almost certainly not being quite honest with themselves. What Stubbs describes may be recognizably English usage, but it is doubtful whether a native speaker could volunteer such information, or, if asked, would come up with such a detailed list of the most frequent or representative patterns. Only a quantitative approach to a record of naturally occurring language can give us such information with a degree of objectivity not available to our native-speaker intuitions. The observation and description of such prosodic features has only really become possible with the advent of corpora. Louw describes a semantic prosody as ‘an aura of meaning with which a form is imbued by its collocates’ (Louw 1993: 157), and argues that this phenomenon, where it is found, is so strong that breaks in the prosody are indicative either of irony or of an unsuccessful attempt by speakers and writers to conceal their true feelings.

This raises some fascinating questions. Do semantic prosodies really exist? The evidence from English strongly suggests they do. So do they exist in German? If so, do they correspond to those found in English or are they language-specific? The first fifty lines from a concordance of verursach* (Figure 2) suggest that at least some semantic prosodies do ‘translate’ across language divides. Although some of the KWIC lines do not contain enough context to show the object of the verb (underlined), all those that do suggest that this German verb has a strong, perhaps exclusive, tendency to be used when we want to indicate a consequence which is perceived as unpleasant. Line 29, however, appears to be a counter-example; but if we ask the concordancer to show the whole context it turns out that the cloud-free skies in this instance are harbingers of unwelcome weather conditions:

Eine Hochdruckzone, die vom Ostatlantik über Mitteleuropa zum Schwarzen Meer reicht, bestimmt weitgehend unser Wetter. Ein zunächst auch in höheren Luftschichten wirksamer Hochkeil verursacht größtenteils wolkenfreies Wetter. In der Folge werden jedoch schwache Störungen den Norden der DDR streifen. Die Lufttemperaturen erreichen vielfach Werte um 30 Grad Celsius, beim Übergreifen der Störungen gehen sie vorübergehend auf 25 Grad zurück. Infolge des trockenen Wetters nimmt die Waldbrandgefahr zu.

The ‘span’ needed here to comprehend the full implications of the prosody ‘wolkenfreies Wetter verursachen’ is actually several sentences. This particular textual relationship would probably pass unnoticed were it not for the data-driven formulation of a theory which prompts the analyst to search for more contextual information. Once again, work on English provides the impetus for parallel work on German. An obvious starting point would be to investigate German equivalents of English words which have been shown to behave in this way.12 Further questions suggest themselves: To what extent are semantic prosodies language-specific? Are they ‘universal’? Where they occur, do they tend to be negative rather than positive? How strong is their presence? Are there absolute, exceptionless prosodies, or are we dealing with (strong) tendencies in collocation which can in principle be quantified? In what kind of contexts are such prosodies ‘violated’? Evidently, there is a cultural phenomenon here which is so ingrained in our use of language that we barely notice it until we are confronted by the empirical evidence.

5. Descriptive language studies: lexicography and grammar

The advent of corpus-based studies of English has led, in Sinclair’s words, to ‘the demise of cherished methods and the wholesale revision of many cherished publications’ (1991: 5). This process is already well advanced in English studies, where large corpora such as the COBUILD Bank of English and the British National Corpus at Oxford have revolutionized reference works of English. Almost certainly, the present situation in English presages the not-too-distant future in related disciplines. Corpus evidence is consulted in Durrell’s revised editions of Hammer’s German Grammar and Usage (Durrell 1996: xv and xvii), probably the most enlightened English-language reference work of German to date in this respect,13 and a new generation of reference works is beginning to use corpus evidence.14 The use made of corpora is a matter for debate: some corpus theorists (such as Sinclair) appear to want to banish intuited examples altogether, whilst others aim to strike a judicious balance between attested and intuited examples. This debate between ‘purists’ and ‘pragmatists’ will doubtless have an impact on future generations of reference works. Also, although good monolingual and bilingual dictionaries of German already offer ‘idiomatic’ contextual information, this is typically implicit rather than explicit. As long as this is the case the evidence provided in dictionaries, for example, will be regarded by some corpus linguists as in principle incomplete and suspect. Does it really reflect the typical, the most frequent, the most probable usage of the word?

The shortcomings of the introspective approach are exposed by Luise Pusch in her entertaining analysis of the entries for the letter ‘A’ in the 1970 edition of the Duden Bedeutungswörterbuch (Pusch 1984). The title of her essay, ‘Sie sah zu ihm auf wie zu einem Gott’ (‘she looked up to him as if to a god’), is one of the example sentences in the entry for aufsehen, and one of scores of examples of sexist bias implicit in the collocations and contexts created in these (invented) example sentences. Pusch goes so far as to characterize the dictionary as a clichéd novel in which the male characters play the dominant roles while the female characters either play out domestic roles or act as temptresses and tomboys. Her principal charge against the lexicographical team is misogyny (‘Frauenverachtung’), but her indictment also specifies: ‘Mief, Spießigkeit, Männlichkeitswahn, Pennälermentalität, Obrigkeits- und Schubladendenken’ (‘small-mindedness, bourgeois complacency, obsession with masculinity, schoolboy mentality, hierarchical and stereotyped thinking’, p. 144). Pusch makes a strong case for the dictionary’s underlying bias, citing many examples of sexist stereotyping (e.g. abkehren: ‘Sie kehrte den Schmutz von der Treppe ab’; auskleiden: ‘Sie kleidete sich aus’; Angst: ‘Mit großer Angst erwartete sie seine Rückkehr’; ängstlich: ‘Sie war schon immer sehr ängstlich’). The underlying stereotyping is traced further in the entries for words with no ostensible connection to sexism (e.g. abpressen: ‘Die Angst preßte ihr den Atem ab’). How representative or typical such collocations are of actual language in use can now be tested against corpus evidence. A quick search of the BZK produces nineteen instances of ängstlich*, for example, none of which collocates with the verb sein to produce the structure X ist ängstlich. Only two contexts clearly refer to a female being or appearing anxious (e.g. lächelte sie ängstlich), but then there are two with male referents (e.g. 
er sah sich ängstlich um). There seems to be a pattern with the word used adverbially, as in ängstlich bemüht/bestrebt/verfolgt, and examples of institutions (e.g. Mitgliedsstaat, Gericht, Gewerkschaften) being or acting ängstlich. These findings can hardly be regarded as definitive, but they already reveal the rather arid fictional character of the invented example Sie war schon immer sehr ängstlich. As Pusch demonstrates, such ‘intuited’ examples also come with their own undeclared ideological baggage. Modern lexicography, increasingly informed by empirical principles, is better equipped to avoid such pitfalls, though equally the same principles dictate that where sexist collocations are attested, these should also be recorded.

Of course, collocation is not a completely new concept in German linguistics. The principles underlying the Duden Stilwörterbuch are collocational and empirical (Drosdowski 1970: v—xiv), in contrast to the ‘ossification’ (‘Erstarrung’, p. v) of traditional dictionaries. Nevertheless, the formal and semantic constraints on combinatorial possibilities, and the criteria for selecting examples, are generally not made explicit. The entry for Vergleich, for example, includes the following examples:

[…] dieser Roman hält keinen Vergleich mit den früheren Werken des Schriftstellers aus; im V. zu/(auch:) mit seinem Bruder ist er unbegabt.

The KWIC file in Figure 1 actually bears out the information given here, but provides much more detail, not least by showing that the phrase im Vergleich zu/mit accounts for the majority of all instances of the word, in one corpus at least, and by quantifying the preference for zu over mit. The fact that zwischen also collocates with Vergleich is not covered in the Stilwörterbuch but is captured in the concordance. Turning to the verb verursachen, we find that the Stilwörterbuch captures its semantic prosody, though implicitly:

verursachen <etwas v.>: hervorrufen, bewirken: das Unwetter verursachte große Schäden; Kosten, viel Arbeit, Lärm v.; er verursachte durch seine Bemerkung große Aufregung, Verdruß, Ärger; es verursachte große Schwierigkeiten, seinen Wohnsitz ausfindig zu machen; <jmdm. etwas v.> dieses Problem hat mir manches Kopfzerbrechen verursacht.

If we compare the information in this entry with the concordance for verursach* (Figure 2), we find that once again it is essentially accurate. The subject nouns Unwetter and Problem imply a negative semantics, as do the object nouns Schäden, Kosten, etc. That these nouns are well-chosen as typical subjects/objects of the verb is confirmed by our concordance. But the information in the Stilwörterbuch could be improved in three respects.15 First, a list of the most frequent subject and object collocates could be given, in descending order. On the limited evidence of the BZK this would promote Verlust/e (four occurrences) to the list of typical collocates. Second, an important generalization could be made with explicit reference to the existence and strength of the prosody, so that even apparently ‘positive’ collocations (such as wolkenfreies Wetter) can be predicted and explained. And third, the equivalence between verursachen and the verbs hervorrufen and bewirken could be qualified to point up the differences as well as the similarities in meaning and usage. An initial search of corpus evidence suggests that whilst negative prosodies are associated with both these verbs, they also occur with semantically neutral and even with semantically positive object nouns (see Figure 3) in a way not attested for verursachen. Eight of the forty-one instances of bewirk* in the BZK appear on the face of it to collocate with semantic positives, as do four of the twenty instances of hervorruf*:16

Especially where the prosodic evidence is not clear cut, there is need of a quantitative element in our descriptions of the language. Corpus data could enable a reference work such as the Stilwörterbuch to be developed into a more rigorous dictionary of collocations.17 (A new generation of corpus-based reference works for English, including dictionaries of collocations, descriptions of grammatical categories such as phrasal verbs, and ‘bridge-bilingual’ dictionaries, is already with us.18) In the meantime, the collocational ranges found in the Stilwörterbuch and other dictionaries, and in works on German semantics like Ernst Leisi’s classic study Der Wortinhalt (1975), provide excellent starting points for corpus-based, contextually sensitive lexical research which could include research into semantic prosody, irony, and metaphor.
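
The kind of quantitative element suggested here, a ranked list of collocates and a measure of how strong a prosody is, can be sketched in Python as follows. The proportions use the BZK counts reported above for bewirk* and hervorruf*; the list of object collocates is invented apart from the four occurrences of Verlust/e, and the function names are hypothetical. Deciding which collocates count as ‘positive’ remains a matter for the analyst, not the program.

```python
from collections import Counter

def ranked_collocates(collocates):
    """Return (collocate, frequency) pairs in descending order of frequency."""
    return Counter(collocates).most_common()

def share(part, total):
    """Proportion of `part` in `total`, as a value between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part / total

# Counts reported in the text for the BZK: instances judged to collocate
# with semantic positives, out of all instances of the search pattern.
positive_share = {
    "bewirk*": share(8, 41),
    "hervorruf*": share(4, 20),
}

# Placeholder object collocates of verursach*; only the figure of four
# for Verlust/e comes from the text, the other counts are invented.
objects = ["Verlust"] * 4 + ["Schäden"] * 3 + ["Kosten"] * 2 + ["Lärm"]
ranking = ranked_collocates(objects)

for word, freq in ranking:
    print(f"{word:<10}{freq}")
for pattern, value in positive_share.items():
    print(f"{pattern:<12}{value:.1%}")
```

On these figures roughly one instance in five of bewirk* and hervorruf* has an apparently positive object, which is the sort of statement a dictionary of collocations could make explicit.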

6. Language teaching and learning

The principle of data-driven description and analysis leads naturally to the pedagogic concept of data-driven learning (Johns 1991, 1993). For Tim Johns, the leading exponent of this approach, the distinguishing feature is ‘the attempt to ... give the learner direct access to the data, the underlying assumption being that effective language learning is a form of linguistic research, and that the concordance printout offers a unique way of stimulating inductive learning strategies’ (Johns 1991). A variation on this idea is ‘reciprocal learning’, in which native speakers of two languages use each other as a resource in studying corpus-based data from both languages (Johns, forthcoming). What is innovative about this methodology is that it transforms the role of the teacher into that of a facilitator and adviser who is also a student and researcher of the language (even if he or she is a native speaker), and shifts the focus in the classroom from ‘language teaching’ to ‘language learning’. From a pedagogical perspective it is not necessarily the quality of the results obtained, but ‘the conscious process of framing and testing our theoretical assumptions’ that is formative (Jappy 1996: 148). Fernandez-Villanueva comments that the advantage of using a corpus of spoken language (the Freiburg Corpus at the IdS) to study the use of modal particles is that ‘it enables students to concentrate on an interpretative phase during which they get to perceive the function of these elements ... without having to confront their productive use immediately’ (Fernandez-Villanueva 1996: 92–3). One problem is of course the ‘chaotic’, uncontrolled nature of unedited data, which is suitable perhaps for only very advanced learners. For this category of student one can, for example, devise high-level research tasks moving backwards and forwards between data and reference works (Dodd 1997). 
For beginning and intermediate levels, however, the customization of corpus data (mainly involving careful selection of examples) requires great patience and skill of the teacher, but produces worthwhile materials. The kind of gapped exercise illustrated in Figure 4, based on the Contexts program developed by Tim Johns (Johns 1997), is in principle easy to produce from concordance files and is probably educationally more productive than a session of teacher-led instruction:

Students can be shown this file and asked to work out which word has been omitted from all contexts. In the process they can find out about the semantics and case government of the preposition gegen, and also the range of English equivalents for such contexts (e.g. against, (at) about, over, towards, on, compared with, and in exchange for). In addition to the semantic and grammatical information which can be gleaned or reinforced from working on such data, the student also encounters authentic contexts and collocational patterns. This, of course, can be a problem. As Brian Farrington remarks in his review of data-driven learning, ‘it is not uncommon to find that you have concocted an exercise that is so hard that you cannot do it yourself’ (Farrington 1996). The job of the teacher or materials designer is to ensure that the data are not too ‘raw’ for the level of ability of the student. If the context contains too many unfamiliar words, or too many complex or incomplete syntactic structures, the danger is that the student will not have sufficient knowledge to use the contextual information adequately. Customization is time-consuming, but necessary. A program such as Tim Johns’s Contexts provides an authoring frame to do precisely this.
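The kind of gapped exercise described above can be sketched in a few lines of Python. This is only an illustrative sketch and not a description of how the Contexts program works: the function name gapped_concordance, the context width, and the three sample sentences are all invented for the purpose.

```python
import re

def gapped_concordance(sentences, keyword, width=40, gap="_____"):
    """Return KWIC-style lines in which every occurrence of `keyword` is blanked out."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Keep up to `width` characters of context on either side of the gap.
            left = sentence[:match.start()][-width:]
            right = sentence[match.end():][:width]
            lines.append(f"{left:>{width}}{gap}{right}")
    return lines

# Invented sample sentences (assumption: these are not taken from any corpus).
sample = [
    "Wir treffen uns gegen acht Uhr am Bahnhof.",
    "Die Mannschaft spielte gegen den Tabellenführer.",
    "Er lehnte das Fahrrad gegen die Wand.",
]

exercise = gapped_concordance(sample, "gegen")
for line in exercise:
    print(line)
```

In practice the selection of sentences, rather than the gapping itself, is where the teacher's judgement about the level of the learner comes in.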

7. Translation studies

The nature and status of translation as a (sub-)discipline is changing fast under the influence of corpus-based methods, as a recent collection of essays (Laviosa 1998) demonstrates. For example, it no longer sees itself as merely a ‘sub-field of applied linguistics’, as Mona Baker explains:

translation is a unique form of linguistic and cultural communication, because it involves much more than simply getting to grips with the subtleties and patterning of source and target languages. Indeed, it is so unique and distinct a phenomenon as to merit being the object of an independent discipline: what we now know as translation studies. (Baker 1998: 480)

The relevance of corpora to translation studies is surveyed by Dorothy Kenny in her article on this topic in a recent encyclopedia of translation studies (Baker 1998a: 50—3), in which she observes that whereas other areas of corpus linguistics have traditionally been data-driven, ‘bottom-up’ in their approach, much recent work in corpus-based translation studies proceeds ‘top-down’. In this field, she notes, ‘theorists are interested in finding evidence to support abstract hypotheses’. Amongst hypotheses to be tested against corpus evidence one might mention the simplification hypothesis (that translations tend to simplify the propositional and structural complexity of the original), the explicitation hypothesis (that translations tend to add additional explanatory material, making explicit what was implicit in the source text), and what might be termed the normalization hypothesis (that translations exhibit a tendency towards the norm, for example by avoiding the extremes of register in lexical choices, or even ‘sanitizing’ the original). As a consequence of this theory-driven work, Kenny believes, ‘ongoing research in translation studies may lead to new ways of looking at corpora, just as corpora are already leading to new ways of looking at translation’. Kenny outlines a somewhat different corpus typology in this area from that outlined earlier in this Introduction, in that the term comparable corpus is used to denote ‘a collection of texts originally written in a language, say English, alongside a collection of texts translated (from one or more languages) into English’. A multilingual corpus, as defined by Baker (1995), is composed of ‘sets of two or more monolingual corpora in different languages, built up in either the same or different institutions on the basis of similar design’ (see also Lewis 1998). 
The definition of a parallel corpus is as explained above; it ‘consists of texts originally written in language A alongside their translations into a language B’. Kenny herself is engaged in a study of sanitization in literary translations from English to German (Kenny 1998), using a parallel corpus of English literary texts and their German translations (at the University of Manchester Institute of Science and Technology). Using the British National Corpus and the IdS corpora as control corpora, she focuses in particular on the translation challenges of semantic prosodies.

For translation between German and English, various types of corpus can be envisaged which would be useful research tools: a corpus of L1 texts and their translation (or translations) into the L2; separate sets of L1 texts from each language, which share some common features, e.g. in respect of text-type and historical context; a corpus of ‘natural’ L1 texts and an accompanying corpus of texts translated into the L1. There are important questions which such corpora could help us to answer (or formulate more adequately). These include: How do good translators do translation? Are there any specific or typical characteristics of translated texts, as opposed to ‘natural’ texts? How do two translations of a given text differ? The availability of large amounts of data in English and German in a suitable form (whether in ‘parallel’ or ‘comparable’ or ‘multilingual’ corpora) promises to transform translator training and research into translation, and indeed the experience of translation in many undergraduate programmes. There are now several alternatives to the traditional ‘grammar-translation’ approach, unchanged and unchallenged for decades in many university German departments, an approach which in practice more often than not focuses on a narrow range of privileged text-types and treats them as collections of grammatical and lexical features of the language, for whose explanation students are often reliant on an expert reader who is already well-versed in the text’s various (inter)textual, social, and historical particulars. There may be good reasons for retaining this model, but there is no good reason why it should continue to enjoy an unquestioned monopoly when technology gives us so many ways of accessing banks of ‘natural’ and translated texts across a variety of genres.

8. Critical language studies

Corpus techniques can contribute usefully to what might loosely be termed ‘critical linguistics’ — generally speaking, a discipline which exploits linguistic techniques to uncover institutional and ‘ideological’ factors underlying the choice of linguistic forms. Michael Stubbs, for example, insists that linguistics is a social science, since ‘social institutions and text-types are mutually defining’ (1996: 12). It follows from this that ‘textual analysis is a perspective from which to observe society: it makes ideological structures tangible’ (p. 21). Stubbs’s work is particularly interesting for the way his use of corpus data is informed by these principles. Amongst several practical case-studies contained in his book, for example, is a study of ‘semantic engineering’ in two speeches by Baden-Powell, one his final message to boy scouts, the other his final message to girl guides, illustrating how a relatively simple methodology can reveal the ideological nature of lexical and grammatical choices. Focusing on the occurrences in each text of the lexemes happy and happiness, Stubbs demonstrates what most modern readers of these speeches intuitively sense, namely that they enshrine linguistically a certain view of the sexes which now seems outdated or even offensive. He points out that Baden-Powell’s use of these words is in itself entirely conventional. There are no unexpected collocations. But the pattern of use differs in the two speeches. For example, the collocation make [someone] happy, which occurs six times in the speech to girls (the direct object being others or other people on four occasions, your husband once, and yourselves once) is not found in the speech to the boys, in which the collocates of happy are life, live, die, and be. Only one collocation in the speech to boys (give out happiness) ‘implies that other people are involved’ (Stubbs 1996: 88). 
The differences in lexical patterning, easily identified from a concordance, are related to a larger, institutional and ideological discourse.
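A comparison of this kind, between the collocates of one node word in two texts, might be sketched as follows. The sketch makes several assumptions: the function name collocates and the span of three words are arbitrary choices, and the two short texts are invented stand-ins, not quotations from the speeches Stubbs analyses.

```python
import re
from collections import Counter

def collocates(text, node, span=3):
    """Count the words occurring within `span` tokens on either side of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented stand-in texts (assumption: not the Baden-Powell speeches).
text_a = "Try to make others happy. Make your friends happy and you will be happy."
text_b = "Live a happy life. Be happy and die happy."

# Collocates of 'happy' that appear in text_a but never in text_b.
only_in_a = set(collocates(text_a, "happy")) - set(collocates(text_b, "happy"))
print(sorted(only_in_a))
```

A raw list of this sort is only a starting point; relating the differences to an institutional discourse remains an interpretative step.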

The concept of ‘politicized lexicography’ which follows from this ‘institutional’ approach to texts is framed by Stubbs with reference to Firth’s notion of ‘focal’ or ‘pivotal’ words, and to Raymond Williams’s (1976) notion of ‘keywords’ (Stubbs 1996: 165—72). These are eminent and eminently British patrons. Yet ‘politicized lexicography’ has if anything an even richer tradition in the German-speaking world — perhaps not surprisingly, given the German experience of fascism and Cold War division this century. One thinks of the tradition of political ‘Sprachkritik’ (language criticism) with such brilliant exponents as Karl Kraus, Bert Brecht, and Kurt Tucholsky. More recently, German linguistics has been attempting to accommodate social and political perspectives in the form of a ‘scientifically grounded language criticism’ (‘wissenschaftlich begründete Sprachkritik’, Wimmer 1982). Recent work in Germany includes a critical dictionary of ‘contentious words’ (Brisante Wörter, Strauß, Haß and Harras 1989) and a number of studies by Georg Stötzel and others on contested keywords (Kontroverse Begriffe, Stötzel and Wengeler 1995, cf. Böke et al. 1996). The authors of Brisante Wörter, which is based in part on IdS corpora,19 point out that a serious gap in traditional lexicography is the failure to register the ideological nature of the way words are used in particular discourses (ibid., p. 9f.). The introduction to Kontroverse Begriffe (pp. 1—17) elaborates a similar project, which attempts to write a contemporary history of the German ‘linguistic market’ (‘Sprachmarkt’, p.11) through the history of certain contested concepts and terms in German public discourse since 1945. Stötzel’s method is empirical, based on catalogued instances in the Rheinische Post in which the use of language is itself ‘thematized’ (p. 3) and implicitly or explicitly contested. 
The changeable and changing use of these vocabulary items, and indeed their power to constitute social reality and influence behaviour, Stötzel notes, proceeds from the arbitrariness of linguistic signs in the ideological marketplace. Only by situating the use of words within the particular historical discourse in which they are used is it possible to explain, for example, how a term such as Bildungskatastrophe can have semantically contrasting interpretations and partake in different discourses, signifying a shortage of teachers in 1964, and a surplus in 1982 (p. 12).

The use of electronic corpora, already evident in Brisante Wörter, has the potential to place the already well-established tradition of critical language studies in German (cf. also Good 1985, Townson 1992) on a new and more powerful footing. What corpus linguists like Stubbs have to offer here is an exemplary method and a series of case-studies demonstrating how even relatively simple techniques can produce impressive findings, and that ideological values are discernible not just in the more obviously ‘contentious’ words. Stubbs (1997: 157) insists that ‘even the most frequent words, especially in their typical, central applications, express strong cultural connotations’,20 illustrating the point with a corpus-based study of English care and German pflegen. It may well be that German linguists have something to offer in return, for example the carefully elaborated method and the findings of Stötzel’s lexically focused periodization of post-war German public discourse within a historically defined ‘linguistic market’. The prospect of these two traditions coming together and collaborating is particularly exciting.

9. Literary studies

Strictly speaking, an electronic version of a literary text does not of itself constitute a corpus, but it would clearly be perverse to insist on this demarcation dogmatically, since there is obvious common ground and literary scholars were amongst the first to see the benefits of concordances and other forms of computerized text analysis, for example for authorship studies. Nevertheless, it seems to be the case that in ‘language and literature’ academic disciplines there is invariably a divide between those interested in language and those interested in literature. Communication, let alone cross-fertilization, between the ‘two cultures’ tends to be rare. In view of this, colleagues in literary studies may be unaware of, indifferent to, or indeed hostile to the application of corpus-based techniques within their specialism. The position will no doubt be exacerbated by the use of the term ‘corpus linguistics’ to characterize the field as a whole, since this implies that literature is really a branch of linguistics. In important respects, of course, this is true — to the extent that literary scholars are interested in what Wolfgang Kayser famously termed ‘the linguistic artefact’ (Das sprachliche Kunstwerk). But literary scholars would be wrong to view ‘literary linguistics’ as a threat to their discipline, for example because some terms, such as ‘genre’, are reinterpreted within a broader linguistic typology. In reality, this poses no threat to literary studies. On the contrary, much is to be gained from work on text-types and discourse studies which can feed directly into literary criticism. 
An example which springs to mind is the increasing ‘ideological’ focus on the relationship between literary texts and particular dominant discourses of their time: recent work on Kafka, for example, has focused on how his texts reflect and refract the discourses of gender, ethnicity, and illness which significantly shaped the public discourse of his time.21 Such studies have as much interest as those in critical linguistics in working out theoretical and methodological principles which can establish objectively how such discourses are created, maintained, and challenged by means of lexical and grammatical choices. Collocation and frequency are also likely to be instrumental in illuminating the particular literary qualities of texts (what Jakobson called literaturnost), by enhancing our understanding of the ways in which language is employed in literary texts — the patterns of repetition and variation, conformity to and deviation from norms of usage — as well as how these features compare with those found in other literary and non-literary texts. Such studies could help to illuminate the ways in which, for example, themes, motifs, and narrative voice are organized by an author.

The broader, linguistic view of the literary work implies, amongst other things, that the study of the linguistic aspect of canonical texts should be incorporated into ‘literary linguistics’ — or, conversely, that concepts such as ‘genre’ and ‘stylistics’ should be extended to the study of all text-types irrespective of their aesthetic or genre qualities.22 This suggestion may not find favour amongst some literary scholars, but viewing literary texts as exemplars of particular discourse and text-types, alongside others, is simply an acknowledgment of a fundamental truth. The existence of literary texts in machine-readable form raises the prospect of a new, more intense collaboration between linguistics and literary studies. The corpora available in German are already large enough to allow scholars to begin comparative studies of literary and non-literary ‘control’ corpora.23

10. The impact of language corpora on our thinking about language

Typically, the initial focus of corpus analysis is the individual word or morpheme. In a computerized databank of language, the word (whatever its problematical status in linguistics) is a readily identifiable unit, being simply a string of characters bounded by spaces or followed by a punctuation mark. The implications of corpus studies, however, go far beyond the level of lexical description. It is not an overstatement to say that the advent of language corpora has begun to change our view of language quite dramatically, and particularly the way we approach grammar, lexicography, the study of texts, and the position of language in society. The leading practitioners have in fact made some important contributions to contemporary debates about the nature of language, with some large implications for all language-based disciplines. Probably the most frequent tenet found in the literature is the need for empiricism on the grounds that native speaker intuitions about one’s language are generally a poor guide to linguistic reality (e.g. Sinclair 1991: 4; Sampson 1996). In other words: We cannot trust our private ‘Sprachgefühl’. My account in this section is particularly indebted to Sinclair, who has arguably developed the theoretical debate further than any other corpus linguist. Sinclair’s dictum that ‘usage cannot be invented, it can only be described’ (Sinclair 1987: xv) informs the vast lexicographic and grammatical work on the COBUILD Bank of English, and throws down the gauntlet to traditional language description. An important statement of his position is to be found in Sinclair (1991), from which the following observations are taken: (i) distinctions in meaning are always accompanied by distinctions in formal patterning in the text (p. 6); (ii) meaning is contextual in the broadest sense, and the meaning of a word is the product of its context, not vice versa; (iii) the more common a particular word, the greater the number of senses it has (and the greater the number of patterns it enters into) (p. 101); (iv) our unreflecting familiarity with our language blinds us to the fact that ‘most everyday words do not have an independent meaning’ (p. 108). From this it will be evident that close observation of the way words behave in context leads to some quite radical conclusions not just about the nature of words, but also the nature of texts, and of language itself.
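The working definition of the word given above, a string of characters bounded by spaces or followed by a punctuation mark, translates almost directly into code. The following is a minimal sketch under that definition only; the function name tokenize, the particular set of punctuation marks, and the sample sentence are assumptions made for illustration, and real corpus software handles many further cases (hyphens, abbreviations, numerals).

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace, then strip punctuation marks from the edges of each string."""
    tokens = []
    for chunk in text.split():
        word = chunk.strip(".,;:!?()\"'„“”‚‘’")
        if word:
            tokens.append(word.lower())
    return tokens

# Invented sample sentence (assumption: not drawn from any corpus).
sample = "Die Sprache sieht anders aus, wenn man viel davon sieht."
frequencies = Counter(tokenize(sample))
print(frequencies.most_common(3))
```

A frequency list of this kind is typically the first view of a corpus that an analyst sees, which is one reason why the initial focus tends to fall on the individual word.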

Sinclair also throws into question traditional assumptions about the theoretical distinction between lexis and grammar/syntax. He finds that the physical evidence of ‘collocational attractions’ implies a widespread process of lexical co-selection: ‘if words collocate significantly, then to the extent of that significance, their presence is the result of a single choice’ (p.113).24 Building on these observations, he argues that text is structured at any given point by one of two principles, the ‘open-choice principle’ and the ‘idiomatic principle’. The former is a ‘slot and filler’ model which typically underlies traditional grammar. The latter has been neglected, but the investigation of corpus data shows it to be at least as significant. The idiomatic principle is the result of ‘a large number of semi-preconstructed phrases that constitute a single choice, even though they might appear to be analysable into segments’ (p. 110). Thus, instead of seeing grammatical frames into which lexical items are slotted according to their class membership, we are invited to see clusters of lexical items which, though they may not be physically adjacent, are dependent on a single choice for their particular occurrence in the text. The existence of multiple-word units, at a level between the word and the clause, and which frequently do not occur sequentially, will probably not come as a great surprise to experienced language teachers, who have always taught vocabulary as a branch of ‘idiom’ or ‘phraseology’. The theoretical significance of Sinclair’s argument, however, is far-reaching: traditional language description, he claims, has been guilty of unjustly ‘decoupling’ lexis and syntax (p. 104), thus obscuring the idiomatic principle and ignoring the syntactic constraints on much lexis, and the lexical nature of much syntax. 
It will be evident by now that having access to large language corpora does not mean that in describing the language we merely replace intuited examples of language in use with authentic ones, while everything else stays the same. The language, Sinclair remarks, ‘looks rather different when you look at a lot of it at once’ (p. 100).

11. Critique of Saussure and Chomsky

It may come as a surprise to some to learn that the standing of two of the twentieth century’s most influential thinkers about language, Saussure and Chomsky, is rather low in corpus circles. This is partly explained by the fact that, as Stubbs (1996: 22—50) points out, the major intellectual tradition in which corpus linguistics has operated is the British empiricist tradition beginning with J. R. Firth and including the work of Michael Halliday, Randolph Quirk, Geoffrey Leech, and John Sinclair.25 This tradition is contextual as well as empiricist: ‘the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously’ (Firth 1935: 37). Its continuation in corpus linguistics has led to a substantial challenge to Saussurian and Chomskyan models of language. Chomsky’s discounting of the empirical data of ‘performance’ in favour of a posited ‘competence’ located in a fictional ‘idealized speaker-hearer’, his focus on abstracted ‘sentences of the language’ rather than actual texts, makes him a marginal figure from the perspective of corpus studies. Stubbs (1997: 154) attacks Chomsky’s lack of concern for a theory of ‘performance’: it is, he remarks, ‘all that remains once we have explained competence. Even so, what remains is language use in its entirety’.26 A particularly interesting account is given by Geoffrey Sampson of his conversion from Chomskyan to corpus linguistics under Geoffrey Leech (‘the best career move I ever made’).27

Saussure’s major precepts are also being questioned as corpus-oriented theorists move ‘beyond Saussurian dualisms’ (Stubbs 1996: 44). Lexical co-selection undermines the strict opposition between Saussure’s syntagmatic and paradigmatic axes. Perhaps the most far-reaching revision, however, relates to Saussure’s fundamental distinction between ‘langue’ and ‘parole’. In Saussurian terms, textual analysis of corpora stands open to the charge that it is merely concerned with ‘parole’, not engaging with the underlying rule-governed system of the ‘langue’. Even if we accept Saussure’s dualism and the value-judgment inherent in it, this charge looks a lot less persuasive when hundreds of millions of words of ‘parole’ can be systematically interrogated. It may never be possible to cover ‘the whole language’ (if such a thing exists), but as corpora increase in size the likelihood diminishes of some significant feature, such as a structural pattern, remaining unattested. However, Sinclair and others reject the distinction between an ‘abstract system’ (Saussure’s ‘langue’, Chomsky’s ‘competence’) and particular ‘instances’ of the system (Saussure’s ‘parole’, Chomsky’s ‘performance’), as an unnecessary abstraction which perpetuates a misconceived notion of language structure: ‘the main simplification that is introduced by conventional grammar has nothing to do with the purity of abstraction as against the chaos of life. It is merely the decoupling of lexis and syntax’ (Sinclair 1991: 104). By contrast, the task of corpus linguistics is to ‘exemplify the dominant structural patterns of the language without recourse to abstraction, or indeed to generalization’ (p. 103). 
In this respect, corpus linguistics is one of several recent developments in linguistics (one thinks, for example, of conversation analysis and other areas of pragmatics and critical discourse analysis) to concern themselves with what Saussurians would regard as ‘only’ parole, and to reject not just the value judgment inherent in the dichotomy but the dichotomy itself.

Somewhat paradoxically, whilst Saussure’s insistence on descriptive rather than prescriptive linguistics is essentially empiricist in spirit, Sinclair stresses the need for evaluative selection of data, in which there is an element of subjective input by the researcher in attempting to bring out the typical features of the language from the mass of data. (Some corpus linguists, however, believe that at some point this process will become automated using statistical criteria.28 It is a moot point whether this goal will be attained, or indeed is desirable.) Thus, Sinclair does not, surprisingly perhaps, regard prescriptive studies as taboo. They fall into disrepute ‘only when they ignore or become detached from evidence’ (Sinclair 1991: 61).

12. The essays in The relevance of corpora to German studies

The essays collected in this volume generally resist a neat compartmentalization: they illustrate the wide range of applications of corpus-based methods across the broad spectrum of the discipline.

The papers by Dodd, Gupta, Lawson, and Witton exploit corpus data from Mannheim, in what might be termed corpus-based reassessments of earlier descriptive and theoretical work in the linguistics of contemporary German. Bill Dodd examines the ordering of expressions containing ‘Ost’ and ‘West’, as in West-Ost-Gefälle and Verhandlungen zwischen Ost und West, from three IdS corpora dating from before and during German unification. He finds that whilst the sequence Ost + West is consistently the ‘norm’, the frequency of the ‘minority’ sequence West + Ost increases noticeably in the data from 1989—90. In looking mainly at binomial expressions this study revisits a classic study on binomial reversibility by Yakov Malkiel (1959). Dodd also investigates possible semantic and pragmatic implications of the sequence when it is reversible, keeping in view the possibility that this political key term may also have a role in inscribing a larger, ideological discourse. Some at least of the more unusual examples from the time of the ‘Wende’ appear to come from the political leadership.
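Counting the two orderings of such a binomial is a simple pattern-matching task, which might be sketched as follows. This is not the procedure used in the study: the regular expression covers only hyphenated and ‘und’-linked pairs, the names PAIR and sequence_counts are hypothetical, and the sample text is invented rather than taken from the IdS corpora.

```python
import re
from collections import Counter

# Assumed pattern: 'Ost' and 'West' joined by a hyphen or by 'und'.
PAIR = re.compile(r"\b(Ost|West)(?:-|\s+und\s+)(Ost|West)\b")

def sequence_counts(text):
    """Count how often each ordering (Ost+West, West+Ost) occurs in `text`."""
    counts = Counter()
    for first, second in PAIR.findall(text):
        if first != second:  # ignore repetitions such as 'Ost und Ost'
            counts[f"{first}+{second}"] += 1
    return counts

# Invented sample text (assumption: not corpus data).
sample = (
    "Verhandlungen zwischen Ost und West. "
    "Das West-Ost-Gefälle wächst. "
    "Der Ost-West-Konflikt endet."
)
counts = sequence_counts(sample)
print(counts)
```

Run separately over sub-corpora from different periods, counts of this kind would give the relative frequencies on which a comparison over time could be based.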

Piklu Gupta’s paper uses corpus evidence to revisit and re-evaluate earlier studies of the syntactic and, especially, semantic valency characteristics of be-prefixed verbs in German. Pointing out that the verb alternations most frequently expressed by be-prefixation are realized differently in English, producing contrastively interesting variations in syntactic patterning between translation equivalents, he focuses particularly on the taxonomy proposed by Hartmut Günther. Günther’s distinctions and classifications, based largely on intuition and personal observation, hold up well in the face of corpus data. Endorsing Günther’s comments on specialized use of verbs in particular registers, Gupta is led to argue for a greater use of specialized sub-corpora in this kind of investigation.

Nic Witton offers a corpus-based study of the relative occurrence of the periphrastic (analytic) second subjunctive form (e.g. ‘würde kommen’) and its equivalent synthetic form (e.g. ‘käme’) of eight common verbs in corpora of newspaper texts from the 1960s and the 1990s. His aim is to establish whether a case can be made for a change in usage in public written texts over this period. Focusing on these eight high-frequency verbs, and returning to seminal studies on this topic by Siegfried Jäger and Karl-Heinz Bausch, his purpose is twofold: to ‘fill the information gap’ left by these studies by investigating the evidence for a shift towards the analytic form in written standard German; and to investigate the various functions of the two forms. His initial hypothesis, that the analytic construction would have made inroads on the Subjunctive II forms in the intervening years, is not borne out by his findings. Indeed he finds a shift in the opposite direction, which he interprets as evidence of a continuing ‘inherent conservatism in the print media’.

Jan Svartvik has remarked that ‘conversation — the quintessence of spoken language — is either missing or seriously underrepresented in most existing corpora’ (‘Corpora are becoming mainstream’, in Thomas and Short 1996: 10). This fundamental problem needs to be acknowledged, and addressed. The 600 000-word Brigham Young corpus of spoken German is thus a remarkable and valuable asset for examining features of the spoken language as used in conversation. The study by Randall Jones in this volume examines the way the set of ‘dative/accusative’ prepositions are used in this corpus, with some interesting findings — for example that their use in a spatial (locative or directional) sense is the exception rather than the rule. Generally, Jones observes, the distribution of case government is far from equal, even for a given preposition, and the ‘classic’ grammatical explanation of the case distinction (‘wo/wohin?’) is of limited use. The typology of prepositional usage offered in this study, using authentic data, provides some very useful material for language learners. It also provides a useful first step to a comparison of prepositional use in spoken and written German.

April Mackison’s study is based on a corpus of some one million tokens constructed by her at the University of Birmingham, and consisting of whole texts taken over the same time span (1991—94) from two journals, Wirtschaftswoche and technologie + management. Her analysis of the frequency and distribution of the German equivalents of English ‘manager’ (Manager, Leiter, Führer, Chef, Boß) and ‘management’ (Management, Leitung, Führung), key lexical fields in management discourse, forms the basis of a contrastive study which reveals a ‘mirror image’ pattern of distribution in the two publications. She argues that these initial findings represent an important first step in a linguistic study of register variation which she also believes reveals important insights into the different assumptions each periodical makes about its readership.

The paper by Anne Wichmann and Jane Nielsen explores the linguistic means by which ‘contractual modalities’ are expressed in German legal contracts. This study exploits a specially tagged small corpus of selected legal documents, totalling some 25 000 tokens and constructed at the University of Central Lancashire. By tagging implicit as well as explicit expressions of modality, the authors are able to investigate the relative frequencies of the various means by which obligations and rights find expression in these texts. Their findings suggest that the use of modal verbs is actually one of the less frequent modes of such expression. Instead, lexical expressions and lexical verbs used in the present tense predominate, a finding which they argue is in keeping with the implicitly performative nature of this general text-type.

Wichmann and Nielsen’s study has immediate applications for the training of specialist translators, and points up possibilities for future work in this and similar specialist registers. The tremendous potential of corpora for translation studies is also evident in Dorothy Kenny’s paper, which is based on a specially constructed parallel corpus of modern German literary texts and their professional English translations. Collocational evidence drawn from control corpora for English (the British National Corpus) and German (the IdS public corpus) enables her to examine in detail the extent to which creative manipulations of semantic preferences and semantic prosodies by German authors are captured by their translators. Her paper demonstrates the inestimable value of such parallel and control corpora for the teaching and practice of translation.

Two essays are devoted to literary texts. Gordon Burgess’s study of Die Wahlverwandtschaften uses concordance techniques to examine, for example, the use of particular verbs introducing indirect speech in the exchanges between Eduard and Charlotte, and the deployment of leitmotif. He also uses statistical data to compare the novella within the novel with the rest of the novel in general and with Ottilie’s diary extracts, in what he terms ‘an offshoot of authorship studies’. Computer-supported findings, he notes, are not necessarily revolutionary, and one of the interesting features of this essay is the way Burgess pursues the twin objectives of illuminating certain facets of the novel using the computer as an impartial research tool, while commenting on the potential strengths but also the shortcomings of such an approach, which, he notes, always needs supplementing by human intervention.

Ann Lawson’s study exploits a machine-readable version of Thomas Mann’s Joseph und seine Brüder. Coming to the corpus evidence with a close knowledge of this long text, she discovers that her memory is surprisingly corrected by the data, especially her perception of the phrase schöne Geschichte as a central and recurring motif. This leads her to look to corpus evidence to explore an intuitive insight about the way Mann manipulates patterns of language to ‘weave a tapestry of image, irony, and "spielender Geist"’ in the novel. These patterns are explored both locally, within particular sections of the novel, and comparatively, with reference to Mann’s contemporary speeches. She examines the collocational evidence for Mann’s use of the polysemous key word Geschichte (‘story/history’) and relates her findings to Mann’s linguistic strategies in the novel for subverting fascist discourse.

The potential of corpora as a tool to aid students’ foreign-language learning is illustrated in Peter Roe’s account of the Grammar in Context interactive program developed at Aston University as part of a collaborative venture with the University of Coventry. Based on a specially constructed corpus of about 100 000 tokens drawn from German language material for first-year undergraduates at British universities and, to a lesser extent, from A-level examination boards, this program enables students to explore typical lexical and grammatical patterns without resorting to cumbersome metalanguage. The pedagogic philosophy underpinning the design of this material views successful student-centred learning as a combination of meaningful input, focus on regularities rather than exceptions, and a judicious balance between explicit and implicit modes of developing grammatical competence. Roe also reports on a subsequent development of this model, Language inSight, which contains more sophisticated search tools.

Finally, an insight into the principles of corpus construction is offered by Jonathan West in his report on his work as one of a team of scholars working on the Frühneuhochdeutsches Wörterbuch. In addition to producing the first scholarly dictionary of Early New High German, the goal of this project is to construct a machine-readable corpus of some 500 ENHG texts, and a corpus of some 45 million words is in preparation at Newcastle. His paper describes in detail the complexion of the corpora on which this work is based, their preparation and marking up, and their lexicographical exploitation. Some of the problems encountered, for example the at times uneven distribution of the textual evidence and the lack of a standardized orthography for the German-speaking areas, are also discussed. He points out that the creation of a reliable corpus is particularly necessary for work on a ‘dead language’, since scholars cannot rely on their subjective knowledge of the modern language.

The very diversity of these contributions is itself proof of the potential of corpus-based approaches to contribute to virtually every area of the discipline. It is always difficult, and probably foolish, to predict the future, but it seems likely that before long the sheer availability of these tools and resources will attract more and more researchers and teachers to make use of them. Some will no doubt specialize in corpus methods as a discipline in its own right; most of us will probably be content to use corpora to support our work where it is convenient and useful to do so, though we will probably have to become more numerate and statistically aware if we want to make statements about ‘typical’, ‘representative’, or unusually ‘significant’ findings. So while it is true that the growth of corpus-based studies will of itself generate new areas of enquiry, it is probably the case that for most researchers and teachers corpora will be seen, in the words of one practitioner, ‘as a complementary approach to more traditional approaches, rather than as the single correct approach. In fact, research questions for corpus-based studies often grow out of other kinds of investigations’ (Biber et al. 1998: 9–10). It is to be hoped that the work presented in this collection of essays will persuade more Germanists to consider what corpora can do for them, and prompt further work using this exciting resource.


13. Bibliography

Aijmer, Karin and Bengt Altenberg (1991), English Corpus Linguistics. Studies in honour of Jan Svartvik. Longman: New York and London.

al-Wadi, Doris (1994), COSMAS Benutzerhandbuch, Version R.1.3-1. Institut für deutsche Sprache: Mannheim.

Anderson, Mark (1992), Kafka’s Clothes. Ornament and aestheticism in the Habsburg fin de siècle. Clarendon: Oxford.

Baker, Mona (1995), ‘Corpora in translation studies: an overview and some suggestions for future research’, Target 7(2): 223—43.

Baker, Mona (1998), ‘Investigating the language of translation: a corpus-based approach’ in Laviosa (ed.), The Corpus-Based Approach: a new paradigm in translation studies (special edition of Meta): 480—5.

Baker, Mona (ed.) (1998a), Routledge Encyclopaedia of Translation Studies. Routledge: London.

Barnbrook, Geoff (1996), Language and Computers. A practical introduction to the computer analysis of language (Edinburgh Textbooks in Empirical Linguistics). Edinburgh University Press: Edinburgh.

Biber, Douglas (1988), Variation across Speech and Writing. Cambridge University Press: Cambridge.

Biber, Douglas, Susan Conrad and Randi Reppen (1998), Corpus Linguistics. Investigating language structure and use. Cambridge University Press: Cambridge.

Boa, Elizabeth (1996), Kafka. Gender, class and race in the letters and fictions. Clarendon: Oxford.

Böke, Karin, Matthias Jung and Martin Wengeler (eds) (1996), Öffentlicher Sprachgebrauch. Praktische, theoretische und historische Perspektiven. Georg Stötzel zum 60. Geburtstag gewidmet. Westdeutscher Verlag: Opladen.

Botley, Simon, Julia Glass, Tony McEnery and Andrew Wilson (eds) (1996), Proceedings of Teaching and Language Corpora 1996 (UCREL Technical Papers, Vol. 9), Lancaster.

Cornell, Alan and Ian Roe (1999), ‘A valency dictionary for English-speaking learners of German’, in Steve Giles and Peter Graves (eds), From Classical Shades to Vickers Victorious: Shifting Perspectives in British German Studies, Peter Lang: Bern/Berlin, pp. 153–70.

Dodd, Bill (1997), ‘Exploiting a corpus of written German for advanced language learning’ in A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds), Teaching and Language Corpora. Longman: London, pp. 131—45.

Dodd, Bill (ed.) (2000), The Relevance of Corpora to German Studies. Birmingham: Birmingham University Press.

Drosdowski, Günther (1970), Stilwörterbuch der deutschen Sprache, sechste Auflage, Bibliographisches Institut AG: Mannheim.

Durrell, Martin (1996), Hammer’s German Grammar and Usage. Third Edition. Edward Arnold: London.

Farrington, Brian (1996), ‘Data-driven learning: a new horizon for CALL’ in R. Adamson et al. (eds), Ça m’inspire. Mélanges en l’honneur du Professor S.S.B. Taylor (New Directions in French Language Studies). University of Dundee: Dundee, pp. 177–92.

Fernandez-Villanueva, Marta (1996), ‘Research into the functions of German modal particles in a corpus’ in Botley et al. (eds), Proceedings of Teaching and Language Corpora 1996 (UCREL Technical Papers, Vol. 9), Lancaster, pp. 83–93.

Firth, J. R. (1935), ‘The technique of semantics’, in Transactions of the Philological Society, pp. 36–72.

Gilman, Sander (1995), Franz Kafka, the Jewish Patient. Routledge: New York and London.

Goethe, Johann Wolfgang von (1995), Die Leiden des jungen Werther, Philipp Reclam jnr.: Stuttgart, Silver Spring, Berlin.

Good, Colin (1985), ‘Aspektkatalog zur Texterschließung’ in Good, Presse und soziale Wirklichkeit. Ein Beitrag zur ‘kritischen Sprachwissenschaft’. Schwann: Düsseldorf, pp. 19–46.

Große Konkordanz zur Luther Bibel (1979), Calwer; Christliches Verlagshaus: Stuttgart.

Jappy, Tony (1996), ‘Investigating grounding across narrative and oral discourse’ in Botley et al. (eds), Proceedings of Teaching and Language Corpora 1996 (UCREL Technical Papers, Vol. 9), Lancaster, pp. x-xx.

Johns, Tim (1991), ‘Should you be persuaded: two samples of data-driven learning materials’ in Johns and Philip King (eds) (1991), pp. 1–13.

Johns, Tim (1993), ‘Data-driven learning: an update’, TELL&CALL (1993/2): 4—10.

Johns, Tim (1997), ‘Contexts: the background, development and trialling of a concordance-based CALL program’ in A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds), Teaching and Language Corpora. Longman: London, pp. 100–15.

Johns, Tim (forthcoming), ‘Reciprocal learning: a practical application of parallel concordancing’.

Johns, Tim and Philip King (eds) (1991), ‘Classroom concordancing’, Birmingham University English Language Research Journal 4: 27–45.

Jones, Randall (1997), ‘Creating and using a corpus of spoken German’, in A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds), Teaching and Language Corpora. Longman: London, pp. 146—56.

Kafka, Franz (1997), Die Verwandlung. Philipp Reclam jnr.: Stuttgart, Silver Spring, Berlin.

Kennedy, Graeme (1998), An Introduction to Corpus Linguistics. Longman: London and New York.

Kenny, Dorothy (1998), ‘Creatures of habit? What translators usually do with words’, in Laviosa (ed.), The Corpus-Based Approach: a new paradigm in translation studies (special edition of Meta): 515–23.

Kjellmer, Göran (1994), A Dictionary of English Collocations: based on the Brown corpus. Clarendon Press: Oxford.

Laviosa, S. (ed.) (1998), The Corpus-Based Approach: a new paradigm in translation studies (special edition of Meta).

Leisi, Ernst (1975), Der Wortinhalt. Seine Struktur im Deutschen und Englischen (fifth edition). Quelle and Meyer: Tübingen.

Lewis, D. R. (1998), ‘Accessing multilingual texts: evaluating a literary translation using computer-based text-alignment techniques’ in Maschinelle Verarbeitung altdeutscher Texte. Internationales Colloquium 1997. Niemeyer: Tübingen.

Louw, Bill (1993), ‘Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies’ in Mona Baker, Gill Francis, and Elena Tognini-Bonelli (eds), Text and Technology: in honour of John Sinclair. John Benjamins: Amsterdam and Philadelphia, pp. 157–76.

Malkiel, Yakov (1959), ‘Studies in irreversible binomials’, Lingua 8: 113—60.

McEnery, Tony and Andrew Wilson (1996), Corpus Linguistics (Edinburgh Textbooks in Empirical Linguistics). Edinburgh University Press: Edinburgh.

McKinnon, Alastair (1972), Ausgewählte Konkordanz zu Wittgensteins Philosophischen Untersuchungen. Blackwell: Oxford.

Pemberger, Marianne (1995), ‘Konkordanzen bei der mündlichen Reifeprüfung’, TELL&CALL (1995/1): 26—8.

Pusch, Luise (1984), ‘Sie sah zu ihm auf wie zu einem Gott. Das Duden-Bedeutungswörterbuch als Trivialroman’, in Pusch, Das Deutsche als Männersprache. Suhrkamp: Frankfurt/Main, pp. 135–44.

Sampson, Geoffrey (1996), ‘From central embedding to corpus linguistics’, in Jenny Thomas and Mick Short (eds), Using Corpora for Language Research. Longman: London and New York, pp. 14–26.

Scott, Mike (1996), Wordsmith (Version 2). Oxford University Press: Oxford.

Scott, Mike and Tim Johns (1993), Microconcord. Oxford University Press: Oxford.

Sinclair, John (ed.) (1987), Collins Cobuild English Language Dictionary. HarperCollins: London and Glasgow.

Sinclair, John (ed.) (1990), Collins Cobuild English Grammar. HarperCollins: London and Glasgow.

Sinclair, John (1991), Corpus, Concordance, Collocation. Oxford University Press: Oxford.

Sinclair, John, et al. (1995), Collins Cobuild Bridge-Bilingual English—Portuguese Dictionary. HarperCollins: London and Glasgow.

Speidel, W. (1978), A Complete Contextual Concordance to Franz Kafka, ‘Der Prozeß’. W.S. Maney and Son: Leeds.

Stötzel, Georg and Martin Wengeler (1995), Kontroverse Begriffe. Geschichte des öffentlichen Sprachgebrauchs in der Bundesrepublik Deutschland, Walter de Gruyter: Berlin/New York.

Strauß, G., U. Haß and G. Harras (1989), Brisante Wörter von Agitation bis Zeitgeist, de Gruyter: Berlin/New York.

Stubbs, Michael (1996), Text and Corpus Analysis. Computer-assisted studies of language and culture. Blackwell: Oxford.

Stubbs, Michael (1997), ‘"Eine Sprache idiomatisch sprechen": Computer, Korpora, kommunikative Kompetenz und Kultur’ in K. J. Mattheier (ed.), Norm und Variation, Peter Lang: Frankfurt/Main, pp. 151—67.

Svartvik, Jan (1996), ‘Corpora are becoming mainstream’ in Jenny Thomas and Mick Short (eds), Using Corpora for Language Research. Longman: London and New York, pp. 3—13.

Teubert, Wolfgang (1989), ‘Politische Vexierwörter’ in J. Klein (ed.), Politische Semantik. Westdeutscher Verlag: Opladen, pp. 51—68.

Teubert, Wolfgang (1996), ‘Comparable or parallel corpora?’, International Journal of Lexicography 9(3): 38—64.

Teubert, Wolfgang (ed.) (1998), Neologie und Korpus, (Forschungen des Instituts für deutsche Sprache, Band 11). Gunter Narr: Tübingen.

Townson, Michael (1992), Mother-tongue and Fatherland. Language and politics in German. Manchester University Press: Manchester and New York.

Thomas, Jenny and Mick Short (eds) (1996), Using Corpora for Language Research. Longman: London and New York.

West, Jonathan (1992—94), Progressive Grammar of German. Authentik: Dublin.

West, Jonathan (1999), ‘A functional-notional grammar of modern German’, in Steve Giles and Peter Graves (eds), From Classical Shades to Vickers Victorious: Shifting Perspectives in British German Studies, Peter Lang: Bern/Berlin, pp. 139–52.

Wetzel, Heinz (ed.) (1971), Konkordanz zu den Dichtungen Georg Trakls. Otto Müller: Salzburg.

Wichmann, Anne (1995), ‘Using concordances for the teaching of modern languages in higher education’, Language Learning Journal 11: 61—3.

Williams, Raymond (1976), Keywords, Fontana: London.

Wimmer, Rainer (1982), ‘Überlegungen zu einer linguistisch begründeten Sprachkritik’ in H.-J. Heringer (ed.), Holzfeuer im hölzernen Ofen. Aufsätze zur politischen Sprachkritik. Gunter Narr: Tübingen, pp. 290—313.

Wisbey, Roy (1968), A complete concordance to the Vorau and the Strassburg Alexander. Edward Maney: Leeds.

Wisbey, Roy (ed.) (1971), The Computer in Literary and Linguistic Research. Cambridge University Press: Cambridge.


1. For work on German see for example Teubert (1998, 1996); also Dodd (1997), Fernandez-Villanueva (1996), Jones (1997), Pemberger (1995), Wichmann (1995).

2. al-Wadi (1994). For further information contact Dr Doris al-Wadi. A useful list of currently available corpora of English and software tools can be found in Biber et al. (1998: 281-7).

3. The acronym stands for ‘Collins Birmingham University International Language Database’.

4. Further information can be obtained from Tim Johns' website.

5. Although use for academic purposes is normally envisaged, care should of course be taken in all cases to observe the terms of the licence.

6. In addition to the work of Sinclair and Stubbs, see for example: Aijmer and Altenberg (1991); Barnbrook (1996); McEnery and Wilson (1996); Thomas and Short (1996); Kennedy (1998); Biber et al. (1998).

7. Using Mike Scott and Tim Johns's Microconcord, published by Oxford University Press in 1993 (now no longer available from OUP).

8. Mike Scott, Wordsmith (Version 2), published by Oxford University Press in 1996. For further information see Mike Scott's homepage.

9. A random collection of texts is sometimes referred to as a text archive.

10. The terms ‘parallel corpus’ and ‘comparable corpus’ are used differently by many scholars in translation studies. See the section on translation studies later in this Introduction.

11. For a summary of Biber's work on register see Kennedy (1998: 186). Several of the linguistic features used by Biber in his work on register variation in English clearly have no direct equivalents in German, and this raises the question of the extent to which such a typology can readily be transferred from English to German.

12. Another source of data would be studies of German which identify this phenomenon, though they may not use the term semantic prosody. Examples can be found in Teubert's (1989: 62-3) corpus-based observations on the use of Subvention as a ‘politisches Vexierwort’, and in Kenny's (1998) study of semantic prosody as a problem in translation, which focuses on the negative collocational environments of (British) English giro and the consequent inadequacies of Scheckheft as a translation equivalent. See also note 21 below.

13. West (1992-95) is also based on consultation of a small corpus. West (1999) and Cornell and Roe (1999) report on forthcoming reference works which use corpus evidence.

14. For example, the publications issuing from the Institut für deutsche Sprache, and, increasingly, the major publishing houses in Germany such as Duden and Langenscheidt, are now informed by corpus data. Collins in Glasgow currently have a German corpus of some 80 to 90 million words, shortly to rise to 150 million (personal communication from Horst Kopleck, Managing Editor for German).

15. Cf. Stubbs (1997: 161) for further comments on the collocational information contained in the Stilwörterbuch.

16. The Stilwörterbuch consulted contains no entry for bewirken and suggests an exclusively negative prosody for hervorrufen.

17. See for example Kjellmer (1994) on English.

18. See for example the COBUILD series, including Sinclair (1987, 1990), and the Bridge-Bilingual English-Portuguese Dictionary (Sinclair et al. 1995). There is currently no German-English bridge-bilingual dictionary.

19. Brisante Wörter is based in part on evidence from the IdS ‘Handbuchkorpora’ of 1986 and 1987.

20. ‘[weil]… sogar die häufigsten Wörter, insbesondere in ihren typischen, zentralen Verwendungen, starke kulturelle Konnotationen ausdrücken’.

21. See for example Anderson (1992), Boa (1996), Gilman (1995).

22. The use of these terms also varies amongst linguists. Stubbs (1996) uses the terms ‘genre’ and ‘text-type’ synonymously, but Biber sees an important distinction, genre denoting ‘categorizations assigned on the basis of external criteria’, and text-type ‘groupings of texts that are similar with respect to their linguistic form, irrespective of genre categories’ (Biber 1988: 70). See also Aijmer and Altenberg (1991: 204-20).

23. There are corpora at the IdS in Mannheim devoted to the works of Goethe (1.4 million words), the Grimm brothers (0.5 million), and Marx and Engels (2.5 million). In addition, the Mannheimer Korpus I (MK1) includes the following works: Heinrich Böll: Ansichten eines Clowns; Werner Bergengruen: Das Tempelchen; Max Frisch: Homo faber; Günter Grass: Die Blechtrommel; Uwe Johnson: Das dritte Buch über Achim; Thomas Mann: Die Betrogene; Erwin Strittmatter: Ole Bienkopp.

24. What constitutes significant collocation, however, can only be answered statistically. Raw frequencies of lexical items will not do, since not all items in the lexicon have an equal statistical likelihood of occurring in a given text (see Barnbrook 1996: 87-106). Hence the need to consider the relative frequencies of two collocates, and the concept of ‘upward collocation’ (in which the node word has a lower absolute frequency in a given text or corpus than its collocate, e.g. ‘went back’, where we are looking at the collocates of ‘back’), and ‘downward collocation’, in which the situation is the reverse (e.g. ‘arrived back’) (Sinclair 1991: 116).

25. For a review of the Lancaster tradition of corpus linguistics and Leech's contribution, see Jan Svartvik, ‘Corpora are becoming mainstream’, in Thomas and Short (1996: 3-13).

26. ‘Performanz ist alles, was übrig bleibt, wenn wir Kompetenz erklärt haben. Allerdings ist alles, was übrig bleibt, der ganze Sprachgebrauch.’

27. See Sampson (1996).

28. See for example Cyril Belica's account of a strategy to automate the detection of neologisms in a time-phased corpus, the IdS ‘Wendekorpus’: ‘Statistische Analyse von Zeitstrukturen in Korpora’ (Teubert 1998: 31-42).
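
The distinction drawn in note 24 between upward and downward collocation can be illustrated with a short sketch in Python. It is offered only as an illustration under stated assumptions, not as the procedure used in any of the studies cited: the function names, the simple word tokenizer, and the default span of four words either side of the node are all hypothetical choices. The sketch labels each collocate by comparing its corpus frequency with that of the node word; it does not test whether a collocation is statistically significant, which, as the note stresses, raw frequencies alone cannot establish.

```python
import re
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens either side of each
    occurrence of `node` (the span size is an illustrative assumption)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


def classify_collocates(text, node, span=4):
    """Label each collocate of `node` as 'upward' (the collocate is more
    frequent in the text than the node), 'downward' (less frequent) or
    'neutral' (equally frequent), together with its co-occurrence count."""
    # Crude tokenizer for illustration: lower-cased runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    freq = Counter(tokens)
    result = {}
    for word, co_freq in collocates(tokens, node, span).items():
        if word == node:
            continue  # ignore the node co-occurring with itself
        if freq[word] > freq[node]:
            direction = "upward"
        elif freq[word] < freq[node]:
            direction = "downward"
        else:
            direction = "neutral"
        result[word] = (direction, co_freq)
    return result


if __name__ == "__main__":
    sample = ("the cat sat on the mat and the dog sat by the door "
              "while the cat slept")
    for word, label in sorted(classify_collocates(sample, "sat", span=2).items()):
        print(word, label)
```

On a real corpus the frequency comparison would be made against the whole corpus rather than a single sentence, and a significance measure of the kind discussed by Barnbrook (1996: 87-106) would be needed before any collocate were treated as typical.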

