The British National Corpus: facts and figures | Oxford Learner's Dictionaries

What is a corpus?

A corpus is a large collection of written or spoken texts, held as a database that can be searched to show all the instances of a particular word and the contexts in which it is used.

The BNC is very, very big

  • The BNC contains over 100 million (100,106,008) words of modern English.
  • It took 4 years to build.
  • It comprises 4,124 texts.
  • There are six and a quarter million sentence units in the whole corpus.
  • Each word is automatically assigned a part-of-speech code; 65 parts of speech are identified.
  • It occupies 1.5 gigabytes of disk space, the equivalent of more than 1000 high-capacity floppy disks.
  • The whole corpus printed in small type on thin paper would take up 10 metres of shelf space.
  • Reading the whole corpus aloud at a rate of 150 words a minute, eight hours a day, 365 days a year, would take nearly 4 years.
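The reading-time and floppy-disk figures above can be checked with a little arithmetic. The short Python sketch below does this using the numbers quoted in the list. The floppy disk size of 1.44 megabytes and the use of decimal units (1 gigabyte = 1000 megabytes) are assumptions made for illustration, as are all the names in the code.

```python
# Figures quoted in the list above.
TOTAL_WORDS = 100_106_008
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

# Assumptions, not stated in the text: decimal units and a 1.44 MB floppy disk.
CORPUS_SIZE_MB = 1.5 * 1000
FLOPPY_SIZE_MB = 1.44


def reading_time_years(total_words, words_per_minute, hours_per_day, days_per_year):
    """Return the number of years needed to read total_words aloud at the given pace."""
    minutes = total_words / words_per_minute
    hours = minutes / 60
    days = hours / hours_per_day
    return days / days_per_year


years = reading_time_years(TOTAL_WORDS, WORDS_PER_MINUTE, HOURS_PER_DAY, DAYS_PER_YEAR)
floppies = CORPUS_SIZE_MB / FLOPPY_SIZE_MB

print(f"Reading time: {years:.2f} years")  # about 3.81 years
print(f"Floppy disks: {floppies:.0f}")
```

Under these assumptions the reading time comes out at about 3.81 years, which matches "nearly 4 years", and the disk count comes out at just over 1000.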

The written corpus

90% of the BNC is written language

The written part is made up of:

  • 60% books (academic books and popular fiction)
  • 25% periodicals (regional and national newspapers, specialist periodicals and journals for all ages and interests)
  • between 5 and 10% other kinds of published material (brochures, advertising leaflets, etc.)
  • between 5 and 10% unpublished material (personal letters and diaries, school and university essays, etc.)
  • less than 5% material written to be spoken (political speeches, play texts, broadcast scripts, etc.)

The spoken corpus

10% of the BNC is spoken language

The spoken part is made up of:

  • 50% transcriptions of natural spontaneous conversations
    • 124 volunteers living in 38 locations across the UK recorded all their conversations for 2-3 days.
    • There were equal numbers of men and women, approximately equal numbers from each age group and equal numbers from each of four social groupings.
  • 50% transcriptions of recordings made at four specific types of meeting or event:
    • Educational and informative events (lectures, news broadcasts, tutorials)
    • Business events (sales demonstrations, trades union meetings, job interviews)
    • Institutional and public events (sermons, political speeches, council meetings, parliamentary proceedings)
    • Leisure events (sports commentaries, after-dinner speeches, club meetings, radio phone-ins)

Using the corpus

As lexicographers we would hate to be without a large, well-balanced corpus. It gives us an invaluable picture of the way words are really used today. We use the BNC to confirm our intuitions and also to tell us things we didn't already know, or may not have thought about. We can find out exactly what a word means, rather than what we think it means. We can see how it behaves grammatically and which words it collocates with. We use all this information when writing our learners’ dictionaries.
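To give a rough idea of how collocates can be found, here is a minimal Python sketch that counts the word immediately following a chosen word. The sample sentences are invented for illustration and are not taken from the BNC, and the function names are hypothetical. Real corpus tools look at wider spans and use statistical measures, so this is only a simplified picture.

```python
import re
from collections import Counter

# Invented sample sentences for illustration only; they are not BNC data.
SAMPLE_TEXTS = [
    "Heavy rain fell overnight.",
    "There was heavy traffic on the bridge.",
    "Heavy rain is forecast for Friday.",
    "She carried a heavy bag.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def right_collocates(texts, node):
    """Count the word that immediately follows each occurrence of node."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Stop one token early so that tokens[i + 1] always exists.
        for i, token in enumerate(tokens[:-1]):
            if token == node:
                counts[tokens[i + 1]] += 1
    return counts


counts = right_collocates(SAMPLE_TEXTS, "heavy")
print(counts.most_common(1))  # [('rain', 2)]
```

In this toy sample "rain" follows "heavy" twice, while "traffic" and "bag" follow it once each. With 100 million words instead of four sentences, counts like these start to show which combinations are typical.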

For example, look at this extract from the BNC in which ‘bent’ and ‘on’ have been searched for together.

Extract of BNC

The concordance lines tell us that ‘bent on’ can be followed by a noun or noun phrase, or by a verb in the -ing form. The lines also show clearly that many of the things somebody is bent on, or bent on doing, have something in common. Can you see what it is?

Answer: They are often negative (destroying, destruction, creating hell on earth).
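A concordance of this kind can be sketched in a few lines of Python. The example below searches some sentences for a phrase and lines up each match with the text on either side of it, in the keyword-in-context style. The sentences are invented for illustration and are not BNC data, and the function name kwic and its width parameter are hypothetical; the real BNC software is not shown here.

```python
import re

# Invented example sentences for illustration only; they are not BNC data.
SAMPLE_TEXTS = [
    "She was bent on revenge after the trial.",
    "The rebels seemed bent on destroying the bridge.",
    "He bent the wire into a hook.",
    "They appeared bent on causing as much damage as possible.",
]


def kwic(texts, phrase, width=30):
    """Return one keyword-in-context line for each match of phrase in texts."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            # Take up to `width` characters on each side of the match.
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so the matches line up in a column.
            lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


lines = kwic(SAMPLE_TEXTS, "bent on")
for line in lines:
    print(line)
```

Three of the four sentences match. The sentence about bending the wire is left out because ‘bent’ is not followed by ‘on’ there. Reading down the column to the right of the match is what makes a shared pattern, such as the negative objects here, easy to spot.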

Here is the entry in the Oxford Advanced Learner’s Dictionary that uses this information:

OALD 8 entry for 'bent'