The Sound System of Language

Vivian Cook 

From Inside Language (1997), which is now out of print and slightly dated: this is the prepublication doc file.

Why do people speaking English and French sound so different when the languages look so similar in writing, apart from a few accents? This chapter looks at some of the properties of sound systems that languages have in common and some of the ways in which they differ. It starts by discussing the ways in which languages use the pitch variation of intonation, goes on to the mechanisms by which speech sounds are produced, and then turns to how sounds are organised in speech. Its theme is the diverse ways in which languages make use of the same resources for producing speech.

The pronunciation of English is taken as a starting point. However, this decision immediately raises the problem of selecting one out of the many accents of English. Most textbooks choose the variety of English called RP, originally taken from ‘Received Pronunciation’, now mostly known simply by the initials RP on their own. Usually RP is thought of as the accent of educated British speakers of English, which does not vary according to region within England, though it differs from other world accents such as General American English. RP accent is then a different concept from the standard language of English, which has a consistent grammar and vocabulary almost everywhere it is spoken, apart from a few well-known local peculiarities.

Choosing RP as a reference point is not to suggest that more than a small minority of people in England speak it, 3% on one estimate. In general the majority of educated speakers in England nowadays have a ‘modified RP’ with some regional forms. The Today news programme on BBC Radio 4, for instance, has a range of English speakers each day. In a recording of a typical programme, although about two-thirds of the speakers were within the general bounds of RP accent, most of them had some regional forms. The continuity announcer, for instance, said fastest with the short ‘a’ sound of lass /æ/ rather than the long ‘ah’ sound of last /a:/, showing she probably comes from north of an east-west line from Wales to the Wash, rather than from the South.

The weatherman pronounced throughout with an initial ‘f’, a feature often associated with a London accent. Variations in English accent will be dealt with in Chapter Seven.

RP is used as a reference point, not because of its intrinsic superiority to other accents, but because everyone is familiar with it, in the rest of the world as well as in England, particularly through the media. It has been used as the model for teaching English as a Foreign Language in countries where British English is taught rather than American. RP also carries social prestige, as popular terms for it such as ‘BBC English’, and ‘Oxford English’ suggest. For centuries versions of RP have been the accent of a particular ‘class’ rather than a particular region, the south of England having an increased proportion of RP speakers because of its comparative prosperity. The social pressure in the UK is still towards the RP accent, as numerous newspaper discussions and pronouncements by Ministers of Education proclaim each year. Successive generations of British students have assured me that they have nothing against regional accents - but most of them say so in modified RP accents that belie their good intentions. Hence, other things being equal, RP is still the most useful accent to describe. More practically, however, RP has been studied in greater depth than any other accent and so more is known about its peculiarities.

1. Sounds and Meanings

This section looks at whether there is a logical link between the sounds of speech and meanings or whether the relationship is purely arbitrary. In a sense there is no reason why a particular speech sound should convey a particular meaning; a rose by any other name would indeed smell as sweet. But could Romeo in fact be called Fred? Juliet Gertie? Or Gertrude Tracey? Partly the naturalness of the names is due to our long familiarity with the characters these portray. What could Gussie Finknottle be called instead of Gussie Finknottle? Anna Karenina? James Bond? Irma Prunesqualor? John F. Kennedy? Why do rose-growers prefer to name roses Ingrid Bergman or Queen Elizabeth rather than Olive Oyl?

However, the sounds in a handful of words do fit their meanings better than chance would suggest. Take the following two objects:

Object A  Object B

Fig.1 Names for arbitrary shapes

Suppose that in an unknown language one of these objects is called a pling, the other a plung: which is which? 71% of the Inside Language panel thought Object A was a plung, only 25% calling it a pling (Q4). Something about the vowel ‘u’ conveys a sense of ‘large and dark’—plung: something about the vowel ‘i’ conveys a sense of ‘small and light’—pling.

Several English words with low vowels like ‘u’ or ‘a’ indeed suggest something ‘large’ or ‘dark’: huge, large, vast, enormous, humongous. Others with ‘high’ vowels such as ‘i’ mean something ‘small’ or ‘light’: little, teeny, wee, titchy, mini, pygmy, Lilliput, thin, itsy-bitsy teeny-weeny polka-dot bikini. In French ‘small’ is petit, ‘big’ is grand; in Greek mikro and megalo; in Spanish chico and gordo; in Dyirbal, an Australian Aboriginal language, /midi/ and /bulgan/; and in Mandarin Chinese xiǎo and dà.

John Ohala has explained this phenomenon through the ‘Frequency Code’. Vowels with high frequencies such as ‘i’ go with small size and sharpness, as in pling. Vowels with low frequencies such as ‘u’ go with large size and softness, as in plung. The Frequency Code hypothesis claims that low sounds in general, not just vowels, go with aggressiveness and assertion of power; Margaret Thatcher is believed to have had speech lessons to deepen her voice. The Frequency Code applies across languages, and indeed across species; dogs threaten with a low-pitched growl, but submit with a high-pitched yelp. However, in practical terms, the Frequency Code difference between high and low vowels works for only a handful of words in each language. It is after all contradicted by the vowels in the very words big and small themselves!

Not only individual sounds but also certain combinations of sounds tend to go together with certain meanings. The ‘sn’ combination in English often suggests something to do with breathing noises and the nose: sniff, sneeze, snout, snot, sneer, snuff, snorkel, snooze, snore, snicker, snivel, snort, snuffle. A person who goes round with their nose in the air might be snooty, snub people, and a bit of a snob. Other combinations in English that suggest particular associations are:

A group of ‘ion’ words is popular in the rhymes of reggae songs, revolution, generation, appreciation, consideration, nation, though it is not clear why this sound combination should be attractive. Indeed academic speakers often share this preference; a recent talk emphasised ‘marketisation’, ‘negotiation’ and ‘theorization’.

Some idea of which sounds English speakers consider pleasant and unpleasant can be obtained from the names that science-fiction writers invent for aliens. Classical hostile aliens have short aggressive names like Kryptons, Vatch, Rull, Glotch or Perks. Neutral-seeming or friendly aliens have polysyllabic names like Alaree, animaloids, Osnomians, Voltiscians and Eladeldi.

Nevertheless, despite the evocative role of sounds in poetry, or indeed television advertising, direct or indirect links between sound and meaning are rare, as those who try to learn another language soon discover. Given the Persian words mard and zan, can you tell which means ‘man’ and which ‘woman’? In fact mard is ‘man’, zan ‘woman’. Given the Japanese words minikui and kirei, which means ‘ugly’ and which ‘beautiful’? In fact kirei is ‘beautiful’. The vast majority of the sounds of language have no necessary connection with the meaning of the words in which they are used.

Animal Noises

One of the areas in which speech sounds might be expected to be closest to their meaning is when they are linked to actual noises. Here is a sample of familiar domestic and barnyard noises as portrayed in different languages:

          cats       dogs       sheep      cows     roosters
English   meow       woof woof  baa        moo      cockadoodloo
Japanese  nyanya     oue-oue    mee        moo      kokekokou
Persian   meyu       wag-wag    ba? ba?    mâ mâ    gogoligogo
Hokkien   meow       wo-wo      meeehh     moo      kok-kok
Thai      meow       bog-bog    bae        mor      ek-i-ek-ek
Greek     yiau       jav jav    bee        muu      cucuricuu
Spanish   miau       guau-guau  mee        muu      quíquiriqúi
Dholuo    ywak       guu guu    meee       mboo     kokorioko
French    miaou      ouah ouah  bêêêh      meuh     cocorico
Korean    yow-ong    mong mong  meh-eh-eh  um-meh   cork-eeyo
German    miau-miau  wau-wau    baa-baa    muh-muh  kikeriki

 

Note: as these are written in the Roman alphabet rather than a phonetic script, they do not represent the spoken sounds adequately, only the written forms.
Source: mostly Essex University students and staff

2. Intonation and its function

[For examples of tones see Youtube]

However, despite the emotional overtones conveyed by a handful of speech sounds, the aspect of sound that is most associated with emotion is the intonation pattern—the way that the pitch of the voice rises or falls. Most people probably do not even consider intonation to be part of the sound system of language; it seems the natural way of speaking and conveying one’s attitudes. Yet this resource is used in very different ways in different languages, which can cause serious misunderstandings between their speakers. This section looks at part of the intonation system of English, comparing it with the system in languages like Chinese.

The nuclear tones of English

The main element in intonation is the chief change of pitch, called the ‘nuclear tone’. Let us start with the nuclear tones of RP English, using monosyllabic words. One possibility is for the pitch to start at a high level within the speaker’s normal range and to fall to a low level over one or more syllables.

This is called a ‘high-fall’ tone. A high-fall tone often makes the speaker sound interested and involved in what he or she is saying:
A: Would you like a coffee?
B: Yes!

Alternatively the pitch may rise from the bottom to the top of the speaker’s range—a high-rise tone.

This might be an incredulous repetition:
A: Well he proposed and I said yes.
B: Yes?

There is also a pair of rise and fall tones in the lower part of the speaker’s pitch range. A low-fall starts from the middle of the range and falls to low, as in yes:

This often sounds definite and serious:
A: Am I all right, doctor?
B: Yes.

A low-rise on the other hand starts low and rises to the middle of the speaker’s range, as in yes:

This sounds cool and perhaps indifferent:
A: Can you help me?
B: Yes?

English has two more tones that change pitch direction rather than going continuously up or down. One is the sceptical-sounding fall-rise.

       y    s
          e

A: Do you agree?
B: Yes. There may be

The other is the enthusiastic-sounding rise-fall:

    e

y       s

A: Do you want to go to the Bahamas for Christmas?
B: Yes!

Finally there is a level mid or high tone, which occurs less frequently than the others and is typically used for calling people, as in –cooee.

          c o o e e

This might be a mother calling a child from the garden.

Needless to say, a full account of English needs many other features of intonation than these seven tones. In particular it needs to describe the system for choosing which syllable to put the tone on, such as the difference between SUSAN liked Joan, Susan LIKED Joan, and Susan liked JOAN (capitals marking the word that carries the tone). Nor is there agreement that seven tones is the actual number required, some linguists reducing the seven tones to a two-way contrast between rise and fall, some elaborating them with additional patterns, such as the fall-rise-fall or the mid fall.

The ways in which speakers use these tones are even harder to pin down. A particular tone often goes with a particular grammatical type of sentence. Jones with a high-fall is an answer to a question:
A: Who’s that?
B: Jones.
Jones with a low-rise is a polite ‘checking’ question:
A: Jones? You’re next.

Sometimes, however, the tone forces the other person to reply in a particular way, as in ‘tag’ questions such as did he? or aren’t you? added to sentences in conversation. You’re Peter, aren't you? has a high-fall on the tag aren’t you which invites the person addressed to agree. You’re Peter, aren’t you? with a high-rise on the tag aren’t you? leaves it open to them to agree or disagree.

Falling tones often suggest that what the person is talking about is accomplished fact. Consequently, in most languages, falls tend to be used at the end of the sentence. Rising tones suggest on the other hand that what the speaker is saying is still open to debate and so they are less common towards the end of the sentence. English often shows continuity, i.e. that the sentence is not ending, by using a low rise for phrases or clauses within the sentence and a high fall to finish the sentence off:

On Tuesday I went to London

has a low rise on Tuesday at the end of the phrase but a high fall on London at the end of the sentence.

In many languages, rises are associated with questions, that is to say with uncertainty; indeed the chief way of making a question in Greek and Portuguese is to use a statement with rising intonation. The almost universal tendency in human languages to regard falling tones as final and rising tones as tentative has been linked to the Frequency Code. In one experiment 92% of listeners thought a steep final fall was more dominant. The moral is that speakers who want to sound authoritative should not use many rises or an overall high pitch.

In RP English, statements do not usually have rising tones. It is almost inconceivable to introduce oneself with a rising tone - My name’s Peter rather than a falling tone My name’s Peter - as the rise would suggest doubts about one’s very existence. Other accents of English, such as Belfast, Tyneside and New Zealand, however, use rises for factual statements without overtones of doubt. Indeed such rises with statements have started to be used by teenagers in Australia and the USA, and by some twenty-year-olds in London.

Often, however, English intonation has more to do with the speaker’s attitude to what they are saying than with the sentence type. The explanations of the different intonation patterns of yes given above took advantage of this feature by ascribing particular emotional qualities to tones—interested high-fall yes, incredulous high-rise yes, and so on. These emotional labels work reasonably well with some aspects of English intonation; yet clearly there is far more to expressing emotions than the choice of tone.

These patterns can be regularly heard in news bulletins. A typical news bulletin on local radio, for example, had mostly falling tones, reflecting the factual nature of the news, ranging from the excited high fall Four people died! to the matter-of-fact low fall Tonight’s top stories. There were also examples of the rise-fall used for enthusiasm, Hicks made an unbeaten century, and a succession of fall-rises used to show warning or emphasis, Rush hour roads the latest on the fires raging around Essex for the third day running. The low rise also occurred several times in the middle of the sentence, to link the parts together. Two of the above examples were preceded by low rises: England are bowled out for four hundred and forty but Hicks made an unbeaten century (low rise on forty, rise-fall on century); Investigators investigated a light airplane crash in Hampshire in which four people died (low rise on Hampshire, high fall on four). Even if speakers of English are unconscious of it, the choice of intonation is crucial to the meaning they convey to others.

Intonation across languages

When learning another language, the emotional overtones of intonation often create difficulties. Speakers use intonation without thinking and it seems to them the natural way of expressing their emotions. Though there are similarities across languages such as the rise/fall distinction, each language has an intonation system of its own. Certain intonation patterns are dangerous in that they convey something different in another language.

A person learning a second language may produce a perfectly plausible intonation for the new language but convey the wrong emotion. An overseas student used to say Good bye to me with a dramatic high-fall on bye suggesting that she had been mortally offended. Similarly thank you with a low rise may suggest that L2 speakers are casual and offhand when it is simply the normal polite intonation for saying ‘thank you’ in their own language.

While there are many other attributes to national stereotypes, intonation certainly makes a contribution. To the English, for example, Germans are serious and Italians are excitable, reflecting how they would react to an English person who spoke with their characteristic intonation patterns: high pitch by Italians, low falls by Germans. An English newspaper finds, for instance, that Arnold Schwarzenegger speaks in a ‘flat Austrian monotone’. In these cases the listener is interpreting the intonation as conveying the emotion that an English speaker would intend rather than as the transfer of features from a different intonation system. Unfortunately native speakers of a language do not interpret a mistake in intonation as a foreigner’s mistake but regard it as a deliberate attempt to convey a particular emotion. Despite having taught intonation, I only realised that the student’s Good bye with an emphatic high fall on bye was an intonation mistake when she said it every week.

For the most part, English intonation adds an overtone to a word rather than forming an integral part of it. High fall Yes is yes plus polite interest; Jones is Jones plus polite query; low rise aren’t you is aren’t you plus demand for agreement. The meaning of the word itself does not change with the intonation pattern; yes is still yes whether it has a high-rise or a low-fall.

The function of intonation in ‘tone’ languages like Yoruba or Chinese is very different. In the Songjiang dialect of Chinese, for example, the syllable di can be pronounced with three different tones. di with a high-fall means ‘low’; di with a level tone means ‘bottom’; di with a high rise means ‘emperor’. The same sequence of sounds is three distinct words, depending on the tone used. The Chinese tone is an integral part of the word which distinguishes it from other words, rather than something that is tacked on to give a grammatical or emotional extra.
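
The contrast between the two uses of tone can be sketched as two lookup tables. This is only a minimal illustration in Python, not a linguistic model: the names SONGJIANG_DI, ENGLISH_OVERTONES, songjiang_meaning and english_reading are hypothetical, and the tables hold nothing beyond the three Songjiang examples and the emotional labels attached to yes earlier in this section.

```python
# Illustrative sketch only: all names here are assumptions made for this example.

# In a tone language the tone is part of the word's identity:
# syllable and tone together pick out a distinct word.
SONGJIANG_DI = {
    ("di", "high-fall"): "low",
    ("di", "level"): "bottom",
    ("di", "high-rise"): "emperor",
}

# In English the tone adds an overtone; the word itself stays the same.
ENGLISH_OVERTONES = {
    "high-fall": "interested",
    "high-rise": "incredulous",
    "low-fall": "definite and serious",
    "low-rise": "cool, perhaps indifferent",
}


def songjiang_meaning(syllable: str, tone: str) -> str:
    """Return the word selected by a syllable together with its tone."""
    return SONGJIANG_DI[(syllable, tone)]


def english_reading(word: str, tone: str) -> tuple[str, str]:
    """Return the unchanged word plus the overtone that the tone adds."""
    return word, ENGLISH_OVERTONES[tone]


if __name__ == "__main__":
    for tone in ("high-fall", "level", "high-rise"):
        print("di", tone, "->", songjiang_meaning("di", tone))
    for tone in ("high-fall", "low-fall"):
        print(english_reading("yes", tone))
```

In the first table changing the tone changes the dictionary key and so the word; in the second the word passes through untouched and only the accompanying label changes.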

The fact that intonation in tone languages like Chinese shows differences between words rather than emotions and attitudes inevitably poses problems when its speakers come to learn a language such as English. Chinese learners may have difficulty in using intonation to convey emotional overtones in English and hence their English may sound unvarying in emotion. Going the other way, English speakers may easily say the wrong Chinese word if they use the wrong tone: a person who went to a shop and asked for li zi (rising tone) would get pears; li zi (fall rise) would get plums, and li zi (falling) chestnuts. [Note some Chinese students tell me this example doesn't work, though it was supplied by a Chinese teacher]

3. Producing the sounds of speech

Sorry the symbols don't work well in this section.

This section outlines the complex ways in which speech sounds are made, starting with vowels and going on to consonants. As some parts are unavoidably dense, it can partly be used as a reference source for later chapters rather than read straight through.

Even in languages with a writing system based on sounds, the letters of the alphabet reflect the sounds of speech rather poorly. English has a single ‘th’ spelling in thing and then but two different ‘th’ sounds. It is therefore necessary to use a phonetic script that reflects the different sounds of speech more accurately than the usual written forms. So the English word thing is transcribed with a phonetic symbol for the ‘th’ sound /θ/, plus one for the vowel /ɪ/, and one for the final /ŋ/, i.e. /θɪŋ/. The word then, however, starts with the other ‘th’ sound /ð/, followed by an /e/, followed by a different nasal /n/, i.e. /ðen/.

The symbols of phonetic transcription are usually distinguished from normal writing by being put within slanted lines or square brackets: /ki:/ versus key or /men/ versus men. Since the reference point here is mostly English, the phonetic symbols are those conventionally used by linguists for describing English, found with slight variation in most dictionaries and guides to English pronunciation. One minor difficulty is that British and American linguists have alternative symbols for some of the sounds of English, for example /a/ versus /ɒ/ for the vowel of hot.

The symbols for English ultimately relate to the phonetic alphabet laid down by the International Phonetic Association (IPA), which was devised a hundred years ago to provide a means of writing down the sounds of all languages in a consistent fashion and has been revised many times since. The figures for the languages of the world in this chapter are based on those calculated by Ian Maddieson from a sample of 317 representative languages in the UCLA Phonological Segment Inventory Database (UPSID). The percentages for particular features refer to this sample rather than to all the languages known to exist.

The major division of speech sounds is between the ways in which vowels and consonants are produced. In vowel sounds, the air comes out of the mouth without any obstruction from the tongue, lips, etc - the /a:/ in spa or the /i:/ in tea. In consonant sounds, the smooth air-flow through the mouth is obstructed in some way by the tongue or lips - the /t/ of tie or the /f/ and /ʃ/ of fish. This definition of vowels as unobstructed and consonants as obstructed works for most speech sounds, though it leaves a few doubtful cases, such as the /w/ in win which is produced like a vowel without obstruction, but acts like a consonant in occurring at the beginning of the syllable rather than in the middle.

A Vowels

As well as having smoothly-flowing air, a vowel involves ‘voice’ produced by the ‘vocal cords’ in the throat. These are flaps in the larynx which open and close rapidly during speech to let out puffs of air, producing a basic vibrating noise, in much the same way that a saxophone reed is vibrated by blowing through it. How fast the vocal cords vibrate affects the pitch of the sound; the individual vibrations can be felt in very slow speech. This sub-section deals with ‘pure’ vowels which have a continuous sound; ‘diphthongs’ in which the sound changes are described in later sections.

The two dimensions of tongue position

The sound produced by vibration is modified by the size and shape of the air spaces through which it then passes. A baritone saxophone produces a deeper note than a soprano saxophone because its internal air space is far larger. The characteristic sound can be changed within limits by making temporary adjustments to the permanent air space; saxophone players alter the length of the air space inside the saxophone tube with their fingering to change the note that comes out.

In the same way, speaking modifies the space inside the mouth by altering the position of the tongue. When the tongue is towards the front of the mouth, the empty space takes a particular overall shape, thus affecting the sound. The sounds produced with the tongue towards the front of the mouth are called ‘front’ vowels, for instance /e/ (men) and /æ/ (man). When the tongue is moved towards the back of the mouth, a different shaped air space is created, producing ‘back’ vowels, such as /u:/ (loot) or /ɒ/ (lot). In between the front and back positions of the tongue come ‘central’ sounds, such as the /ɜ:/ vowel (bird). All vowels vary in a dimension from the front to the back of the mouth.

The ‘height’ of the tongue also affects the air space in the mouth. The /ɪ/ in sit, the /e/ in set, and the /æ/ in sat are all front sounds but differ in height. When the tongue is raised towards the roof of the mouth, a ‘high’ vowel is produced like /ɪ/ (sit) and /u:/ (room). When the tongue is lowered towards the bottom of the mouth, ‘low’ vowels are produced such as /æ/ (sat). In between come ‘mid’ vowels such as /e/ (Ben) or /ɜ:/ (firm). Much of the variation in vowels amounts to changes in the position of the tongue in the two dimensions from front to back and from high to low.

The space inside the mouth can be represented abstractly as a box with two dimensions. To describe a vowel means deciding on the highest point of the tongue as specified by these two dimensions: /u:/ (moon) is a back high vowel as the tongue is highest at the back; /æ/ (cat) is a front low vowel because the highest point of the tongue is at the front and slightly above the bottom; /ɜ:/ (fur) is a mid central vowel because the tongue is in the middle in both dimensions; and so on. It is not just the maximum height that affects the sound but the whole modification of the air space within the mouth that this entails. While the point of maximum height is a convenient reference point for specifying any vowel, this specifies the shape of the whole mouth cavity rather than itself being crucial.

All vowels can be located somewhere within this two-dimensional space.
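
The two-dimensional description lends itself to a small data structure. The following Python sketch is only an illustration under stated assumptions: the names Vowel, RP_EXAMPLES, describe and vowels_at are hypothetical, the symbols are the conventional IPA ones for the example words, and only vowels whose position is stated in the text are included.

```python
from dataclasses import dataclass

# Illustrative sketch: the three-way divisions follow the text; all names are assumptions.
HEIGHTS = ("high", "mid", "low")
BACKNESS = ("front", "central", "back")


@dataclass(frozen=True)
class Vowel:
    symbol: str    # conventional IPA symbol, with ':' marking length
    example: str   # English word containing the vowel
    backness: str  # front, central or back
    height: str    # high, mid or low

    def __post_init__(self):
        # Reject positions that fall outside the 3 x 3 grid.
        if self.backness not in BACKNESS or self.height not in HEIGHTS:
            raise ValueError(f"unknown position: {self.backness} {self.height}")


# Only vowels whose position is given in the text.
RP_EXAMPLES = [
    Vowel("u:", "moon", "back", "high"),
    Vowel("ɪ", "sit", "front", "high"),
    Vowel("e", "set", "front", "mid"),
    Vowel("æ", "cat", "front", "low"),
    Vowel("ɜ:", "fur", "central", "mid"),
]


def describe(vowel: Vowel) -> str:
    """Spell out a vowel's position in the two dimensions."""
    return f"/{vowel.symbol}/ ({vowel.example}) is a {vowel.backness} {vowel.height} vowel"


def vowels_at(backness: str, height: str) -> list[str]:
    """List the example vowels found at one cell of the grid."""
    return [v.symbol for v in RP_EXAMPLES if (v.backness, v.height) == (backness, height)]


if __name__ == "__main__":
    for v in RP_EXAMPLES:
        print(describe(v))
    print(vowels_at("front", "low"))
```

A coarse grid like this only records which of the nine cells a vowel falls in; the Cardinal vowels described next give finer reference points.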

Often, to make the description easier, the two dimensions are each divided into three areas and the front edge of the box is slanted to correspond better to the shape of the front of the mouth:

To make the locations within this space precise, the different points on the perimeter are assigned ‘Cardinal vowels’, rather like the points of the compass. The Cardinal vowels are theoretical rather than having an actual existence in any language. The most extreme front and close (high) vowel that the human mouth could possibly produce is Cardinal [i], the most extreme back and close (and rounded) vowel Cardinal [u], the extreme front and open is Cardinal [a], and the extreme back and open is Cardinal [ɑ]. Other cardinal vowels are provided for each of the reference points on the perimeter and for rounded versus unrounded vowels. Thus any vowel in any language can be located with reference to this grid. English /i:/ bee for example is near to Cardinal [i] while English /ɒ/ dog is at the back, a fraction above Cardinal [ɑ]. The following diagram also gives the approximate positions of the RP /u:/ (moon), and /æ/ (pat). The RP ‘pure’ vowels are given in full in the diagram in the box.



This diagram is as abstract as representing the solar system as rings round the Sun.

A language needs to make its different vowels contrast by spreading them around these two dimensions. The greatest contrast is between vowels that are as close as possible to the opposite corners of the vowel space: as far back or front as possible, and as high or low as possible. The minimal vowel system for a language takes advantage of these contrasts by having three vowels, one high front vowel /i/, one high back vowel /u/, and one central low vowel /a/ (closest perhaps to the starting point of the English diphthongs /ai/ buy and /aʊ/ cow), which can be shown in a triangular shape, with the two-dimensional box now understood rather than drawn:

      i      u

         a

Counting ‘pure’ vowels rather than diphthongs, eighteen languages in UPSID have only three vowels, mostly conforming to this triangular pattern of /i/, /u/, and /a/, for example Arabic, Greenland Inuit, and Dyirbal, an Australian Aboriginal language. The vast majority of languages incorporate these three vowels within their sound system; 91.5% have an /i/ sound, 88% an /a/, and 83.9% a /u/. People who were asked to distinguish /i/ from /u/ in an experiment only made two mistakes out of ten thousand attempts, suggesting that these two sounds are indeed as different as they could possibly be.

Going from the minimum number of vowels to the maximum, some languages have up to 24 vowels; thirteen languages in UPSID have 17 or more. The most common number of vowels in a language is in fact five. Greek, like many languages, gets five by adding two mid vowels to the three-vowel triangle: /i/ zari (dice), /u/ uranós (sky), /a/ agapi (pure love), /e/ erhome (come), /o/ violi (violin).

i         u
e        o
    a

Other languages with a five vowel system of similar types are Japanese, Spanish, Zulu, and Basque.

Lip shape in vowels

Changing the shape of the lips is another way of modifying the sound that comes out. English front vowels like /i:/ (see) are made with unrounded lips, while back vowels like /u:/ (ooze) require the lips to be rounded. Though there is no logical reason why back vowels should be rounded and front vowels should not, front vowels are in fact unrounded in 94% of languages, back vowels rounded in 93.5%. Even by the age of four months, babies are able to tell that /i/ requires spread lips, and /u/ rounded lips.

Two familiar exceptions are the rounded front vowels in the /y/ vowel of French rue (street) and the German vowel /y:/ of Hüte (hats). It is an odd coincidence that French and German, from different branches of European languages, both have this rare feature found in only 6% of languages, at least one excuse for the difficulty they pose for English students.

Length

The sounds of speech also differ in terms of how long they take to say: their ‘length’. A long sound is indicated in phonetic script by a following colon ":". In a few languages consonants differ in terms of length. For example in Slovak vrch /vr̩x/ (summit) has a short /r/ but vŕšok /vr:ʃok/ (hill) has a long /r:/. In Finnish short /l/ as in /ʋeli/ (brother) differs from long /l:/ as in /ʋel:i/ (gruel).

More commonly, languages use length to distinguish vowels. The /i:/ of bean is long while the /ɪ/ of bin is short; the /u:/ of moon is long but the /ʊ/ of wood short. Length effectively doubles the number of potential vowels by having pairs of long and short at a particular point in the vowel space. A five vowel triangular system becomes a ten vowel system by using length as an additional factor, for example in Hawaiian.
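
The doubling is simple arithmetic, which a short Python sketch can make concrete. The names FIVE_VOWELS and with_length are hypothetical, and the five qualities are simply those of the triangular five-vowel pattern shown earlier, not a transcription of any particular language.

```python
from itertools import product

# Illustrative sketch: a generic five-vowel triangle, not any specific language.
FIVE_VOWELS = ("i", "e", "a", "o", "u")


def with_length(qualities):
    """Pair every vowel quality with a short and a long version (long marked ':')."""
    return [quality + mark for quality, mark in product(qualities, ("", ":"))]


if __name__ == "__main__":
    system = with_length(FIVE_VOWELS)
    print(len(FIVE_VOWELS), "qualities give", len(system), "vowels")
    print(" ".join(f"/{v}/" for v in system))
```

Each extra two-way factor of this kind multiplies the inventory by two without requiring any new positions in the vowel space.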

Other factors are often tied in with length. In the long /i:/ of English beat, the tongue position is also slightly higher and fronter than in its short counterpart, the /ɪ/ of bit, and the muscles of the lips and tongue are slightly more tense. Similar slight differences are found in other long/short pairs such as the long tense /ɔ:/ in dawn versus the short relaxed /ɒ/ in don, one reason why different symbols are used for the two vowels /i:/ beat and /ɪ/ bit as well as the length marker ":". That is to say, long vowels tend also to be said more tensely than short vowels, and to have slightly different tongue positions.

The vowels that have been described so far are all stationary in that the tongue keeps more or less the same position however long they are said. Technically the name for this is a ‘pure’ vowel. Some languages have more pure vowels, some fewer. Korean has 18 vowels, 16 of which are pairs of long and short; !Xu, a southern African language spoken in the Kalahari desert region, has no fewer than 24 pure vowels. RP has eleven or twelve pure vowels, depending on whether the long vowel of bird is counted as a different vowel from the short /ə/ of asleep, known as ‘schwa’, which only occurs in unstressed syllables. Indeed in some transcription systems the vowel of bird is given a separate symbol, /ɜ:/.

Diphthongs

Diphthongs are a type of vowel in which the tongue moves from one vowel position to another while the vowel is being produced. The vowel sound is not the same at the beginning as at the end. The method of describing diphthongs is to state their starting point and the destination towards which they move (but do not necessarily reach).

In the English /ɔɪ/ of toy the starting point of the tongue is the back mid position, the destination towards the front high /ɪ/ position. In the English /əʊ/ of go, however, the tongue starts centrally and moves back and up towards the /ʊ/ position.

Because diphthongs involve movement, it is impossible to produce them continuously; the listener ends up hearing only the second vowel. RP has seven or eight diphthongs, depending on whether a speaker pronounces words like poor with a diphthong /ʊə/ or with the same /ɔ:/ sound as in paw.

English Diphthongs
moving towards front high: lane, line, loin
moving towards back high: cone, cow
moving towards central: beer, bear, sure (in some people’s speech)

While the figures for diphthongs in the world’s languages are not very certain, the commonest seem to be /ei/ and /au/, the rarest /ɔi/. The language with the highest number is !Xu with no fewer than 22.

Pure vowels and diphthongs are two varieties of vowels that differ only in whether the tongue moves. Despite its overtones in ordinary language, the word ‘pure’ is in phonetics a technical term for a continuous sound made without the tongue moving and without other obstruction. Indeed some ‘posh’ British accents tend to turn pure vowels into diphthongs; bed is /beəd/, with a suggestion of an er /ə/ coming in, rather than /bed/, bad is /baəd/, and so on.

English Vowels around the World

Here are the vowels of RP English located within the Cardinal Vowels diagram. The vowels outside the figure are the Cardinal Vowels themselves.

The following chart comparing the RP vowels with other accents of English is based on John Wells, Accents of English. It gives the pronunciation for various test words having ‘pure’ vowels in the different accents; it disguises many differences in pronunciation, particularly the effects of /r/, to be discussed in Chapter Seven.

2. Consonants

Consonants differ from vowels because the lips or tongue disturb the stream of air rather than letting it flow out smoothly. Since they are produced by obstructing the air, this section describes where the obstruction is (lips, teeth, and so on), what forms the obstruction (tongue, lips, etc.), and the manner in which it is made. In common with vowels, consonants may, but do not have to, use voice from the vocal cords and may be said with tense or relaxed muscles.

 

Plosive consonants

One method of producing a consonant is to interrupt the flow of air from the mouth by blocking it for a brief moment. Consonants such as /b/ in brain and /k/ in crane are known as ‘stop’ or ‘plosive’ consonants because the flow of air is stopped and then ‘explodes’ abruptly out from behind the obstruction. English plosive consonants block the air at three different places, shown in the following diagram of the mouth.

Fig. 2. Reference points for the production of consonants

When the air is temporarily blocked by both lips, the consonants produced are the voiced /b/ (lab) or the voiceless /p/ (lap). When it is the tip of the tongue that blocks the air by contacting the ridge behind the teeth (the ‘alveolar ridge’), the consonants are the voiced /d/ (dime) or voiceless /t/ (time). Placing the back of the tongue against the back of the roof of the mouth (the ‘soft palate’ or ‘velum’) produces the voiced sound /g/ (get) or the voiceless /k/ (kid).

The six English plosives come in three pairs of voiced and voiceless: /t/ /d/, /k/ /g/, /p/ /b/. The voiceless member of the pair is usually said more energetically than the voiced member. So the /b/ in Bart is not only voiced compared to the /p/ in part but is also said with less energy. The English plosives use three points of contact: the lips, the alveolar ridge behind the teeth, and the soft palate. According to UPSID, all languages have at least two of these contacts for plosives and 98.4% of languages have all three. A few languages use other contact points; Tamil has a dental plosive using the teeth rather than the more usual teeth ridge, Arabic a uvular plosive /q/ at the far back of the mouth as in qal /qa:l/ (he said).

Fricatives and other consonants

The second major method of producing consonants is to let the air escape through a narrow gap rather than blocking it completely, giving ‘fricative’ sounds. English fricatives involve three of the same places of contact, the lips and teeth /v/ (live) and /f/ (life), the back of the teeth /ð/ (this) and /θ/ (thick), and the teeth ridge /z/ (rise) and /s/ (sip). In addition the fricatives /ʒ/ (garage) and /ʃ/ (fish) use the teeth ridge, but differ from /z/ and /s/ in that the tongue is further back and lets the air escape over a larger area. Because of their distinctive hissing noise, these four are sometimes grouped together as ‘sibilants’.

While plosives and fricatives form the two major groups of consonants, this does not exhaust the possibilities. Nasal consonants, for example, require the mouth to be blocked and the flexible ‘soft palate’ at the back of the mouth to be lowered to force air out through the nose. Nasal consonants differ according to where the air is held up in the mouth, again the three positions of the lips /m/ (sum), the alveolar ridge /n/ (sun), and the soft palate /ŋ/ (sung). Most languages have the same three nasals, for example German. Some, such as Greek and Turkish, have only /m/ and /n/; a few, such as Inuit, have a uvular nasal /ɴ/ in which the tongue blocks the mouth right at the back; some, such as Irish, have an additional point of contact on the palate halfway back on the roof of the mouth, making a palatal nasal /ɲ/. While nasals are usually voiced, as in English, they may also be voiceless, as in Burmese.

Vowels too can be ‘nasalised’ if some air escapes through the nose at the same time as through the mouth. In Bengali, the triangular seven-vowel system is doubled at each position by a nasalised counterpart. Nasalisation occurs occasionally in isolated words of English, such as the final nasalised vowel in my own pronunciation of restaurant. In French such vowels are far more frequent. Syllable-final /n/ was lost in many French words but replaced by nasalisation of the final vowel of the syllable—fin (end), son (his), rien (nothing).

So it is not just the English plosives that make use of the same four points but also the fricatives and nasals. All the points of contact have two or more of the sounds that are possible, divided into pairs of voiced and voiceless, apart from the nasals.

Setting out the English sounds in four columns, however, shows up some gaps. For example, there is no pair of voiced and voiceless velar fricatives at the soft palate to parallel those at the other three places. German fills this gap with the fricative sound /x/ in Tuch (towel).
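The search for gaps can be sketched as a lookup over a small chart. The layout below is only an assumed reconstruction of the four-column chart from the sounds named in the text, since the figure itself is not reproduced here; the names PLACES, ENGLISH and gaps are invented for the illustration.

```python
# Assumed reconstruction of the four-column chart from the sounds named in the text.
PLACES = ["lips", "teeth", "teeth ridge", "soft palate"]

ENGLISH = {
    "plosive":   {"lips": ["p", "b"], "teeth ridge": ["t", "d"], "soft palate": ["k", "g"]},
    "fricative": {"lips": ["f", "v"], "teeth": ["θ", "ð"], "teeth ridge": ["s", "z", "ʃ", "ʒ"]},
    "nasal":     {"lips": ["m"], "teeth ridge": ["n"], "soft palate": ["ŋ"]},
}

def gaps(chart, places=PLACES):
    """Return the (manner, place) cells of the chart that hold no sound."""
    return [(manner, place)
            for manner, row in chart.items()
            for place in places
            if not row.get(place)]

for manner, place in gaps(ENGLISH):
    print(f"no {manner} at the {place}")
```

On this reading the empty cells include the velar fricative slot that German fills with /x/, and the dental plosive slot that Tamil fills.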

The four columns of the diagram conceal further possible sounds that do not happen to occur in English. English /f/ and /v/ involve the bottom lip and the upper teeth (labiodental) rather than both lips together (bilabial). The English fricatives /f/ and /v/ do not match the lip plosives /p/ and /b/ in the same way that the fricatives /s/ and /z/ match the plosives /t/ and /d/. So, not surprisingly, some languages have bilabial fricatives. Greek for instance has a voiced fricative /β/ involving both lips in words like biblio (book).

The other possibility concealed by the English chart is the existence of a fifth column for sounds made on the roof of the mouth itself, the ‘palate’. Many languages have such palatal consonants. Greek has a palatal fricative /ç/ in words like /stiço/ (ghost). French has a palatal nasal /ɲ/ in /aɲo/ agneau (lamb). Irish has both a fricative palatal /ç/ in /içə/ oíche (night) and a nasal palatal in words such as /aɲir/ Ainir (maiden).

The missing fricatives occasionally occur in isolated words of English, either in words like loch or in foreign words such as Bach. Sometimes they occur as an accidental combination of consonants in words like huge or Hume where the /h/ and the /j/ can coalesce as a single palatal fricative /ç/. English has not always lacked these fricatives. Up to around 1400 AD, there was a velar fricative /x/ in words such as night and through. The "gh" letter-combination that stood for this sound has posed a problem for English spelling ever since /x/ was lost from the spoken language.

A sixth column is also required for sounds that are made by tongue contact even further back in the mouth, ‘uvular’ sounds. An example is the voiceless plosive /q/ found in Arabic qal /qa:l/ (he said).

The English four-column layout also implies that dental fricatives like the English /ð/ and /θ/ in this and third are typical. In fact this pair are found in comparatively few languages, unvoiced /θ/ in 18% and voiced /ð/ in 21% of the UPSID sample. Hence they pose a problem for nearly all overseas students of English.

As well as these main types of consonant, there are several minor types, illustrated in English by:

  • the ‘lateral’ /l/ of lust, made by keeping the tongue in contact with one side of the teeth ridge only; this is the only English case where the left/right third dimension matters for speech sounds. Most languages have a single lateral sound, like English; some have two, one at the alveolar ridge, one at the palate, for example Italian and Brazilian Portuguese. Japanese has no laterals, hence the confusion of /l/ and /r/, as in lubbish for rubbish, by Japanese speakers of English. A Japanese student at Essex University explained to his neighbours ‘I’m a Buddhist. Do you mind if I pray at the same time each day?’; a few weeks later they asked him why they never heard the sound of his flute if, as a flautist, he was playing it every day. Welsh has two types of /l/, one a voiceless fricative /ɬ/ as in pill (piece of poetry), the other a voiced lateral /l/ as in pel (ball).

  • the ‘affricates’ /tʃ/ (cheat) and /dʒ/ (just), which combine a plosive and a sibilant fricative, i.e. the /tʃ/ of cheat is a combination of the /t/ of tin with the /ʃ/ of shin.

  • the complex /r/ sounds in red and very. In RP these are often like vowels in that they involve little or no tongue contact. They are idiosyncratic compared to many languages of the world which have varieties of /r/ that are rolled and trilled, or that involve contact between the tongue and the roof or back of the mouth, such as the uvular /ʁ/ found in French rouge (red). The use of /r/ varies from one dialect of English to another, particularly between England and the United States, as we see in Chapter Seven. Hence, unlike most other phonetic symbols, /r/ refers to a ‘family’ of related sounds rather than uniquely referring to a single sound.

  • the peculiar fricative /h/ of hot, made by breathing out noisily before a vowel. The sound of /h/ then varies according to the vowel that follows it. Pronouncing words like hat with an /h/ as /hæt/ rather than as /æt/ is a marker of class in England, though not in most of the rest of the English-speaking world, again to be discussed in Chapter Seven.

  • the ‘approximants’ /w/ (wish) and /j/ (yet). These behave like consonants in coming at the beginning of syllables but are produced like vowels without contact.

  • the cough-like glottal stop [ʔ] produced by keeping the vocal cords shut for a moment, thus creating a plosive. In some accents of English a glottal stop is an alternative form of /t/ between vowels. In my own speech better is often /beʔə/ rather than /betə/, as it is for 27% of the Inside Language panel (Q8). The glottal stop functions as a fully-fledged sound in many languages rather than an alternative pronunciation, for example Hebrew and Burmese.

Putting all these together, English has 24 consonants, close to the average 22.8 for a language. The proportion of consonants to vowels is 1.27 to one, slightly low compared to the world average of 2.5 to one; that is to say, proportionately English has rather more vowels than the average.

4. Air and speech

So far this chapter has taken it for granted that the air for speech is produced by the lungs breathing out. In order for speech to be regular, this lung air has to come out at a fairly constant pressure, regulated by complex muscles in the diaphragm and the rib-cage. Otherwise speech would be high in pitch just after speakers breathe in and would tail away to low pitch as they go on; this effect can easily be seen if you try to read a long sentence aloud on one breath.

No languages seem to use indrawn breath for normal speech. There are, however, occasions when it is used to disguise the speaker’s voice. Suitors serenading their loves are said to use in-drawn air in some parts of the Philippines and in German-speaking Switzerland in order to preserve their anonymity. English has a minor non-speech use of indrawn breath in the sound made when one burns oneself accidentally.

The lungs are not the only source of moving air. Southern African languages such as Nama and Zulu use sounds called ‘clicks’ produced by the tongue sucking air into the mouth; they will be familiar to listeners to music from this region such as the Xhosa click songs of Miriam Makeba. English has some marginal non-speech clicks, for instance the giddyup noise made to horses or the tut-tut noise of disapproval.

Tongue Twisters

The point of a tongue-twister is to confuse the language system in the mind by repeating related sounds over and over again.
Mrs Pipple Popple popped a pebble in poor Polly Pepper's eye.
Charlie chooses cheese and cherries.
Old oily Ollie oils oily automobiles.
He ran from the Indies to the Andes in his undies.
Rubber baby buggy bumpers.
Shave a cedar shingle thin.
This thistle seems like that thistle.
Unique New York!
Miss Ruth’s red roof thatch.
Peggy Babcock.
Toy boat.
Any noise annoys an oyster but a noisy noise annoys an oyster most.
Tongue-twisters in different languages:
Tres tristes tigres trillaron trigo en un trigal. (Spanish: Three sad tigers threshed wheat in a wheat field)
Nama-mugi, nama-gome, nama-tamago. (Japanese: raw wheat, raw rice, raw eggs)
Le ver vert va vers le verre vert. (French: the green grub goes to the green glass)
Nie pieprz wieprza pieprzem. (Polish: do not pepper the hog with pepper)
Un limon, mezzo limon (Italian: one lemon, half a lemon)
Surrealistic aphorisms by Marcel Duchamp
Abominable fourrures abdominales. (abominable abdominal furs)
My niece is cold because my knees are cold.
Etrangler l’étranger. (strangle the stranger)
Examples invented for a competition by nine-year-old children in Ardleigh, a village in Essex:
Super-sonic sausages.
The stranger strangles Susey with some long stretchy string.
Tongue twisters give me blisters.
Bob and Bill brought bits.
My monkey mistakes my mum’s messy mixture for a monkey.
Trees with green leaves.
Clearly not all the children have understood how a tongue-twister works.

Main source: A. Schwartz, 1972

Timing of voice in consonants

Voice has been treated thus far in a simplified fashion as either ‘on’ or ‘off’: either the vocal cords vibrate, as in // (ship), or they do not, as in /t/ (tip). A crucial factor, however, is the moment at which voicing starts during the production of the consonant.

Take the voiced plosive /g/ in gate and its voiceless counterpart /k/ in Kate. The precise moment when voicing starts is called the ‘Voice Onset Time’ or VOT, timed in milliseconds (msec) relative to the actual moment of release of the air. Thus voice can start before the moment when the consonant is released, termed negative or -VOT, at the moment of release, 0 VOT, or after the moment of release, positive or +VOT. The Voice Onset Time for English /g/ varies from about -88 msec, that is 88 milliseconds before the stop is released, to -21 msec, nearly the same moment as the release. The sound /g/ is heard as voiced so long as the speaker starts voicing within these time limits.

In the voiceless English plosive /k/, voicing starts around +80 msec after the stop is released, that is to say, voicing does not start till a fraction of a second after the release, thus creating a characteristic puff of air after the consonant and before the next vowel starts, known as ‘aspiration’.

A stop with a delay in voicing of +80 msec will be heard as voiceless. An early VOT makes the listener hear a voiced plosive; a late VOT a voiceless plosive. Voice Onset Time varies between languages. A Spanish /g/ as in gato (cat) resembles an English /g/ in that voicing starts earlier than for /k/, with a Voice Onset Time of -108 msec. But the voiceless /k/ of queso (cheese) has a Voice Onset Time of +29 msec, close to the release, rather than much after it as with the English +80 msec.

 

Overlap of English /g/ and Spanish /k/
Consequently Spanish does not have the wide tolerance in VOT for /g/ allowed in English. An English person may take a Spanish /k/ to be a /g/; a Spanish person may take some English /g/s to be /k/s, as seen in the overlap in the following diagram.
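The cross-language mishearing can be illustrated with a small sketch. It uses the VOT figures quoted above, but the category boundary is purely an assumption made for the illustration: it is placed at the midpoint between each language's /g/ and /k/ values, which is not a claim made in the text. The names VOT, boundary and heard_as are likewise invented here.

```python
# VOT values in msec quoted in the text; -21 is the latest English /g/ mentioned.
VOT = {
    "English": {"g": -21, "k": 80},
    "Spanish": {"g": -108, "k": 29},
}

def boundary(language):
    """Hypothetical category boundary: midpoint between the /g/ and /k/ values."""
    values = VOT[language]
    return (values["g"] + values["k"]) / 2

def heard_as(vot_msec, listener_language):
    """Label a velar plosive as a listener of the given language might hear it."""
    return "k" if vot_msec > boundary(listener_language) else "g"

# A Spanish /k/ as an English listener might hear it, and the reverse case.
print(heard_as(VOT["Spanish"]["k"], "English"))
print(heard_as(VOT["English"]["g"], "Spanish"))
```

Under these assumed boundaries the Spanish /k/ falls on the /g/ side for the English listener, and a late English /g/ falls on the /k/ side for the Spanish listener.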

The two languages both use voice to distinguish sounds but they use it differently, just as a Hong Kong dollar is worth less than a Singapore dollar. The languages have settled on different ways of making the voice distinction. This voiceless burst of air is in a sense accidental in English. A /p/ before a vowel will have an aspirated puff of air after it, as in pit; a /p/ following /s/ may not, as in spit. Using too much aspiration or too little will not interfere with the meaning of the sentence.

However, in languages such as Hindi there are two different sounds, one with, one without aspiration: /pʰəl/ (fruit) and /pəl/ (moment). Potentially there is a three-way distinction between early VOT, 0 VOT, and late VOT, rather than the two-way choice of late and early VOT found in Spanish. Thai and Burmese also have three distinct sounds at the dental position.

The idea that speakers of a language divide up sounds into either/or distinctions leads to a general characteristic of languages: speakers perceive speech sounds as distinct categories rather than as continuous variation. A sound is either a /g/ or a /k/, never something in between. Experiments have tested how people perceive synthesised sounds that gradually increase in VOT. It is not that there are two areas within which people are certain of which sound is involved and a grey area in the middle where they are uncertain. Instead they are committed to one sound up to the particular point at which they switch to the other, even if they differ over the location of this point. Though VOT is a continuous scale, it has a cut-off point where the listener has an either/or choice. One characteristic of human beings seems to be that they cannot hear intermediate types of speech sounds but force them into one or other of the categories of the language. This ability is called ‘categorical perception’, that is to say, perceiving sounds as discrete ‘categories’ rather than as a continuous variation. Like a piano defining 85 notes from a vast range, the human mind perceives sounds as separate items.
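The either/or behaviour can be pictured as a step function applied to a continuous scale. The sketch below is only a toy model: the 30 msec cut-off and the 20 msec step size are assumed figures, and identify is a name invented for the illustration.

```python
def identify(vot_msec, boundary_msec=30):
    """Force a VOT value into one of two categories (the boundary is an assumed figure)."""
    return "k" if vot_msec >= boundary_msec else "g"

# A synthesised continuum rising in 20 msec steps from -90 to +90 msec.
continuum = range(-90, 91, 20)
labels = [identify(v) for v in continuum]

# Count how often the label changes between neighbouring steps.
switches = sum(a != b for a, b in zip(labels, labels[1:]))
print(labels)
print(switches)
```

However finely the continuum is divided, the labels change exactly once, at the cut-off point; no step receives an in-between label.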

5. Combining sounds into syllables

Suppose you had to decide on a name for a new washing powder. A computer produces a list of possible names: Mrah, Bliff, Bnill. Which would you choose?

Though each of the sounds in these words is English, only one of the words is in fact possible, namely Bliff. For this is the only one of the three that conforms to the rules for making English syllables. The differences between the words are not in the actual sounds, all of which are possible in English, but in how they are combined. The crucial element for combining sounds together is the syllable. This section looks at the ways in which syllables are constructed, which varies from one language to another. Syllables have centres, usually vowels; they have beginnings and endings, usually consonants. Sounds vary between the more vowel-like ones that occur in the centre of the syllable where the airstream is least obstructed, and the more consonant-like ones that occur at the beginning or end that have most obstruction, technically called the ‘sonority hierarchy’.

In English the centre of every syllable must be a vowel (V), including both pure vowels and diphthongs. So the minimal syllable is a vowel V on its own: the /ai/ of eye or the /ə/ of the article a in a book. The few exceptions without a vowel are syllables where the nasals and /l/ behave more like vowels than consonants by occupying complete syllables of their own, as in bottom /bɒtm̩/ or bubble /bʌbl̩/; hence these are known as ‘syllabic consonants’. Most languages indeed require a syllable to have a vowel. Slovak, however, uses /r/ and /l/ as the centre of syllables, for example in vrch /vr̩x/ (summit) and vlk /vl̩k/ (wolf).

All languages also have syllables with a Consonant Vowel (CV) structure, whether English buy, French toit (roof), or Arabic /la:/ (no). These can build up words with sequences of CV syllables, English CVCVCV banana, French CVCVCVCV réparations (repairs), Arabic CVCV /ʃa:mi/ (Syrian/Damascene).
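A CV description of a word can be produced mechanically once each sound is classed as a vowel or a consonant. In the minimal sketch below the transcriptions are rough ones supplied for the illustration, and cv_skeleton and VOWELS are invented names.

```python
def cv_skeleton(phonemes, vowels):
    """Map a sequence of phonemes to a string of C and V symbols."""
    return "".join("V" if p in vowels else "C" for p in phonemes)

# Only the vowels needed for these two examples; rough transcriptions, assumed.
VOWELS = {"ə", "a:", "ɑ:"}

print(cv_skeleton(["b", "ə", "n", "ɑ:", "n", "ə"], VOWELS))  # banana
print(cv_skeleton(["l", "a:"], VOWELS))                      # Arabic /la:/
```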

In some languages all syllables consist of this consonant vowel CV structure. Think of familiar Japanese brand-names such as Toyota, Mitsubishi, Yamaha. All Japanese syllables must be CV, with the exception of final ‘syllabic’ /n/, as in san (a respectful form of address), and doubled consonants as in nippon (Japan). That is to say, Japanese syllables may not start with more than one consonant and must not have a final consonant: they are all CV rather than CCV or CVC. To the foreigner, Japanese words seem comparatively easy to make out. Train announcements allow one to work out the name of stations in a series of obvious CV syllables, for example Shimokitazawa, or Akasakamitsuke. Sumo wrestlers have names that are also quite easy to recognise, Takanohana or Kyonoumi.

Other languages, however, permit a greater range of syllables by allowing a final consonant, that is to say consonant-vowel-consonant CVC. English not only has V syllables like /ai/ (I) and CV syllables like /bai/ (buy), but also has CVC syllables like /tɒm/ (Tom) and /bɪt/ (bit). French similarly has CVC syllables in femme /fam/ (woman), and tête /tɛt/ (head).

The final or initial consonant in the syllable can either be restricted to single consonants C or can be two or more consonants CC, CCC, and so on. English has some CCV syllables with two initial consonants, as in stay /stei/ or play /plei/; French has /tr/ as in travail (work) or /st/ in stylo (pen). There are, however, strict rules about which consonants can go together, some of which apply to all languages, some only to the sound system of a single language. For example /pn/ and /ps/ are possible CC combinations in German, as in Pneu (tyre) and Psychologie (psychology), but are not allowed in English, which has /n/ and /s/ in the parallel words pneumatic and psychology. Leaving aside combinations with /s/, only four consonants can occur as the second part of an English initial two-consonant CC cluster: /l/ as in please, /r/ as in trip, /j/ as in tune, and /w/ as in quick (the last two being ‘approximants’ rather than true consonants). Many combinations like fmooth and ptick are therefore ruled out. While /s/ can precede several second consonants, such as /st/ in sting, /sw/ in Swede, or /sp/ in spin, it cannot be followed by a fricative; there is no word like sfang. Other possible combinations do not actually occur, perhaps for accidental reasons, for instance /sb/ as in sbang or /tl/ as in tlun. Some combinations are rare: /vj/ for example occurs only in view, /lj/ in lute, /sj/ in a few words in some people’s speech, such as suit and assume.

A particular language may also restrict which single consonants can occur in the initial or final C position of the syllable. The English sound /ŋ/ cannot occur at the beginning of a syllable: sing or ringer or incredible are possible but not ngis /ŋɪs/ or nging /ŋɪŋ/.

The English /h/ only occurs at the beginning of the syllable, as in hot or perhaps, never at the end, despite its occasional presence in the spelling. That is to say, there are no English words pronounced as toh /təʊh/ and perpah /pəpa:h/. The absence of final /h/ seems obvious to an English speaker. After all, how can one hear a sound that is just air breathed out if nothing follows it? It is equally obvious to a speaker of Persian that the word /mah/ (moon) differs from the word /ma/ (us) and that /jɒh/ (position) differs from /jɒ/ (place); Persians have no problems in hearing a final /h/.

English-speaking science fiction writers have the problem of inventing names for aliens from other worlds that seem both plausible and exotic. One possibility is to scatter a few apostrophes to indicate some non-English sound, presumably a click, for example Halyan’t’a. Another common approach creates possible, but non-existent, English words out of ordinary sound combinations, such as Krondaku, ezwal or Ondods. Or English sounds are put in unusual combinations, tnuctipun, Kdatlyno, Kzinti; villainous races can also usually be identified by the plosive sounds in their names.

The English consonant combinations that exceed two consonant CC combinations are very restricted. In syllable-initial combinations, the first consonant must be /s/, the second /p/, /t/, /f/ or /k/, and the third /l/, /r/, /j/, or /w/, as in splinter /spl/ or scream /skr/. Even then certain combinations do not exist, say /stw/ in stweel or /spw/ in spweet.
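The restrictions on initial clusters described in this section can be collected into a small checker. This is a sketch of the rules as stated here, not a full account of English: passing the check is necessary rather than sufficient, since combinations such as /stw/ or /sb/ satisfy the stated rules yet do not occur. The set and function names are invented for the illustration, and the list of fricatives is filled in from the consonants described earlier in the chapter.

```python
FRICATIVES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h"}
SECOND_OF_TWO = {"l", "r", "j", "w"}      # second consonant when the first is not /s/
SECOND_OF_THREE = {"p", "t", "f", "k"}    # middle consonant after /s/
THIRD_OF_THREE = {"l", "r", "j", "w"}

def passes_stated_rules(onset):
    """Check a syllable-initial cluster (a list of phonemes) against the stated rules."""
    if len(onset) <= 1:
        return True                       # single consonants are not examined here
    if len(onset) == 2:
        first, second = onset
        if first == "s":
            return second not in FRICATIVES
        return second in SECOND_OF_TWO
    if len(onset) == 3:
        first, second, third = onset
        return first == "s" and second in SECOND_OF_THREE and third in THIRD_OF_THREE
    return False                          # no initial cluster exceeds three consonants

for cluster in (["p", "l"], ["f", "m"], ["s", "f"], ["s", "p", "l"], ["s", "t", "w"]):
    print("".join(cluster), passes_stated_rules(cluster))
```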

While the end of the English syllable can go up to four consonants (CCCC), these also consist of only a few combinations, usually ending in /s/, for the simple reason that English grammar adds /s/ to the end of words to show number of nouns (books) and verbs (sits) and possession of nouns (John’s), leading to /lfθs/ in twelfths, /mpts/ in prompts, and /ksts/ in texts. Because of the problems of saying such sequences, English speakers often simplify them by leaving one or more of the consonants out, i.e. /twelθs/ for twelfths without an /f/. The absolute maximum for English is claimed to be the final five-consonant cluster /mpfst/ in Thou triumphst!, that is, if the speaker says -mphst with a /p/.

How do people cope with combinations that are not found in their own language? Science fiction names like tnuctipun or fnool in fact pose little problem: the impossible CC /tn/ and /fn/ combinations are padded out to be a separate CVC syllable of English by adding a vowel /ə/ to get /tən/ and /fən/. The same happens to foreign names that do not conform to English rules; Rwanda often gets an extra /ə/ to produce CVC at the beginning /rəwændə/, as shown for instance in a notice outside a Bodyshop: This branch has collected £200 for relief for Rawanda. The Jaffa orange box labelled Tnuport Produce of Israel is doubtless called /tənu:pɔ:t/ by those who handle it.
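The padding strategy can be sketched as a single repair rule: if a word begins with two consonants that the language does not allow together, insert a vowel between them. The set of allowed pairs below is a small assumed sample rather than a full list for English, and the function name is invented for the illustration.

```python
# Only the vowels needed for these examples, and a small assumed sample of allowed pairs.
VOWELS = {"ə", "u:", "ɔ:", "æ", "ɪ"}
ALLOWED_CC = {("t", "r"), ("p", "l"), ("s", "t")}

def pad_initial_cluster(phonemes, allowed=ALLOWED_CC, vowels=VOWELS, filler="ə"):
    """Insert a filler vowel after the first consonant of a disallowed initial CC pair."""
    if len(phonemes) < 2:
        return list(phonemes)
    first, second = phonemes[0], phonemes[1]
    both_consonants = first not in vowels and second not in vowels
    if both_consonants and (first, second) not in allowed:
        return [first, filler] + list(phonemes[1:])
    return list(phonemes)

print("".join(pad_initial_cluster(["t", "n", "u:", "p", "ɔ:", "t"])))  # Tnuport
print("".join(pad_initial_cluster(["r", "w", "æ", "n", "d", "ə"])))    # Rwanda
print("".join(pad_initial_cluster(["t", "r", "ɪ", "p"])))              # trip is left alone
```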

People learning English as a foreign language use the same strategy when a syllable does not conform to the pattern of their first language. As Arabic is a CV language, it has no CC combinations. So speakers of Arabic turn English plastic into belastic and translate into tiransilate. Maori speakers similarly turn milk into miraka and bridge into periki to avoid the CC combinations /lk/ and /br/. And it is fairly obvious which English words the following Tok Pisin words are based on: bilak, bulu, sikin, pilum.

In Japanese the attempt to make English syllables conform to the CV structure makes English words almost unrecognisable. To discover which English words are the basis of the Japanese sutoraiki, sumaato and sutandoba, delete the Vs which are padding out the CC clusters, to get s.t.rik.e (strike), s.maat. (smart) and s.tand.bar (i.e. a bar where you don’t sit down). English people similarly add a vowel to make the German "kn" sequence conform to the English pattern in words such as Knorr /kənɔ:/.

To sum up, the sequences of sounds in a language are tightly restricted in terms of the combinations of consonants that can occur at the beginning or ending of the syllable. Speakers feel uncomfortable with syllables that do not occur in their own language and attempt to make them conform to their expectations in various ways.

6. Sounds and phonemes

The vague word ‘sound’ has been used up to this point to talk about speech. But there are different levels of speech sound. One level is the actual description of the sounds themselves as sheer ‘sound’, studied in the branch of linguistics called ‘phonetics’. The next level is the sound system of a particular language. English or French or Japanese use a small selection of the sounds possible in human languages, the subject matter of ‘phonology’. The present section then looks at this next level, namely how particular languages use sounds within their own phonological systems.

What does it mean to say English has 44 sounds? There are several distinct ways of pronouncing /p/ in English—after /s/ as in spy, before a vowel as in pat, and at the end of a word as in sap, the sound that precedes or follows it influencing the VOT. How can the sounds of a language be counted when any sound can have all these variations?

The solution is a second level of sounds, called ‘phonemes’. These are the sounds that the native speakers of a particular language use to distinguish different words, that is to say are part of its phonological system. If speakers of English hear someone say /pi:k/ (peak) and /bi:k/ (beak), they recognise that /p/ is not /b/; in other words they hear two different words with different meanings, peak and beak. When they hear the /p/ sounds in pit, spit, split, and stop, however, they still recognise the sound as a /p/ despite the differences.

The sounds /p/ and /b/ are therefore phonemes of English because they potentially distinguish words such as peak and beak. The distinction between the sheer sounds of speech and the phonemes of a particular language is often shown in phonetic transcript by enclosing sheer sounds in square brackets [it] and the phonemes of a particular language in slanting brackets /it/.

Hindi has two words /phik/ and /pi:k/, one with, one without the following puff of air, symbolised by "h". That is to say, Hindi has two /p/ sounds, one with 0 VOT, one with +VOT; both of them are distinct phonemes. The same difference occurs in English in the [p] of spit and the [ph] of pit but it is heard by English speakers as two variants of one phoneme /p/. The technical term for the different ways in which a phoneme can be pronounced is ‘allophone’. A particular language may use a sound difference to distinguish two phonemes, or may ignore it and treat them as the same phoneme. /ph/ and /p/ are two phonemes in Hindi, two allophones of one phoneme in English. Indeed Hindi has an aspirated counterpart to add to the English pair at each of the positions for plosives, /ph/ /th/ /kh/ as well as /p/ /t/ /k/, and /b/ /d/ /g/. English people may be able to tell the difference between /ph/ and /p/ but the difference does not matter to the English sound system since it never in itself marks the sole difference between two words.
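The relation between sounds and phonemes can be pictured as a lookup table that differs from language to language. The tables below are a deliberately tiny assumed sample covering only the lip plosives, written with "h" for aspiration as in the text; PHONEME_OF and same_phoneme are names invented for the illustration.

```python
# For each language, map a sound to the phoneme its speakers hear it as.
PHONEME_OF = {
    "English": {"p": "p", "ph": "p", "b": "b"},
    "Hindi":   {"p": "p", "ph": "ph", "b": "b"},
}

def same_phoneme(sound_a, sound_b, language):
    """True if the two sounds are allophones of one phoneme in the given language."""
    table = PHONEME_OF[language]
    return table[sound_a] == table[sound_b]

print(same_phoneme("p", "ph", "English"))  # allophones of /p/
print(same_phoneme("p", "ph", "Hindi"))    # two distinct phonemes
```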

A further example is the English lateral /l/ in leap. One type of English /l/, called ‘clear’ [l], is the syllable-initial pronunciation in leap. A second type, called ‘dark’ [l] and transcribed as [ɫ], is the syllable-final pronunciation in peal. Syllable-initial clear [l] sounds like a front vowel because, apart from the tip of the tongue touching the teeth ridge, the tongue has the configuration of a front vowel with the front part raised. Syllable-final dark [ɫ] sounds like a back vowel because the tongue has the configuration of a back vowel with the back of the tongue raised. A characteristic of Irish English is the lack of difference between clear and dark /l/s.

Many British people nowadays have yet a third variety of /l/, namely a back vowel similar to the initial sound /w/ of woman, and lacking the tongue contact with the roof of the mouth typical of /l/. The comedian Michael Barrymore has a catchphrase Awright /ɔ:waiʔ/. In my own speech full is pronounced closer to /fuw/ than to /ful/, a characteristic of the variety of English often now known as Estuary English after the Thames Estuary where it is allegedly spoken.

15% of the Inside Language panel owned up to this vocalic /l/ (Q8), 77% denied it.

The three [l] sounds are not phonemes of English because the difference is never important to the understanding of speech: an /l/ is an /l/ whichever way it is pronounced. The difference between clear and dark [l]s is entirely predictable from their position in the syllable and never distinguishes two words. Yet, in Polish, lata with a clear [l] means ‘year’ and łata with a dark [ɫ] means ‘patch’; the two lateral sounds are different phonemes, not allophones, and so the two words have different meanings.
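
The difference between allophones and phonemes can be put as a small Python sketch. The table is an illustrative assumption, a tiny fragment rather than a description of any language: each language groups sounds under phonemes, and two sounds contrast only if they fall under different phonemes. The labels such as "clear l" and the function names are invented for the example.

```python
# Hypothetical fragments of three sound systems: phoneme -> its variants.
# Only the groupings mentioned in the text are included.
PHONEMES = {
    "English": {"/p/": {"plain p", "aspirated p"}, "/l/": {"clear l", "dark l"}},
    "Hindi": {"/p/": {"plain p"}, "/ph/": {"aspirated p"}},
    "Polish": {"/l/": {"clear l"}, "/dark l/": {"dark l"}},
}


def phoneme_of(language, sound):
    """Return the phoneme that a sound belongs to in the given language."""
    for phoneme, variants in PHONEMES[language].items():
        if sound in variants:
            return phoneme
    raise KeyError(f"{sound!r} is not listed for {language}")


def contrasts(language, sound_a, sound_b):
    """True if the two sounds belong to different phonemes in the language."""
    return phoneme_of(language, sound_a) != phoneme_of(language, sound_b)
```

With this table, contrasts("English", "clear l", "dark l") is False while contrasts("Polish", "clear l", "dark l") is True: the same pair of sounds is a single phoneme in one system and two phonemes in the other.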

The type of transcript used in this chapter for English is then based on the phoneme: it is a ‘phonemic’ transcription showing the significant contrasting sounds in the phonology of a language, not a ‘phonetic’ transcription showing the minutest variation in sounds. Hence the difference between the transcripts in slanted brackets such as /pil/ pill and those in square brackets such as [piɫ] is that the former are given in a ‘broad’ transcript showing phonemes, the latter in a ‘narrow’ transcript showing different allophones.

The number of phonemes varies greatly from one language to another. The smallest number found in UPSID is the 11 of Rotakas, an Indo-Pacific language, the largest the 141 of !Xu, English coming somewhere in the middle of the range with around 44. The average for a language is in fact 31, with 70% of languages having between 20 and 37.

The traditional way of establishing the phonemes of a language is to look for pairs of words known as ‘minimal pairs’. The linguist asks native speakers of English whether pin is different from pig. If they agree, two phonemes of English have been established: /n/ and /g/. Then they are tried with [pɪl] and [pɪw]. They may recognise these are different accents of English and even regard the [w] pronunciation of /l/ as an abomination. But they will still say both words are pill. Next they are tried with peat and pit; then with aspirated [pʰɪt] and non-aspirated [pɪt]; and so on, till all the likely sound contrasts have been tested. In principle this ‘minimal pairs’ technique establishes the repertoire of phonemes in a language.
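
The minimal pairs technique can be sketched in Python as a search through a word list for words whose transcriptions differ in exactly one sound. This is only an illustration under stated assumptions: the toy lexicon, its simplified transcriptions and the function name are invented for the example, and a real linguist relies on the judgements of native speakers rather than on a ready-made list.

```python
from itertools import combinations

# Toy lexicon: spelling -> simplified phonemic transcription (an assumption).
LEXICON = {
    "pin": ("p", "i", "n"),
    "pig": ("p", "i", "g"),
    "pit": ("p", "i", "t"),
    "peat": ("p", "i:", "t"),
    "bin": ("b", "i", "n"),
}


def minimal_pairs(lexicon):
    """Return (word1, word2, (sound1, sound2)) for words differing in one sound."""
    pairs = []
    for (word1, sounds1), (word2, sounds2) in combinations(sorted(lexicon.items()), 2):
        if len(sounds1) != len(sounds2):
            continue  # a minimal pair must have the same number of sounds
        differences = [(a, b) for a, b in zip(sounds1, sounds2) if a != b]
        if len(differences) == 1:
            pairs.append((word1, word2, differences[0]))
    return pairs
```

Each pair returned is evidence for two phonemes; pig and pin, for example, come back with the contrast ("g", "n").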

It is, however, difficult to find minimal pairs for the phonemes /θ/ and /ð/. In the middle of words either /i:ðə/ can be paired with /i:θə/ ether, if, of course, either is not pronounced /aiðə/. At the end of words, there are a few pairs like sooth /su:θ/ and soothe /su:ð/. It is very hard, however, to find minimal pairs to contrast the initial voiced /ð/ sound of the with the voiceless /θ/ of three, the only candidate seeming to be thigh /θai/ versus thy /ðai/. The reason is that the initial /ð/ sound occurs in ‘grammatical’ words like this and then, rarely in ‘content’ words, as was seen in Chapter Two, and there are only a handful of such words in the language. Ask someone to pronounce the name of a place they are unlikely to have heard before (unless they live in Essex), namely Theydon Bois, and they will pronounce it /θeidən/, not /ðeidən/, although the only other word with similar spelling they are likely to have encountered is they /ðei/. Because Theydon is not a grammatical word, it cannot have /ð/. Like intonation, the sounds of a language cannot be divorced from its grammar. It would be difficult to pronounce the sounds of English unerringly without knowing grammatical information about which words are nouns and which are grammatical words.

Similar problems in finding minimal pairs arise accidentally with other phonemes of English. As we have seen, the English phoneme /h/ of hot only occurs at the beginning of syllables, i.e. the first C in the CVC syllable structure; the phoneme /ŋ/ of sang occurs only at the end of syllables, the second C in CVC. It is therefore impossible to find a pair of words where the two can be contrasted and definitely established to be different phonemes. By analogy with the two forms of /l/, arguably /h/ and /ŋ/ are one phoneme with two different sounds.
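
The argument about the sounds of hot and sang rests on the positions in which each sound can occur. A rough Python sketch, assuming a toy list of CVC words with simplified transcriptions ("N" stands in for the final sound of sang, and all names are invented for the example), collects the positions of each sound and checks whether two sounds ever share one.

```python
# Toy CVC words as (first C, V, second C); transcriptions are simplified
# assumptions and "N" is a stand-in for the final sound of sang.
SYLLABLES = {
    "hot": ("h", "o", "t"),
    "hat": ("h", "a", "t"),
    "sang": ("s", "a", "N"),
    "hang": ("h", "a", "N"),
    "tot": ("t", "o", "t"),
}


def positions(sound, syllables):
    """Return the set of consonant positions in which the sound is found."""
    found = set()
    for first_c, _vowel, second_c in syllables.values():
        if first_c == sound:
            found.add("initial")
        if second_c == sound:
            found.add("final")
    return found


def never_share_a_position(sound_a, sound_b, syllables):
    """True if no position is open to both sounds, so no minimal pair can exist."""
    return positions(sound_a, syllables).isdisjoint(positions(sound_b, syllables))
```

On this list never_share_a_position("h", "N", SYLLABLES) is True, whereas "t" and "h" both occur initially and so could in principle be contrasted.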

The common-sense solution is to insist that two sounds as different as /h/ and /ŋ/ are unlikely to be variants of the same phoneme, even if this contrast cannot be shown through minimal pairs; sounds have to be similar to belong to the same phoneme. This solution does not work for languages that have a large range of allophones for some phonemes. Kabardian, a language spoken in the North-West Caucasus, has a single high vowel that has six variants running all the way from front [i] to back [u]. Tamil has a single plosive consonant that may be spoken as [p], [t], or [k], and furthermore is voiced before final vowels; if this were true of English, Poe, toe, Coe and go would all be the same word. Only the sound system of a language can decide whether two sounds belong to the same phoneme or not.

Minimal pairs in fact became a favourite tool for teaching English to foreigners. One textbook was called Ship or Sheep?; its sequel was Tree or Three? Exercises in some books test the students on whether the teacher has said /i:/ or /ɪ/, bean or bin, or /g/ or /k/, good or could. My favourite tests the difference between It’s nice, It’s rice, and It’s lice; it is hard to imagine a real world situation where these sentences are equally possible.

Sometimes the teaching materials put the minimal pair in a sentence which the student is asked to repeat: Jean likes gin but gin doesn’t like Jean. Or longer stretches of speech are used that have liberal examples of a sound: Don’t you know Rover’s got no bone? What, no bone for Rover? Rover won’t stay at home unless Rover’s got a bone. Joe, go to Jones ... and so on for another eighteen memorable lines. The fallacy in using minimal pairs for teaching is that they are a linguist’s technique for establishing the phonemes of a new language, not the natural means through which children learn their mother tongue or adults a second language.

However, paying too much attention to the phoneme makes speech seem a sequence of separate sounds rather than the continuous process it is. One solution is to break the phoneme up into smaller elements called ‘distinctive features’. Instead of each sound being an entity of its own, it is seen as a bundle of elements, rather like a molecule made up of different atoms. Each difference between one sound and another is reduced to a yes/no, + or -, choice, called a ‘distinctive feature’. These two-way choices have already been slipped into this chapter several times. Voiced versus voiceless sounds for example were given as +voice and -voice. The sound /b/ of rib is +voice, the /p/ of rip is -voice. Vowels are +voice almost by definition. Other distinctive features that have been used are ±high and ±back. The English /i:/ vowel of see is +high -back, the /ɒ/ sound of fog is -high +back, and so on for all the other vowels. And the ±tense feature distinguishes +tense sounds like /t/ (tart) from relaxed -tense sounds like /d/ (dart). Distinctive features are a binary code, like that used on computers or CDs, which can capture all possible sounds of speech.
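
The idea of a sound as a bundle of two-way choices can be sketched in Python. The feature values are those given in the paragraph above; the ASCII labels for the sounds ("o" for the vowel of fog), the table layout and the function names are assumptions made for the example, and only the features mentioned in the text are listed.

```python
# Each sound is a bundle of features; True stands for +, False for -.
# Only feature values stated in the text are included.
FEATURES = {
    "b": {"voice": True},
    "p": {"voice": False},
    "i:": {"voice": True, "high": True, "back": False},  # vowel of see
    "o": {"voice": True, "high": False, "back": True},   # vowel of fog
    "t": {"tense": True},
    "d": {"tense": False},
}


def describe(sound):
    """Write a bundle in the +feature / -feature notation, in alphabetical order."""
    bundle = FEATURES[sound]
    return " ".join(("+" if bundle[name] else "-") + name for name in sorted(bundle))


def differing_features(sound_a, sound_b):
    """List the features listed for both sounds on which they take opposite values."""
    a, b = FEATURES[sound_a], FEATURES[sound_b]
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

Here describe("i:") gives "-back +high +voice", and differing_features("b", "p") gives ["voice"]: the whole difference between rib and rip is carried by a single binary choice.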

7. Alternatives to speech sounds
Spoken sounds are only one of the means through which language can be expressed. There are forms of language that do not involve sounds produced by the vocal organs. The most obvious is written language, whether using an alphabet based on sounds or a character system based on meanings, as seen in Chapter Five. In Zaire, however, there are drum languages in which the sounds are conveyed on a wooden drum called a boungu tuned to give two notes, Low (male) and High (female), when hit on different sides. Any word can be converted into a sequence of High and Low notes, rather like the Long and Short of Morse Code, and broadcast for up to seven miles on a still night. Thus in Kele a word such as sango (father) is a sequence of two High notes, HH; nyango (mother) is a Low followed by a High, LH; and wana (child) is a High followed by a Low, HL. To arrive at the drum expression for ‘orphan’ means adding some grammatical words:

    English:  child   has   no   father   nor   mother
    Kele:     wana    ati   la   sango    la    nyango
    Drums:    H L     L H   L    H H      L     L H
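
The conversion from words to drum notes can be sketched in Python. The tone patterns are copied from the table above; the dictionary, the function names and the idea of looking a pattern up in reverse are assumptions made for illustration, not part of the source description.

```python
# Tone pattern of each Kele word in the table: H = High note, L = Low note.
DRUM_TONES = {
    "wana": "HL",
    "ati": "LH",
    "la": "L",
    "sango": "HH",
    "nyango": "LH",
}


def drum(phrase):
    """Convert a phrase of known Kele words into its sequence of drum notes."""
    return " ".join(DRUM_TONES[word] for word in phrase.split())


def words_with_pattern(pattern):
    """Return every listed word that is drummed with the given pattern."""
    return sorted(word for word, tones in DRUM_TONES.items() if tones == pattern)
```

drum("wana ati la sango la nyango") gives "HL LH L HH L LH". Even in this handful of words two share a pattern: words_with_pattern("LH") returns ["ati", "nyango"], so a single word on the drum can be less distinct than the same word in speech.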

A further alternative to speech sounds is whistling, which is used to communicate across distances of up to 5 kilometres across thinly populated country, for example by shepherds or by hunters, in parts of the globe ranging from Mexico to Burma to the Canaries. Whistle languages do not convert speech sounds to high and low notes, but substitute particular notes for each vowel with consonants given by transitions between the vowels. Both drumming and whistling convert spoken language into a different medium rather than being an independent form in their own right. In other words, they are like Morse code or shorthand in being parasitic on spoken language.

A true alternative to speech is, however, found in the languages used by the deaf, which involve gestures rather than sounds and are capable of communicating as complex ideas through as complex structures as any other human language. Take two signs from British Sign Language (BSL). The sign for ‘woman’ is the index finger of the right hand stroking the right cheek; the sign for ‘England’ is the two hands in front of the chest with the two index fingers stretched out horizontally moving to and fro, from left to right.

These gestures are just as difficult to describe in words as the sounds of speech. For the gestures of deaf language are organised in the same way as the sounds of speech. Just as the organ making the speech sounds, such as the tongue, needs to be specified so does the shape of the hand, with 51 different handshapes possible in BSL. Then, as for plosives and fricatives and diphthongs, the types of movement need to be described, some 37 for BSL. As with the vowel space inside the mouth, the location where the sign is made needs to be specified, including in BSL nine positions on the face and four on the neck and trunk. Sometimes the same sign has different meanings if produced at a different level, just as a /p/ is different from a /k/. Thus sign language has all the normal possibilities of the phonological system of human languages.

Sign languages should not then be confused with natural gesture systems based on mime. Many deaf language signs may have originated in ‘natural’ gestures: the BSL sign for ‘bird’ is the finger and thumb of the right hand opening and closing at nose level, clearly representing a beak. Most signs have, however, become purely arbitrary; the sign for ‘England’ mentioned above for example is a remote descendant of a finger-spelling sign rather than any recognisable shape. Sometimes fanciful origins for signs have been devised. The BSL cheek stroking sign for ‘woman’ has been explained variously as ‘curls on a woman’s cheek’, ‘bonnet strings’, and ‘soft cheek’. Yet a hundred years ago the sign was stroking the lips, showing that none of these explanations can be right.

While there may be some visual links between some signs and what they mean, these are not much closer than those between natural sounds and the sounds of speech. Indeed otherwise there would not be large differences between the different sign languages of the world, whether Chinese Sign Language, British Sign Language, or French Sign Language. Even within a single country such as France or England there are strong dialect differences. Sign users from different regions may not understand each other completely. Deaf members of a theatre audience in Manchester for example complained that they could not understand the BSL interpreter of a play because he was not using the signs current in that city.

This chapter has then shown that the sound system of a language consists on the one hand of particular intonation patterns, on the other of a certain number of phonemes. The actual sounds are limited by what the organs of speech can do and by universal factors such as distinctive features and sonority. Even when languages have the same sounds, they use them in specific ways according to their own systems. It is the meaningful contrasts between the sounds that are important - High Rise John versus High Fall John, or got versus cot - not the sheer sounds themselves.

 

Sources and further reading

General

Well-written introductions to phonetics and phonology in general, from which many of the examples in this chapter are taken, can be found in: Abercrombie, D. (1967), Elements of General Phonetics, Edinburgh University Press; Laver, J. (1994), Principles of Phonetics, CUP; O’Connor, J.D. (1973), Phonetics, Penguin. An extensive discussion of RP can be found in Wells, J.C. (1982), Accents of English, CUP, 3 vols. The classic description of the RP accent of English is: Gimson, A. (1962), An Introduction to the Pronunciation of English, London, Edward Arnold.

Sounds and meanings

A collection of articles on sound symbolism is: Hinton, L., Nichols, J. & Ohala, J. (1994), Sound Symbolism, CUP, which has a paper by John Ohala putting forward the Frequency Code.

Intonation and its functions

A standard introduction to intonation is: Cruttenden, A. (1986), Intonation, CUP. English intonation in particular is covered in: O'Connor, J.D., & Arnold, G.F. (1973), Intonation of Colloquial English, Longman, 2nd Edition. Rises in New Zealand intonation are described in: Britain, D., & Newman, J. (1992), ‘High rising terminals in New Zealand English’, Journal of the International Phonetic Association, 22, 1-11.

Producing the sounds of speech

The UPSID statistics for different languages are presented in: Maddieson, I. (1984), Patterns of Sounds, CUP. Slovak is described in: Baláž, P., Darovec, M., Trebatická, H. (1976), Slovak for Slavicists, Bratislava: Slovenské pedagogické nakladateľstvo. Details of English vowels around the world are based on: Wells, J.C. (1982), Accents of English, CUP, 3 vols. Clear diagrams of the RP vowels and consonants are given in: Crystal, D. (1995), The Cambridge Encyclopedia of the English Language, CUP. The sources for tongue-twisters are: Schwartz, A. (1972), A Twister of Twists, a Tangler of Tongues, Lippincott, Philadelphia; Sanouillet, M. & Peterson, E. (eds.) (1978), The Essential Writings of Marcel Duchamp, Thames and Hudson.

Air and speech

The original article on VOT is: Liberman, A.M, Cooper, F.S., Shankweiler, D.S. & Studdert-Kennedy, M. (1967), ‘Perception of the speech code’, Psychological Review, 74, 431-461

Combining sounds into syllables

Epenthetic vowels in L2 learners are discussed in: Broselow, E. (1988), ‘Prosodic Phonology and the Acquisition of a Second Language’, in S. Flynn and W. O'Neil (eds.), Linguistic Theory in Second Language Acquisition, Kluwer, Dordrecht

Sounds and phonemes

Estuary English is described in a popular book: Coggle, P. (1993), Do You Speak Estuary?, Bloomsbury, London. English pronunciation exercises can be found in: Baker, A. (1981), Ship or Sheep?, CUP; Hill, L.A. (1961), Drills and Tests in the English Sounds, Longman.

Alternatives to speech sounds

The source for drum language is: Carrington, J. (1947), Talking Drums of Africa, London. British Sign Language is described in: Woll, B., Kyle, J. and Deuchar, M. (199X), Perspectives on British Sign Language and Deafness, CUP; Kyle, J.G., and Woll, B. (1985), Sign Language, CUP. Hand gestures themselves are covered in: McNeill, D. (1992), Hand and Mind, University of Chicago. Whistle languages can be found in: Busnel, R.G., and Classe, A. (1976), Whistled Languages, Berlin, Springer; Thomas, A. (1995), ‘Whistled languages’, e-mail summary THOMAS@arts.uoguelph.ca