In general terms, a character is an abstract component of a script, with no visible form, while a glyph is a visible component of a script. For example, two differently shaped forms of the letter a (such as the double-storey and single-storey forms) are two glyphs that both correspond to the character 'lower case a' in the Latin script. Similar glyphs found in different fonts also correspond to this character.
An element of a script will be defined as an abstract unit of linguistic meaning. A script may be analyzed into a minimal set of elements (vowels, consonants, etc.), such that the linguistic meaning of any text using the script is expressible in terms of these elements.
For example, in older English spelling the vowels e and æ are contrasting elements of the Latin script, but in recent English spelling e and æ are orthographically equivalent, so æ is not needed as an element.
The Unicode meaning of character (i.e. a character encoded in Unicode) is best understood by noting that the following are included as Unicode characters in Indic scripts; it is helpful to remember that Unicode is geared to presentation. (See The Unicode Standard, Version 3.0. Addison-Wesley, 2000.)
Indic Unicode characters:
Matras (vowel forms for combination with consonants).
The second parts of two-part matras.
Bases for Gurmukhi vowel signs.
Two Vedic accents.
Consonants in general.
Signs affecting the meaning of consonantal characters (e.g. Virama, and Gurmukhi Adhak).
A few consonantal clusters (most are formed as combinations of characters).
Formatting characters forming special combinations of consonants (e.g. for Nepali Urpha).
Characters with diacritics for use in various languages.
Numerical and currency characters.
Punctuation and similar signs.
Typographical symbols representing words.
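As a rough illustration of the list above, the following Python sketch looks up the Unicode name and general category of one sample code point for several of the kinds of character mentioned. The choice of sample code points, the labels in `SAMPLES`, and the helper name `describe` are assumptions made for this example; they are not taken from The Unicode Standard's own classification of Indic characters.

```python
import unicodedata

# Illustrative samples only: one code point per kind of character in the list.
# The pairing of each label with a code point is this example's assumption.
SAMPLES = {
    "consonant": "\u0915",                      # DEVANAGARI LETTER KA
    "matra": "\u093E",                          # DEVANAGARI VOWEL SIGN AA
    "second part of a two-part matra": "\u0BD7",  # TAMIL AU LENGTH MARK
    "base for Gurmukhi vowel signs": "\u0A72",  # GURMUKHI IRI
    "Vedic accent": "\u0951",                   # DEVANAGARI STRESS SIGN UDATTA
    "virama": "\u094D",                         # DEVANAGARI SIGN VIRAMA
    "Gurmukhi Adhak": "\u0A71",                 # GURMUKHI ADDAK
    "formatting character": "\u200D",           # ZERO WIDTH JOINER
    "digit": "\u0966",                          # DEVANAGARI DIGIT ZERO
    "currency sign": "\u09F3",                  # BENGALI RUPEE SIGN
    "punctuation": "\u0964",                    # DEVANAGARI DANDA
}


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for kind, ch in SAMPLES.items():
    print(kind, *describe(ch), sep=" | ")
```

The general categories show how heterogeneous the set is: consonants are letters (`Lo`), matras and the virama are combining marks (`Mc` or `Mn`), and the joiners are format characters (`Cf`).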
The rules for rendering strings of Unicode characters as glyphs may be shown using the following notation: a Unicode character is denoted by square brackets around a representative glyph of the character; for reference, a non-Unicode character is denoted by curly brackets around a glyph.
The formatting character [ZWJ] is 'zero width joiner', and [ZWNJ] is 'zero width non-joiner'. The following list of basic rendering rules also shows a few non-Unicode characters:
In Indic scripts, most glyphs of consonant clusters correspond to several combined Unicode characters, and not to a separate Unicode character.
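To make this concrete, here is a minimal Python sketch using the Devanagari cluster written as KA + VIRAMA + SSA, which fonts usually draw as a single conjunct glyph. The variable names and the helper `code_points` are hypothetical, and the effect of [ZWJ] and [ZWNJ] described in the comments is the behaviour requested of a renderer; what is actually drawn depends on the font and rendering engine.

```python
import unicodedata

KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"
ZWJ, ZWNJ = "\u200D", "\u200C"

# One conjunct glyph on screen, but three Unicode characters in the text.
kssa = KA + VIRAMA + SSA

# Same consonants, different requested presentation:
explicit_virama = KA + VIRAMA + ZWNJ + SSA  # asks for KA with a visible virama
half_form = KA + VIRAMA + ZWJ + SSA         # asks for the half-form of KA


def code_points(text):
    """List the code point and Unicode name of each character in text."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]


for label, text in [("conjunct", kssa),
                    ("explicit virama", explicit_virama),
                    ("half form", half_form)]:
    print(label, len(text), code_points(text))
```

All three strings carry the same consonants; only the formatting character, and therefore the preferred glyph, differs.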
A few glyphs with nuqtas are also Unicode characters. Thus:
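As a hedged illustration in code, the Devanagari letter QA is one such case: it is encoded as a single Unicode character, yet the same glyph can also be written as KA followed by the combining nukta sign. The sketch below uses Python's standard `unicodedata` module; the variable names are chosen for this example only.

```python
import unicodedata

QA = "\u0958"              # DEVANAGARI LETTER QA, a single Unicode character
KA_NUKTA = "\u0915\u093C"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

# The character database records QA as equivalent to KA + NUKTA.
print(unicodedata.name(QA))
print(unicodedata.decomposition(QA))

# Canonical decomposition (NFD) turns the one character into the two-character form.
decomposed = unicodedata.normalize("NFD", QA)
print(decomposed == KA_NUKTA, len(QA), len(decomposed))
```

So one glyph may correspond either to one Unicode character or to a combination of two, which is part of why neither glyphs nor characters map neatly onto script elements.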
Because of the existence of different fonts and styles, and also because of alternative glyphs, the glyphs shown above are not unique. This may be indicated schematically by:
Glyphs cannot be the basis of transliteration, because some are merely alternatives and others represent two or more combined consonants. Characters are also unsuitable: formatting characters, characters that are diacritics or parts of a matra, and every instance of the Unicode character Virama are not suitable for transliteration, nor would it be proper to have a special transliteration for each non-Unicode character for combined consonants.
In a specimen of text, the string of glyphs corresponds to the surface structure of the script. The meaning of the glyphs corresponds to the deep structure of the script. The deep structure is therefore expressible using a minimal set of script elements. Transliteration is best applied to this set, or a subset of it.
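The idea can be sketched in Python for a tiny subset of Devanagari. This is only an illustration under stated assumptions: the tables `CONSONANTS` and `MATRAS`, the function name `to_elements`, and the romanized values are hypothetical choices for this example, not a transliteration scheme proposed by the text. The sketch discards formatting characters, treats the virama as cancelling the inherent vowel rather than as something to transliterate, and emits a list of script elements.

```python
# Hypothetical, deliberately tiny tables; the romanizations are assumptions.
CONSONANTS = {"\u0915": "k", "\u0937": "ṣ", "\u0924": "t",
              "\u0930": "r", "\u092E": "m"}
MATRAS = {"\u093E": "ā", "\u093F": "i", "\u0940": "ī", "\u0941": "u"}
VIRAMA = "\u094D"
FORMATTING = {"\u200C", "\u200D"}  # ZWNJ, ZWJ: presentation only
INHERENT_VOWEL = "a"


def to_elements(text):
    """Reduce a string of Unicode characters to a list of script elements."""
    elements = []
    pending = False  # True while the last consonant still awaits its vowel
    for ch in text:
        if ch in FORMATTING:
            continue  # affects glyph choice, not linguistic content
        if ch in CONSONANTS:
            if pending:
                elements.append(INHERENT_VOWEL)
            elements.append(CONSONANTS[ch])
            pending = True
        elif ch in MATRAS:
            if not pending:
                raise ValueError("matra without a preceding consonant")
            elements.append(MATRAS[ch])
            pending = False
        elif ch == VIRAMA:
            if not pending:
                raise ValueError("virama without a preceding consonant")
            pending = False  # the consonant carries no vowel
        else:
            raise ValueError(f"unsupported character U+{ord(ch):04X}")
    if pending:
        elements.append(INHERENT_VOWEL)
    return elements


# Different surface strings, same underlying elements.
print(to_elements("\u0915\u094D\u0937"))
print(to_elements("\u0915\u094D\u200D\u0937"))
```

Both calls yield the same elements, although the two strings differ in their characters and may differ in their glyphs; a transliteration defined on the elements is unaffected by either difference.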
Last updated: 10 June 2002