Introduction

Over the years, I've gradually added extra TTF (or OTF) fonts to my linuxfromscratch desktops - I gave up on the traditional xorg fonts several years ago. I get annoyed if my browser shows a box with dots instead of rendering a character (actually, the “dots” are the value of the codepoint), even if I have no chance of understanding it.

I had certain fonts which I used to use, and others that I'd installed “just in case”. But then I started using libreoffice - that meant that I added the Liberation fonts (they are its default for text in english, at least in lo-3.6), and I realised that I had far too many fonts - the font selector took more than two screens of height to scroll through all of them.

Fortunately, there are some perl tools: Font-TTF (needs IO-String) and a script (ttf2config.pl) which is shipped in the Examples/ directory of Font-TTF-Scripts. These allowed me to find what codepoints are available in TTF and OTF fonts. You can find my own scripts at linuxfromscratch.org/~ken

In general, two things interest me, and both are likely to trigger the "missing glyph" box: languages (on wikipedia this often means that IPA symbols are also present), and any news reports which are generally available. For me, news means whatever is available at the BBC and at google. I list those pages at the end as a guide to what is available for testing your fonts.

Note that I have not attempted to look at historic writing systems, and the non-text parts of unicode (number forms, arrows, mathematical symbols, etc) are of no interest to me. I also don't attempt to show which combining diacritical marks ('accents') are available in a font.

There are three subdirectories here:

example-pdfs/

This has a PDF for each of the fonts, showing: 0-9, the latin (english) alphabet, IPA if present [in reality, a lot of what is assigned to “latin” is actually used for phonetics, or is historic, and a few “IPA” codepoints such as schwa and ezh are used in some alphabets], the greek alphabet if present, the cyrillic (russian) alphabet if present, diacriticals and extra letters where applicable, and other (current) writing systems.

There are also two CJK PDFs showing some glyphs which differ between the variants (simplified chinese, traditional chinese, japanese, korean) for all the CJK fonts I have.

Where a font covers a small random selection of extra latin or cyrillic codepoints, I have not shown those because I assumed that the font was not particularly useful for general latin or cyrillic languages. Often, those glyphs appear to have been carried over from some other font.

I first noticed this on some fonts which definitely were not designed to support latin or cyrillic. More recently, I've seen a note (didn't keep the URL) that the latin and greek glyphs within CJK fonts appear to be completely useless (bad sizes, mix of monospaced and variable width).

However, if a font has gone to the trouble of including a large number of extra latin or cyrillic codepoints I attempt to show all of them. There is probably some inconsistency in where I have placed variant letters, particularly those from the cyrillic alphabet.

To save space, I have only included lowercase variants of diacriticals, extra letters and ligatures. This has the side-effect that dotless ı is included as a variant of 'i' and dotted İ is not shown - in general, if a font has dotless ı, and both s and t with cedilla, then it will have dotted İ for Turkish. Also, I ignore most non-current-language and non-language blocks, so nothing in supplemental plane 1 is shown.

The only digits I have included are 0-9 - indic and arabic fonts will usually include their appropriate digits. Any glyph described as some sort of sign is ignored. Where I noticed that vowel or tone markings combine with an earlier letter, I ignore them. Old writing systems are also ignored, with the exception of polytonic greek (still used by some people), syriac (used for Assyrian Neo-Aramaic) and coptic which might still be used in religious writing. Braille is also ignored because my PDFs have no texture.

CJK codepoints are a minefield - almost nobody will ever need the majority of them. To keep them manageable, I show only the codepoints which wikipedia thinks will differ between the various scripts (simplified chinese, traditional chinese, japanese, korean).

To help me create these documents I used gucharmap : by right-clicking on a glyph it will tell you which font it is displaying it from. That might not be the same as the font you selected - that's the wonder of fontconfig which will substitute from another font if the glyph is not in the one you chose.

Unfortunately, LibreOffice-4.0.2 has slightly changed its font handling - if I mark a block of text to be a specific font, it no longer changes the font name in the status when the cursor is at a glyph which that font does not support. Sometimes, the difference is obvious (e.g. latin glyphs only in a Freefont face), but it is possible that some errors may have crept in because of this.

ttf-otf-coverage/

This has a file for each font listing the available glyphs within their unicode blocks.
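As a rough illustration of grouping codepoints by unicode block (a minimal sketch, not the scripts linked above): the block table here is a small hand-picked excerpt, and the names `BLOCKS`, `block_of` and `coverage` are made up for this example. A complete table would normally be taken from the Blocks.txt file published by the unicode consortium.

```python
from bisect import bisect_right
from collections import Counter

# Excerpt only: (first, last, block name), sorted by first codepoint.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_of(cp):
    """Return the block name for a codepoint, or None if it is not in the excerpt."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None


def coverage(codepoints):
    """Count how many of the given codepoints fall in each block."""
    return Counter(block_of(cp) for cp in codepoints)


# Hypothetical font containing a, schwa, alpha and cyrillic ha.
for name, count in coverage([0x61, 0x259, 0x3B1, 0x445]).items():
    print(name, count)
```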

I only ever list the regular versions, not bold or italic. To understand what is covered, consult the code charts at http://www.unicode.org - note that an apparent gap in the ranges does not necessarily mean that something has not been included - for various reasons certain codepoints have not been assigned by the unicode consortium.
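The point about gaps can be checked mechanically. The sketch below (illustrative only, with made-up function names) collapses a list of codepoints into ranges and then reports only those missing codepoints which are actually assigned. It relies on the unicode data bundled with the Python interpreter, so the answer depends on which unicode version that Python ships.

```python
import unicodedata


def to_ranges(codepoints):
    """Collapse codepoints into sorted, inclusive (start, end) runs."""
    ranges = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def is_assigned(cp):
    """General category "Cn" marks a codepoint as unassigned."""
    return unicodedata.category(chr(cp)) != "Cn"


def real_omissions(codepoints):
    """Assigned codepoints lying in the gaps between the runs."""
    runs = to_ranges(codepoints)
    missing = []
    for (_, prev_end), (next_start, _) in zip(runs, runs[1:]):
        missing.extend(
            cp for cp in range(prev_end + 1, next_start) if is_assigned(cp)
        )
    return missing


# Greek capitals alpha..omega: U+03A2 is reserved, so the gap there
# is not something the font has left out.
greek_caps = list(range(0x391, 0x3A2)) + list(range(0x3A3, 0x3AA))
print([f"U+{a:04X}-U+{b:04X}" for a, b in to_ranges(greek_caps)])
print(real_omissions(greek_caps))
```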

ttf-otf-glyphs/

This has a line by line listing of the glyphs (codepoints) that a font supports. These are included so that you can, if you wish, grep for a specific codepoint to see if it is in a font.
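If you want to ask the question the other way round (which fonts have a given codepoint?), something like the following sketch could search several listings at once. It is only an illustration: it assumes each listing line begins with the codepoint in hex, optionally prefixed by "U+" or "0x", which may not match the real files, and the names `line_codepoint` and `fonts_with` are invented for the example.

```python
import re

# Assumed line format: hex codepoint first, e.g. "U+0259", "0x0259" or "0259".
HEX_TOKEN = re.compile(r"(?:U\+|0x)?([0-9A-Fa-f]{4,6})\b")


def line_codepoint(line):
    """Parse the leading hex codepoint of a listing line, or return None."""
    match = HEX_TOKEN.match(line.strip())
    return int(match.group(1), 16) if match else None


def fonts_with(codepoint, listings):
    """listings maps a font name to an iterable of its listing lines."""
    return sorted(
        name
        for name, lines in listings.items()
        if any(line_codepoint(line) == codepoint for line in lines)
    )


# Hypothetical listings; in practice the lines would be read from the files.
listings = {
    "FontA": ["U+0041", "U+0259 schwa"],
    "FontB": ["0041", "0042"],
}
print(fonts_with(0x259, listings))
```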

Omissions

There are probably thousands of unicode fonts which I have not listed because I already have a font which will render the glyphs. But there are some other writing systems which I have ignored: in general, anything no longer in use, e.g. Tagalog (Baybayin) which died out by the 19th century. The following are worth mentioning:

Mongolian Script - the cyrillic alphabet is usually used. Although Mongolian Script has been in Unicode since 1999, it is written in vertical columns. There is a font at mongolfont.com which includes Roman numerals and circled numerals rotated through -90° but I know of no way of preparing vertical paragraphs. There is pastable example text at unicode.org but it reads from left to right.

Cham Script, Lepcha (Róng), Limbu, Lisu, Mandaic, New Tai Lue, Ol Chiki, Chakma : I cannot find any pastable text in these writing systems. For Chakma there is a RibengUni font available from hilledu.com, for the others I have not attempted to find fonts.

Note: web pages for testing -

The BBC pages with news in different languages are at

http://www.bbc.co.uk/worldservice/languages/index.shtml : arabic, azeri, bangla (bengali), burmese (the title on the firefox tab doesn't render, but the content does), cantonese <zhongwen/trad>, chinese <zhongwen/simp>, french, hausa, hindi, indonesian, kinyarwanda <gahuza, kirundi>, kyrgyz, nepali, pashto, persian, portuguese (brazilian), russian, sinhala, somali, spanish, swahili, tajik (linked from kyrgyz and uzbek pages but not on the index), tamil, turkce (turkish), ukrainian, urdu, uzbek, vietnamese.

And of course there is http://news.google.com - including czech, german, spanish, french, italian, hungarian, dutch, norwegian, polish, swedish, greek, serbian (both alphabets), israeli, arabic, indian (hindi, tamil, telugu, malayalam), korean, chinese (simplified, taiwanese traditional, hong kong).

Also, there is example text at http://unicode.org/udhr and in many language-related pages at wikipedia - use your native language's wiki!

For other font sites, you may wish to look at http://www.alanwood.net and http://www.wazu.jp.

Erratum

Please note that some of the example letters in the next paragraph will not render unless you already have a suitable font.

I've treated latin letters from a North-West European perspective, so o-stroke or o-slash (ø) is listed as a different letter, but o-horn (ơ) is shown among the diacritical variants of o, and similarly for u-horn (ư). I'm sure there are also several oddities in how I've grouped some of the cyrillic variations - language-specific variants are deliberately included with the diacriticals, e.g. abkhasian ha (ҩ) is shown as a variant of cyrillic ha (х), but I think I've sometimes wrongly shown abkhasian dze (ӡ) as a variant of cyrillic ze (з) instead of as a variant of cyrillic (macedonian) dze (ѕ), which looks identical to latin s.

ChangeLog

2013-02-28 Initial upload, baekmuk, bkai00mp, bsmi00lp, cantarell, Charis-SIL, CJK examples, DejaVu, fireflysung, FreeMono, kochi, bitstream-vera.

2013-03-04/5 First versions of this explanatory html.

2013-03-08 Minor text changes, place the link to example-pdfs first.

2013-03-10 Redo the PDFs to mention version numbers. Add example files for FreeSans, FreeSerif, gbsn00lp, gkai00mp, ukai, uming and correct the kochi examples (the names were swapped). Remove unnecessary directories where the package name can be derived from the font name.

2013-03-15 Add Liberation and Lohit fonts.

After this, it became apparent that for some fonts I was not listing all of the glyphs. This is almost certainly a bug in how I wrapped ttf2config.pl, but the details of what was wrong eluded me. While I was looking around, I discovered that Font-TTF-Scripts includes a "fret" program which will produce a PDF listing of everything within a TTF font.

I was sorely tempted to use this to show the contents of the fonts, but when I tried it on a CJK font the resulting PDF was large - too big to fit in my free webspace, and even if I had purchased space it would incur a fortune in bandwidth costs when people downloaded it. So, I tried using that to get a list of the glyphs (those PDFs are composed of text which can be parsed - that isn't true for the language PDFs I mention below). After a good start, it soon became apparent that some other glyphs were occasionally missing. Also, fret cannot understand OTF files. So, in the end, for OTF files I'm continuing to use my first process. For TTF files I'm using both processes to try to catch everything - I still don't think it's perfect, but it seems close enough.

I also discovered that some of my scripting to assign glyphs to the relevant Unicode blocks, and to name the blocks, had a few errors in it. I've fixed that, and recreated the "coverage" files.

Then I spent some time looking at the TTC fonts (these work in fontconfig/freetype, but I hadn't been able to analyze them) and made a breakthrough - if I build fontforge, I can load a TTC into it, select one of the fonts within the TTC, and save that as a TTF (this usage doesn't need the freetype source code to be available when compiling fontforge). I can then feed that to my normal scripts. This means I will be able to produce lists of the glyphs and coverage for UKai and UMing.
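For anyone curious what is inside a TTC before loading it into fontforge, the container starts with a short header listing where each member font begins. The sketch below is an illustration only, not part of the fontforge route described above: it reads that header following the layout in the OpenType specification (tag "ttcf", two 16-bit version fields, a 32-bit font count, then one 32-bit offset per font). It reports the member offsets and does not extract a usable TTF; the function name `ttc_font_offsets` is made up.

```python
import struct


def ttc_font_offsets(data):
    """Return the byte offsets of the member fonts listed in a TTC header."""
    if data[:4] != b"ttcf":
        raise ValueError("not a TrueType Collection")
    # Big-endian: majorVersion, minorVersion (uint16 each), numFonts (uint32).
    _major, _minor, num_fonts = struct.unpack_from(">HHI", data, 4)
    # The offset array follows directly after the 12-byte fixed header.
    return list(struct.unpack_from(f">{num_fonts}I", data, 12))


# Synthetic header for a collection of two fonts (offsets chosen arbitrarily).
header = b"ttcf" + struct.pack(">HHI", 1, 0, 2) + struct.pack(">2I", 20, 300)
print(len(ttc_font_offsets(header)), "fonts in the collection")
```

With a real file, the bytes would come from something like `open(path, "rb").read()`, where the path is whatever TTC you have installed.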

It also occurred to me that while listing all of the glyphs is fine, it doesn't give a feel for how text will actually look. It seems to be conventional to show the text from Article 1 of the Universal Declaration of Human Rights, so that is what I've done. My choice of languages is random, particularly trying to show some of the less common glyphs.

Unless noted, I've taken the text from http://unicode.org/udhr because that is pastable. See also research.ics.aalto.fi for PDFs of other languages - the primary source is the PDFs at The Office of the High Commissioner for Human Rights (OHCHR).

The problem with a PDF, in this context, is that it will use an unknown font so the only way to get the text is to transcribe it - that means you need to be able to understand its format (for some of the writing systems I cannot identify where the preamble ends), and to be able to identify the glyphs. Unfortunately, I cannot parse the contents of these PDFs. And for some writing systems there does not seem to be any translation of Article 1. This means I cannot show text for the following writing systems:

If you are interested, you might also find text at www.omniglot.com. The examples there are graphics, so that you can see them in a browser which doesn't have the necessary fonts. Whenever I've tried to transcribe them for non-european text I've encountered glyphs which do not match any of the available glyphs in my fonts (or, in one case, a glyph which only matched a symbol for "one-half"). Most of this is perhaps down to different forms, such as the two common forms of lowercase g in latin alphabets (single-storey vs double-storey).

At a late stage, I found www.geonames.de which has some text in other languages.

2013-06-02 Uploaded these changes, removed subdirectories.

2013-06-05 Added Amharic example text to FreeSerif PDF. Added the luxi fonts, Jomolhari, KhmerOS, NuosuSIL. Added link to my own scripts. Added Padauk-Book.

2013-07-17 Added CJK-Examples-2 PDF (the odosung font has the same internal name as fireflysung) and details of its fonts: odohei, odokai, odosung, odosung-mono. Also added SourceCodePro, SourceSansPro, Ubuntu, UbuntuMono.

As of June 2013, the listed versions were current so this document is now complete.

Ken Moffat, 2013-07-17. E&OE