Posted 2008-04-12 15:50. Tagged characters, markup, unicode, hyphen.
There are a lot of dashes and hyphenation marks in Unicode. Here are the more important ones, when to use them, and how your browser handles them.
The hyphen‐minus (U+002D) is the plain ASCII dash character. It has rather ambiguous semantics, as it may stand in for any of the other hyphens, dashes, or minus signs.
This character should be used only as an ASCII fallback when it is not practical to use the more specific characters listed below.
Example: Dorem-ipsum-dolor-sit-amen the-quick-brown-fox-jumps-over-the-lazy-dog super-cali-fragi-listic-expiali-docious the-quick-brown-fox-jumps-over-the-lazy-dog Dorem-ipsum-dolor-sit-amen super-cali-fragi-listic-expiali-docious
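When only ASCII is practical, the more specific characters can be folded back to the hyphen‐minus. The following is a minimal Python sketch of such a fallback. The function name `to_ascii_dashes` and the mapping choices (dropping the soft hyphen, writing the em dash as two hyphen‐minus characters) are my own assumptions, not any standard.

```python
# Illustrative ASCII fallback table; the replacement choices are assumptions.
ASCII_FALLBACK = {
    "\u00ad": "",    # soft hyphen: only a break hint, so drop it
    "\u2010": "-",   # hyphen
    "\u2011": "-",   # non-breaking hyphen
    "\u2012": "-",   # figure dash
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash, written as two hyphen-minus characters
    "\u2212": "-",   # minus sign
}

_TABLE = str.maketrans(ASCII_FALLBACK)


def to_ascii_dashes(text: str) -> str:
    """Replace Unicode hyphens, dashes and the minus sign with ASCII."""
    return text.translate(_TABLE)


print(to_ascii_dashes("17 \u2212 9 = 8"))
```

Note that the fallback loses information: once everything is a hyphen‐minus, the original distinction cannot be recovered.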
Soft hyphen (U+00AD)
This character is either the glyph to insert where a soft hyphenation occurs, or a character that marks a place where a soft hyphen might occur. In HTML, the latter semantics is preferred. As such, it is invisible except where a hyphenation actually occurs, where it is rendered with the same glyph as the hyphen‐minus.
Example: Doremipsumdolorsitamen thequickbrownfoxjumpsoverthelazydog supercalifragilisticexpialidocious thequickbrownfoxjumpsoverthelazydog Doremipsumdolorsitamen supercalifragilisticexpialidocious
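Marking the break opportunities in a long word can be done mechanically once the break positions are known. Here is a small Python sketch; the helper `add_soft_hyphens` and the break positions are hypothetical and chosen by hand, not taken from any hyphenation dictionary.

```python
SOFT_HYPHEN = "\u00ad"


def add_soft_hyphens(word: str, break_points: list[int]) -> str:
    """Insert U+00AD before the characters at the given indices."""
    pieces = []
    start = 0
    for index in sorted(break_points):
        pieces.append(word[start:index])
        start = index
    pieces.append(word[start:])
    return SOFT_HYPHEN.join(pieces)


# Hand-picked positions: super|cali|fragi|listic|expiali|docious
marked = add_soft_hyphens("supercalifragilisticexpialidocious", [5, 9, 14, 20, 27])
print(len(marked))  # five characters longer than the unmarked word
```

The marked string looks identical to the original when printed, since the soft hyphens only show up where a line is actually broken.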
In serious typography, a soft hyphen isn’t enough. Instead, the typesetter uses a database of words that records where each word may be hyphenated, with a cost for each hyphenation point. Ragged edges and long spaces also have a cost, and the hyphenation points are chosen by minimizing the total “ugliness cost” of the paragraph. FrameMaker and TeX use variants of this; I don’t know about MS Word or FOP.
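To make the idea of minimizing an “ugliness cost” concrete, here is a toy Python sketch using dynamic programming. It is not the algorithm used by TeX or FrameMaker. The cost model (squared unused width per line, a fixed `HYPHEN_PENALTY` for breaking inside a word, and a free last line) and the function name `break_paragraph` are my own assumptions for illustration.

```python
from functools import lru_cache

HYPHEN_PENALTY = 50  # assumed cost of breaking inside a word


def break_paragraph(words, width):
    """Return the lines with the lowest total cost.

    words: list of words, each given as a list of fragments between
           which hyphenation is allowed, e.g. [["hy", "phen"], ["is"]].
    """
    # Flatten to (text, separator that follows this fragment).
    pieces = []
    for fragments in words:
        for k, text in enumerate(fragments):
            is_word_end = k == len(fragments) - 1
            pieces.append((text, " " if is_word_end else ""))
    n = len(pieces)

    @lru_cache(maxsize=None)
    def best(i):
        """Cheapest (cost, lines) for pieces[i:]."""
        if i == n:
            return 0, ()
        result = None
        line = ""
        for j in range(i, n):
            text, sep = pieces[j]
            line += text
            if len(line) > width:
                break  # adding more fragments only makes the line longer
            hyphenated = sep == ""  # breaking here splits a word
            shown = line + "-" if hyphenated else line
            if len(shown) <= width:
                if j == n - 1:
                    cost = 0  # the last line may be short for free
                else:
                    cost = (width - len(shown)) ** 2
                    if hyphenated:
                        cost += HYPHEN_PENALTY
                rest_cost, rest_lines = best(j + 1)
                total = cost + rest_cost
                if result is None or total < result[0]:
                    result = (total, (shown,) + rest_lines)
            line += sep
        if result is None:
            raise ValueError("a fragment does not fit in the line width")
        return result

    return list(best(0)[1])


sample = [["hy", "phen", "ation"], ["is"], ["hard"]]
for line in break_paragraph(sample, 10):
    print(line)
```

With a width of 10 the word has to be split, so the sketch pays the hyphenation penalty once. With a width of 11 the whole word fits on the first line and no hyphen is used.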
The hyphen (U+2010) is always visible, and lines can break after it, much like a non‐breaking hyphen followed by a zero‐width space. In most cases when you put a dash in a word, this is the character to use.
Example: Dorem‐ipsum‐dolor‐sit‐amen the‐quick‐brown‐fox‐jumps‐over‐the‐lazy‐dog super‐cali‐fragi‐listic‐expiali‐docious the‐quick‐brown‐fox‐jumps‐over‐the‐lazy‐dog Dorem‐ipsum‐dolor‐sit‐amen super‐cali‐fragi‐listic‐expiali‐docious
Non‐breaking hyphen (U+2011)
As the name says, this is the same character as the ordinary hyphen, except that lines are not allowed to break after it, just as with an ordinary letter.
Example: Dorem‑ipsum‑dolor‑sit‑amen the‑quick‑brown‑fox‑jumps‑over‑the‑lazy‑dog super‑cali‑fragi‑listic‑expiali‑docious the‑quick‑brown‑fox‑jumps‑over‑the‑lazy‑dog Dorem‑ipsum‑dolor‑sit‑amen super‑cali‑fragi‑listic‑expiali‑docious
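Since these characters look nearly identical on screen, it can help to inspect them programmatically. The sketch below uses Python’s standard `unicodedata` module; the helper name `describe` is my own.

```python
import unicodedata

# Hyphen-minus, soft hyphen, hyphen, non-breaking hyphen.
HYPHENS = ["\u002d", "\u00ad", "\u2010", "\u2011"]


def describe(char: str) -> str:
    """Format a character as 'U+XXXX NAME (general category)'."""
    return "U+{:04X} {} ({})".format(
        ord(char), unicodedata.name(char), unicodedata.category(char)
    )


for char in HYPHENS:
    print(describe(char))
```

The soft hyphen is reported with general category Cf (a format character), while the other three are Pd (dash punctuation), which matches its role as an invisible break hint.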
The em dash (U+2014) is used for insertions — like this — in text. The em dash is one em unit long, i.e. as long as the line is high.
There is also an en dash (U+2013), which is as long as an n is wide, or half as long as the em dash.
In Swedish, a slightly shorter dash, ¾ em, is used where the em dash is used above. I don’t know of a proper Unicode character for this, so I often use an en dash (U+2013) instead.
Dashes and numbers
For ranges, use the en dash (U+2013).
Days per month: 28 – 31.
In text, the word “to” should be preferred.
There are 28 to 31 days in a month.
As a number separator, use the figure dash (U+2012). My phone number is 08‒656 92 02.
There is a specific minus sign (U+2212). 17 − 9 = 8.
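The number conventions above are easy to apply when generating text. A small Python sketch follows; the helper names are hypothetical, and the spacing around the en dash simply follows the example in this post.

```python
EN_DASH = "\u2013"
FIGURE_DASH = "\u2012"
MINUS_SIGN = "\u2212"


def format_range(low: int, high: int) -> str:
    """Format a numeric range with an en dash, e.g. days per month."""
    return f"{low} {EN_DASH} {high}"


def format_signed(value: int) -> str:
    """Format an integer, using the real minus sign for negatives."""
    return f"{MINUS_SIGN}{-value}" if value < 0 else str(value)


def format_subtraction(a: int, b: int) -> str:
    """Format 'a − b = result' with the minus sign, not a hyphen-minus."""
    return f"{a} {MINUS_SIGN} {b} = {format_signed(a - b)}"


def join_number_groups(prefix: str, rest: str) -> str:
    """Join two digit groups (e.g. area code and number) with a figure dash."""
    return f"{prefix}{FIGURE_DASH}{rest}"


print(format_range(28, 31))
print(format_subtraction(17, 9))
print(join_number_groups("08", "656 92 02"))
```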
So, can all these nice dashes be used in ordinary web pages?
- Firefox 3.0 (beta) handles all the dashes correctly (the hyphen‐minus is treated like the hyphen).
- Earlier Firefox versions render correct glyphs for all the dashes, but don’t allow line breaking after any of them (unless there is a space there as well). The soft hyphen is ignored, and the hyphen and hyphen‐minus are treated just like the non‐breaking hyphen.
- MSIE 7 (on Windows XP) handles all the dashes correctly (the hyphen‐minus is treated like the hyphen).
If you view this page in another browser, please tell me how it handles the dashes.
Note: The exact length relations of the plain dash, en dash, figure dash, and minus sign vary between platforms and fonts. This is not an error. Among typographic glyphs, there are also ¼ em, ½ em, ¾ em, 1 em, and 1¼ em dashes, and possibly some more. I would like to have at least the ¾ em dash in Unicode, but I can’t find one.