
Precomposed character

Article Id: WHEBN0000497891

Title: Precomposed character  
Author: World Heritage Encyclopedia
Language: English
Subject: Unicode, Unicode equivalence, Script (Unicode), Yiddish orthography
Collection: Unicode
Publisher: World Heritage Encyclopedia

Precomposed character

A precomposed character (alternatively composite character or decomposable character) is a Unicode entity that can also be defined as a sequence of one or more other characters. A precomposed character typically represents a letter with a diacritical mark, such as é (Latin small letter e with acute accent). Technically, é (U+00E9) is a single character that can be decomposed into an equivalent sequence consisting of the base letter e (U+0065) followed by the combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes.
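The decomposition described above can be inspected with Python's standard unicodedata module. This is a minimal sketch: the variable names are illustrative only, and the ligature ﬁ (U+FB01) is an example chosen here, not one taken from the article.

```python
import unicodedata

# Precomposed é: a single code point.
precomposed = "\u00e9"

# Canonical decomposition mapping recorded in the Unicode Character Database.
mapping = unicodedata.decomposition(precomposed)  # "0065 0301"

# NFD applies canonical decomposition: base letter e + combining acute accent.
decomposed = unicodedata.normalize("NFD", precomposed)

# NFC recomposes the sequence into the precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)

# A ligature (illustrative choice): U+FB01 has only a *compatibility*
# decomposition, so NFD leaves it alone while NFKD expands it to "fi".
ligature = "\ufb01"
ligature_mapping = unicodedata.decomposition(ligature)  # "<compat> 0066 0069"
ligature_expanded = unicodedata.normalize("NFKD", ligature)

print(mapping, [f"U+{ord(ch):04X}" for ch in decomposed])
print(ligature_mapping, ligature_expanded)
```

Note the difference between the two cases: é decomposes canonically, while the ligature is only compatibility-equivalent to its letters, so the two are handled by different normalization forms.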

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.

Contents

  • Comparing precomposed and decomposed characters
  • Chinese characters
  • See also
  • Sources
  • External links

Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in two alternative ways: the first uses a precomposed Å (U+00C5) and ö (U+00F6), and the second uses a decomposed base letter A (U+0041) with a combining ring above (U+030A) and an o (U+006F) with a combining diaeresis (U+0308).

  1. Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
  2. Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

Apart from their underlying code points, the two forms are canonically equivalent and should render identically. In practice, however, some Unicode implementations still have difficulty with decomposed characters. In the worst case, combining diacritics may be ignored or rendered as unrecognized characters after their base letters, because not all fonts include them. To work around such problems, some applications simply replace decomposed sequences with the equivalent precomposed characters.
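The equivalence of the two spellings, and the replacement of decomposed sequences by precomposed characters, can be sketched with Unicode normalization. The helper name code_points is hypothetical, and using NFC here is an assumption about how an application might do the replacement, not a description of any particular program.

```python
import unicodedata

# The two spellings of Åström from the example above, written with escapes
# so that the exact code points are unambiguous.
precomposed = "\u00c5str\u00f6m"       # Å and ö as single code points
decomposed = "A\u030astro\u0308m"      # A + ring above, o + diaeresis

def code_points(text):
    """Format a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# A naive comparison sees two different strings of different lengths.
naive_equal = precomposed == decomposed

# After normalizing both to the same form, they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(code_points(precomposed))
print(code_points(decomposed))
print(naive_equal, nfc_equal, nfd_equal)
```

Normalizing to NFC before comparing or storing text is one common way to avoid treating the two spellings as different names.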

With an incomplete font, however, precomposed characters may also be problematic – especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for "dog"):

  3. ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
  4. ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

In some situations, the precomposed k, u and o with diacritics (line 3) may render as unrecognized characters, or their typographical appearance may differ markedly from that of the final letter n, which has no diacritic. In the decomposed form (line 4), the base letters should at least render correctly even if the combining diacritics are not recognized.
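The relationship between lines 3 and 4 can be checked in the same way. The sketch below also shows what remains when combining marks are dropped, which roughly mirrors the fallback described above; the function name strip_marks is hypothetical, and stripping marks in software is only an approximation of what a renderer without the right glyphs would display.

```python
import unicodedata

# Lines 3 and 4 of the example, written with escapes.
precomposed = "\u1e31\u1e77\u1e53n"
decomposed = "k\u0301u\u032do\u0304\u0301n"

# Full canonical decomposition of line 3 yields exactly line 4,
# and composing line 4 yields line 3 again.
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed

def strip_marks(text):
    """Decompose, then drop every combining mark (illustrative helper)."""
    nfd = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in nfd if unicodedata.combining(ch) == 0)

base_letters = strip_marks(precomposed)

print(nfd_matches, nfc_matches, base_letters)
```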

OpenType provides the ccmp feature tag (Glyph Composition/Decomposition), which lets a font define glyphs that are compositions or decompositions involving combining characters.

Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent strokes and ideograph descriptions using Chinese character description languages. Unicode does not take this approach, which would be a considerable departure from current practice in text storage and layout. Such an approach could potentially reduce the number of characters in the character set from tens of thousands to just a few hundred. On the other hand, documents encoded in this way would need many more bytes, perhaps ten times as many, to represent the same text as in Unicode.
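The size trade-off can be illustrated with a small sketch. It assumes, purely for illustration, that the character 好 (U+597D) is described by the ideographic description sequence ⿰女子 (U+2FF0 U+5973 U+5B50). Such sequences exist in Unicode for describing ideographs, not as an alternative encoding, and this one works at the level of components rather than strokes, so a stroke-level scheme would be longer still.

```python
# One encoded ideograph versus a description built from its components.
single = "\u597d"                  # 好 as a single code point
described = "\u2ff0\u5973\u5b50"   # ⿰ (left-to-right) + 女 + 子

# Compare the storage cost in two common encodings.
sizes = {}
for encoding in ("utf-8", "utf-16-le"):
    sizes[encoding] = (len(single.encode(encoding)),
                       len(described.encode(encoding)))

for encoding, (one, many) in sizes.items():
    print(f"{encoding}: {one} bytes encoded directly, {many} bytes described")
```

Even this simple two-component character triples in size, which gives a rough sense of why a fully decomposed encoding would inflate documents.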

See also

Sources

  • The Unicode Standard, Version 5.2: Conformance (see Section 3.7 for Decomposition). The Unicode Consortium, December 2009.
  • Aaron Weiss: Composite and Precomposed Characters. Web Developer's Virtual Library. February 20, 2001.
  • MSDN: Defining a Character Set. April 8, 2010.

External links

  • Free Idg Serif, a derivative of the FreeSerif font with added declarations of precomposed characters.
This article was sourced under the Creative Commons Attribution-ShareAlike License; additional terms may apply.