Ruby comes with good support for Unicode-related features. Read on if you want to learn more about important Unicode fundamentals and how to use them in Ruby…

…or just watch my talk from RubyConf 2017: ⑩ Unicode Characters You Should Know About as a ??

Unicode has come a long way and is now available in version 13.0 (core specification). The standard defines a lot of things related to characters; however, it is not always easy to grasp what a character actually is. Is DŽ a single character or not? What about non-Latin languages?

We will need some more fine-grained concepts to distinguish and talk about characters in Unicode:

Codepoint: A base unit to construct characters from. Often this maps directly to a single character. Depending on the encoding, a codepoint might require multiple bytes.
Grapheme cluster: Smallest linguistic unit, a user-perceived character, constructed out of one or multiple codepoints.
Glyph: The actual rendered shape which represents the grapheme cluster.

Codepoints

Codepoints are the base unit of Unicode: Each one is a number mapped to some meaning. Often this resolves to a single character: "\u…

See the standard and documentation for more details, including the differences between the normalization forms:

Unicode® Standard Annex #15: Unicode Normalization Forms

Special Case: Visual Confusable Characters

Even in normalization form, there are characters which look very similar (sometimes even identical). The record holder is LATIN SMALL LETTER O, which is currently linked to 75 other characters that it could be confused with.

Unicode® Technical Standard #39: Unicode Security Mechanisms

Another Unicode topic is converting a word from lowercase to uppercase or vice versa. Up until Ruby 2.3, string methods like #upcase, #capitalize, #downcase, or #swapcase would just not work with non-ASCII characters:

"ä".upcase # => "ä" # Ruby 2.3

This has been fixed and more recent versions of Ruby are able to do this out of the box:

"ä".upcase # => "Ä"

The old, ASCII-only behavior can be achieved by passing the :ascii option:

"ä".upcase(:ascii) # => "ä"

This is already much better than before; however, keep in mind that case mapping is a locale-dependent operation! Not all languages use the same rules for converting between lower- and uppercase. For example, in most languages, the uppercase version of the letter i is I:

"i".upcase # => "I"

However, in Turkic languages, it's the letter İ:

"i".upcase(:turkic) # => "İ"

Although Ruby supports special locale-specific case mapping rules, as of Ruby 2.5.1, only :turkic is supported. More options might be supported in the future.

Unicode® Standard Annex #44: Unicode Character Database / Section 5.6 (overview, see the respective sections in the Unicode standard itself)
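
To tie the concepts from this post together (codepoints vs. grapheme clusters, normalization forms, and locale-dependent case mapping), here is a minimal sketch. It assumes Ruby 2.5 or newer and UTF-8 strings. The helper name unicode_summary and the choice of Å as the sample character are made up for illustration and do not come from the post; the String methods it calls are part of Ruby core.

```ruby
# Hypothetical helper (not from the post): counts the different
# "units" Ruby sees in a string.
def unicode_summary(string)
  {
    bytes:      string.bytesize,                 # depends on the encoding (UTF-8 assumed)
    codepoints: string.codepoints.length,        # base units of Unicode
    graphemes:  string.grapheme_clusters.length, # user-perceived characters (Ruby 2.5+)
  }
end

# Assumed sample data: two spellings of the same user-perceived character.
composed   = "\u00C5"   # Å as one codepoint
decomposed = "A\u030A"  # A followed by COMBINING RING ABOVE

unicode_summary(composed)    # 2 bytes, 1 codepoint,  1 grapheme cluster
unicode_summary(decomposed)  # 3 bytes, 2 codepoints, 1 grapheme cluster

# Both render alike, but they only compare equal after normalization.
composed == decomposed                          # => false
composed == decomposed.unicode_normalize(:nfc)  # => true

# Case mapping: full Unicode by default (Ruby 2.4+), with opt-in variants.
"ä".upcase           # => "Ä"
"ä".upcase(:ascii)   # => "ä"  (ASCII-only, the pre-2.4 behavior)
"i".upcase(:turkic)  # => "İ"  (U+0130, dotted capital I)
```

Counting grapheme clusters rather than codepoints or bytes is usually what you want when the question is "how many characters does the user see?", while normalizing before comparison avoids treating visually identical strings as different.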