MySQL 5.5 supports these Unicode character sets:
              ucs2, the UCS-2 encoding of the Unicode
              character set using 16 bits per character
            
              utf16, the UTF-16 encoding for the
              Unicode character set; like ucs2 but
              with an extension for supplementary characters
            
              utf32, the UTF-32 encoding for the
              Unicode character set using 32 bits per character
            
              utf8, a UTF-8 encoding of the Unicode
              character set using one to three bytes per character
            
              utf8mb4, a UTF-8 encoding of the
              Unicode character set using one to four bytes per
              character
            
          ucs2 and utf8 support
          BMP characters. utf8mb4,
          utf16, and utf32 support
          BMP and supplementary characters. The
          utf8mb4, utf16, and
          utf32 character sets (and their collations)
          were added in MySQL 5.5.3.
        
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.10, “Unicode Support”.
          A similar set of collations is available for each Unicode
          character set. These are shown in the following list, where
          xxx represents the character set
          name. For example,
          xxx_danish_ciucs2_danish_ci,
          utf16_danish_ci,
          utf32_danish_ci,
          utf8_danish_ci, and
          utf8mb4_danish_ci.
        
              xxx_bin
              xxx_czech_ci
              xxx_danish_ci
              xxx_esperanto_ci
              xxx_estonian_ci
              xxx_general_ci
              xxx_hungarian_ci
              xxx_icelandic_ci
              xxx_latvian_ci
              xxx_lithuanian_ci
              xxx_persian_ci
              xxx_polish_ci
              xxx_roman_ci
              xxx_romanian_ci
              xxx_sinhala_ci
              xxx_slovak_ci
              xxx_slovenian_ci
              xxx_spanish2_ci
              xxx_spanish_ci
              xxx_swedish_ci
              xxx_turkish_ci
              xxx_unicode_ci
          MySQL implements the
          xxx_unicode_cixxx_unicode_ciutf8_unicode_ci for concreteness.
        
          For any Unicode character set, operations performed using the
          xxx_general_cixxx_unicode_ciutf8_general_ci collation are faster, but
          slightly less correct, than comparisons for
          utf8_unicode_ci. The reason for this is
          that utf8_unicode_ci supports mappings such
          as expansions; that is, when one character compares as equal
          to combinations of other characters. For example, in German
          and some other languages “ß”
          is equal to “ss”.
          utf8_unicode_ci also supports contractions
          and ignorable characters. utf8_general_ci
          is a legacy collation that does not support expansions,
          contractions, or ignorable characters. It can make only
          one-to-one comparisons between characters.
        
          To further illustrate, the following equalities hold in both
          utf8_general_ci and
          utf8_unicode_ci (for the effect this has in
          comparisons or when doing searches, see
          Section 9.1.7.7, “Examples of the Effect of Collation”):
        
Ä = A Ö = O Ü = U
          A difference between the collations is that this is true for
          utf8_general_ci:
        
ß = s
          Whereas this is true for utf8_unicode_ci:
        
ß = ss
          MySQL implements language-specific collations for the
          utf8 character set only if the ordering
          with utf8_unicode_ci does not work well for
          a language. For example, utf8_unicode_ci
          works fine for German and French, so there is no need to
          create special utf8 collations for these
          two languages.
        
          utf8_general_ci also is satisfactory for
          both German and French, except that
          “ß” is equal to
          “s”, and not to
          “ss”. If this is acceptable
          for your application, then you should use
          utf8_general_ci because it is faster.
          Otherwise, use utf8_unicode_ci because it
          is more accurate.
        
          utf8_swedish_ci, like other
          utf8 language-specific collations, is
          derived from utf8_unicode_ci with
          additional language rules. For example, in Swedish, the
          following relationship holds, which is not something expected
          by a German or French speaker:
        
Ü = Y < Ö
          The
          xxx_spanish_cixxx_spanish2_ciñ” (n-tilde) is a separate
          letter between “n” and
          “o”. In addition, for
          traditional Spanish, “ch” is a
          separate letter between “c”
          and “d”, and
          “ll” is a separate letter
          between “l” and
          “m”
        
          In the
          xxx_roman_ciI and J
          compare as equal, and U and
          V compare as equal.
        
          For all Unicode collations except the “binary”
          (xxx_bin
              For BMP characters in general collations (for example,
              utf8_general_ci), weight = code point.
            
              For BMP characters in UCA collations (for example,
              utf8_unicode_ci), the following
              algorithm applies:
            
if (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
              The result is a sequence of two collating elements,
              aaaa followed by
              bbbb.
            
              Thus, U+04cf CYRILLIC SMALL LETTER
              PALOCHKA currently is, with all UCA collations,
              greater than U+04c0 CYRILLIC LETTER
              PALOCHKA. Eventually, after further collation
              tuning, all palochkas will sort together.
            
              For supplementary characters in general collations, the
              weight is the weight for 0xfffd REPLACEMENT
              CHARACTER. For supplementary characters in UCA
              collations, their collating weight is
              0xfffd. That is, to MySQL, all
              supplementary characters are equal to each other, and
              greater than almost all BMP characters.
            
              An example with Deseret characters and
              COUNT(DISTINCT):
            
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
              The result is 2 because in the MySQL Unicode collation,
              the replacement character has a weight of
              0x0dc6, whereas Deseret Bee and Deseret
              Tee both have a weight of 0xfffd. (Were
              the utf32_general_ci collation used
              instead, the result would be 1 because all three
              characters have a weight of 0xfffd in
              that collation.)
            
The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
          The utf16_bin
          Collation
        
          There is a difference between “ordering by the
          character's code value” and “ordering by the
          character's binary representation,” a difference that
          appears only with utf16_bin, because of
          surrogates.
        
          Suppose that utf16_bin (the binary
          collation for utf16) was a binary
          comparison “byte by byte” rather than
          “character by character.” If that were so, then
          the order of characters in utf16_bin would
          differ from the order in utf8_bin. For
          example, the following chart shows two rare characters. The
          first character is in the range
          E000-FFFF, so it is
          greater than a surrogate but less than a supplementary. The
          second character is a supplementary.
        
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
          The two characters in the chart are in order by code point
          value because 0xff9d <
          0x10384. And they are in order by
          utf8 value because 0xef
          < 0xf0. But they are not in order by
          utf16 value, if we use byte-by-byte
          comparison, because 0xff >
          0xd8.
        
          So MySQL's utf16_bin collation is not
          “byte by byte.” It is “by code
          point.” When MySQL sees a supplementary-character
          encoding in utf16, it converts to the
          character's code-point value, and then compares. Therefore,
          utf8_bin and utf16_bin
          are the same ordering. This is consistent with the SQL:2008
          standard requirement for a UCS_BASIC collation:
          “UCS_BASIC is a collation in which the ordering is
          determined entirely by the Unicode scalar values of the
          characters in the strings being sorted. It is applicable to
          the UCS character repertoire. Since every character repertoire
          is a subset of the UCS repertoire, the UCS_BASIC collation is
          potentially applicable to every character set. NOTE 11 —
          The Unicode scalar value of a character is its code point
          treated as an unsigned integer.”
        
          If the character set is ucs2, comparison is
          byte-by-byte, but ucs2 strings should not
          contain surrogates, anyway.
        
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).


User Comments
Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.
Add your own comment.