Best collation to use in web development with php and mysql

There are very little common things which developer easily skip and choose defaults options. It will create not a problem immedietly but you have to dig your head someday. Recently i faced a same challange with one of my sql query returning wrong results while comparing some hindi phrases.

After hours of mind juggling i finally found this is a problem of mysql collation. A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations. A collation orders characters based on weights.

Now if you see while creating a Mysql Database there is long list of collations available. on php my admin mostly default selected as
utf8_general_ci, which i had selected while developing my web application. utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. So when i start storing content is unicode (hindi) it start giving wrong results. So i switched to utf8_unicode_ci .

utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The specific language utf8 encoding (such as utf8_swedish_ci) contain additional language rules that make them the most accurate to sort for those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.

You can read more on specific unicode character sets on the MySQL manual – http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

6 thoughts on “Best collation to use in web development with php and mysql

  1. Georges Jean-Denis

    Why not use “utf8mb4_unicode_ci”, the newer “utf8_unicode_ci” ?
    I actually agree with you: “I prefer accuracy to small performance improvements”.

    https://www.google.com/search?q=utf8+vs+utf8mb4&oq=utf8+vs+utf8mb4&aqs=chrome..69i57.6752j0j7&sourceid=chrome&ie=UTF-8

    https://www.google.com/search?q=utf8_unicode_ci+vs+utf8mb4_unicode_ci&sxsrf=ALeKk03dqwj51ZYHOdf0WgwpxB3gxhi_Rg%3A1618942694131&ei=5hp_YK-zB8Gy5NoP49KO0AI&oq=utf8_unicode_ci+vs+utf8mb4&gs_lcp=Cgdnd3Mtd2l6EAEYADIFCAAQywEyBQgAEMsBMgYIABAIEB4yBggAEAgQHjIGCAAQCBAeMgYIABAIEB4yBggAEAgQHjoHCCMQsAMQJzoHCAAQRxCwAzoHCAAQsAMQQzoICAAQBxAeEBM6CggAEAgQBxAeEBNQhVZYsllg5YsBaAJwAngAgAHfAYgBxwOSAQUwLjIuMZgBAKABAqABAaoBB2d3cy13aXrIAQrAAQE&sclient=gws-wiz

    Reply
  2. Georges Jean-Denis

    Why not use “utf8mb4_unicode_ci” instead of “utf8_unicode_ci” ?
    Because it is newer, you should guess that there is an advantage.

    Reply
  3. Georges Jean-Denis

    Excuse me, I’m a newbie at this. But why not use “gbk_bin” instead of “utf8_unicode_ci” ?

    Reply
  4. Georges Jean-Denis

    Excuse me, I’m a newbie at this. But why not use “ascii_bin” instead of “utf8_unicode_ci” ?

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *