There are very little common things which developer easily skip and choose defaults options. It will create not a problem immedietly but you have to dig your head someday. Recently i faced a same challange with one of my sql query returning wrong results while comparing some hindi phrases.
After hours of mind juggling i finally found this is a problem of mysql collation. A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations. A collation orders characters based on weights.
Now if you see while creating a Mysql Database there is long list of collations available. on php my admin mostly default selected as utf8_general_ci,
which i had selected while developing my web application. utf8_general_ci
is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. So when i start storing content is unicode (hindi) it start giving wrong results. So i switched to utf8_unicode_ci .
utf8_general_ci
is somewhat faster than utf8_unicode_ci
, but less accurate (for sorting). The specific language utf8 encoding (such as utf8_swedish_ci
) contain additional language rules that make them the most accurate to sort for those languages. Most of the time I use utf8_unicode_ci
(I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.
You can read more on specific unicode character sets on the MySQL manual – http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
Why not use “utf8mb4_unicode_ci”, the newer “utf8_unicode_ci” ?
I actually agree with you: “I prefer accuracy to small performance improvements”.
https://www.google.com/search?q=utf8+vs+utf8mb4&oq=utf8+vs+utf8mb4&aqs=chrome..69i57.6752j0j7&sourceid=chrome&ie=UTF-8
https://www.google.com/search?q=utf8_unicode_ci+vs+utf8mb4_unicode_ci&sxsrf=ALeKk03dqwj51ZYHOdf0WgwpxB3gxhi_Rg%3A1618942694131&ei=5hp_YK-zB8Gy5NoP49KO0AI&oq=utf8_unicode_ci+vs+utf8mb4&gs_lcp=Cgdnd3Mtd2l6EAEYADIFCAAQywEyBQgAEMsBMgYIABAIEB4yBggAEAgQHjIGCAAQCBAeMgYIABAIEB4yBggAEAgQHjoHCCMQsAMQJzoHCAAQRxCwAzoHCAAQsAMQQzoICAAQBxAeEBM6CggAEAgQBxAeEBNQhVZYsllg5YsBaAJwAngAgAHfAYgBxwOSAQUwLjIuMZgBAKABAqABAaoBB2d3cy13aXrIAQrAAQE&sclient=gws-wiz
Why not use “utf8mb4_unicode_ci” instead of “utf8_unicode_ci” ?
Because it is newer, you should guess that there is an advantage.
Why not use “utf8mb4_unicode_ci” instead of “utf8_unicode_ci” ?
Why not use “macroman_bin” instead of “utf8_unicode_ci” ?
Excuse me, I’m a newbie at this. But why not use “gbk_bin” instead of “utf8_unicode_ci” ?
Excuse me, I’m a newbie at this. But why not use “ascii_bin” instead of “utf8_unicode_ci” ?