Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zc_install can result in "mixed collations" #6664

Open
lat9 opened this issue Aug 17, 2024 · 7 comments · May be fixed by #6785
Open

zc_install can result in "mixed collations" #6664

lat9 opened this issue Aug 17, 2024 · 7 comments · May be fixed by #6785

Comments

@lat9
Copy link
Contributor

lat9 commented Aug 17, 2024

  1. Performing an initial install, I created an empty database using utf8_general_ci collation. The zc_install process completed successfully, but all tables/table-fields were created with utf8mb4_general_ci collation.
  2. Performing an upgrade from a zc139h database (no 'DB_CHARSET' in the original configure.php files, since that was introduced in zc150). The database's base collation is latin_general_ci and all tables added by the zc_install process use utf8mb4_general_ci collation. The 'DB_CHARSET' is successfully added to the updated configure.php files, with a value of 'utf8mb4'.

This will lead to potential collation-mismatch errors in the future, since additional tables added will inherit the database's default collation, which is not necessary reflective of the 'DB_CHARSET' value.

@lat9
Copy link
Contributor Author

lat9 commented Aug 17, 2024

Upgrading a zc157 database with a utf8mb4_unicode_520_ci base collation; all existing table/table-fields also use that collation. All new tables added by zc_install use utf8mb4_general_ci.

@drbyte
Copy link
Member

drbyte commented Aug 17, 2024

We could go various directions with this.

But before considering that, I'll point out that the whole reason for mb4 is to support multibyte characters (like emojis) so that end-user content doesn't end up with ¿ characters all over their db and the customer's browser.

Thus ... we could begin with documenting that people should choose the mb4 variant wherever possible.

And we could add an alert in zc_install recommending that they update the database's default collation to a matching variant.

@drbyte
Copy link
Member

drbyte commented Aug 17, 2024

As for the 520 vs general variant, I've read mixed reviews online about whether to use 520. The challenge is in determining which one to align with, as not all mysql installs support all known collations. And mariadb has differences as well. And available options are determined not only by the selected engine but also the version installed ... and the host server's own installed capabilities.

IIRC that's why zc_install leans toward general

@lat9
Copy link
Contributor Author

lat9 commented Aug 17, 2024

IMO, zc_install should:

  1. Use the database's default collation for any newly-added tables.
  2. Report to the updating admin if the DB_CHARSET setting doesn't match the database's default collation.
  3. On an initial install, set the DB_CHARSET to latin1, utf8 or utf8mb4 based on its finding of that default collation.

@torvista
Copy link
Member

As I've jumped around hostings over the years, most if not all defaults were latin1, no doubt from hosting inertia and not any conscious decision.
Is that not the most inappropriate setting in this day and age of utf-8, and would only result in problems/have to be changed to utf-8 further down the line?

Assuming that most if not all users who do an install (implying a new start) are novices and would not know a collation from a bar of soap, would it not be better to ignore the db default and force utf-8 charset/collations from day one?

@lat9
Copy link
Contributor Author

lat9 commented Aug 17, 2024

As I've jumped around hostings over the years, most if not all defaults were latin1, no doubt from hosting inertia and not any conscious decision. Is that not the most inappropriate setting in this day and age of utf-8, and would only result in problems/have to be changed to utf-8 further down the line?

Assuming that most if not all users who do an install (implying a new start) are novices and would not know a collation from a bar of soap, would it not be better to ignore the db default and force utf-8 charset/collations from day one?

Agreed on the 'install', it's just a matter of zc_install (a) finding the most appropriate utf8-type collation, (b) setting the database's base collation to that and (c) registering either 'utf8' or 'utf8mb4' as the DB_CHARSET (based on the code's findings).

@drbyte
Copy link
Member

drbyte commented Sep 28, 2024

IMO, zc_install should:

  1. Use the database's default collation for any newly-added tables.
  2. Report to the updating admin if the DB_CHARSET setting doesn't match the database's default collation.
  3. On an initial install, set the DB_CHARSET to latin1, utf8 or utf8mb4 based on its finding of that default collation.

I have a strong reluctance to carry latin1 forward on upgrades. I think we should HALT the upgrade and refuse to proceed if latin1 is found. Forcing them to upgrade their database content first.

Otherwise if we just accept their latin1 data and try to babysit it forward into the future, support will be difficult because everything in the world uses utf8mb4 (or in some cases the lesser mb3, albeit that has its problems, and mb4 should be used).

So, yes, I'd agree with point 2, insomuch as if latin1 is detected, we just tell them that and so go grab the conversion utility and run it.

There's the added complication that most end-users do NOT have the ability to ALTER DATABASE and change the database's defaults. They "might" have a switch in their cPanel/whatever, but most likely need to ask tech support to change it for them. Therefore zc_install cannot be counted on to change it, and it's a false sense of security to offer it when the only place it would be able to do it is on dev's offline localhost databases, not on live servers.

If the user doesn't get that default collation/charset updated, then no matter how hard we try to prevent mixed collations, we can't. People can create new tables and new columns without specifying a collation, and in that case the database's default will be used. And if that's still latin, there's nothing we can do about that.
I suppose we could try to make Zen Cart's query_factory reverse-parse and inspect every query for any CREATE/ALTER statements and inject/override collations to babysit a store using latin1 ... by why are we coddling something that's all-but-deprecated?

Better to HALT and force conversion, and then carry on using meaningful defaults as it does now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants