-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: upgrade mysql charset and collation to utf8mb4 #1065
feat!: upgrade mysql charset and collation to utf8mb4 #1065
Conversation
b203c94
to
938abca
Compare
0e17056
to
80796da
Compare
47f01f9
to
438b312
Compare
changelog.d/20240520_150131_danyal.faheem_upgrade_mysql_utf8mb4.md
Outdated
Show resolved
Hide resolved
docs/local.rst
Outdated
|
||
Your database's charset and collation might not support specific characters or emojis. To upgrade your charset and collation of all the tables in your database to utf8mb4 and utf8mb4_unicode_ci respectively, run:: | ||
|
||
tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci | ||
|
||
Alternatively, if you only want to upgrade certain tables or exclude certain tables, you can use the `status` option. To upgrade the `courseware_studentmodule` and `courseware_studentmodulehistory` tables, run:: | ||
|
||
tutor local do change-charset-collation --status=include courseware_studentmodule courseware_studentmodulehistory | ||
|
||
Tutor performs pattern matching from the start of the table name so you can just enter the name of the app to include/exclude all the tables under that app. To upgrade all the tables in the database except the ones under the student and wiki apps, run:: | ||
|
||
tutor local do change-charset-collation --status=exclude student wiki |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should specify that the change is not reversible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a warning note at the top.
docs/local.rst
Outdated
|
||
Your database's charset and collation might not support specific characters or emojis. To upgrade your charset and collation of all the tables in your database to utf8mb4 and utf8mb4_unicode_ci respectively, run:: | ||
|
||
tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which situation(s) will users have to run this command? Is it for users upgrading from Quince? Is it optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they can run it after upgrading from Quince. It is optional as tutor runs fine without it unless you add specific characters such as emojis that will throw DB errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That explanation should be present in the docs :)
tutor/commands/jobs.py
Outdated
context_settings={"ignore_unknown_options": True}, | ||
) | ||
@click.option("--all-tables", is_flag=True, help="change all tables in the openedx database") | ||
@click.option("-s", "--status", type=click.Choice(['include', 'exclude'], case_sensitive=False), help="Whether to include or exclude certain tables/apps") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This combination of options is not very intuitive. It means that the end users needs to specify at least one option -- and that is not a great pattern, as options are considered... optional :)
Instead, I suggest that:
- all tables are process by default.
- If
--include
is specified, then only those tables are processed. - If
--exclude
is specified, then those tables are not processed.
Note that with this approach, --include
and --exclude
are mutually compatible (--exclude
would override --include
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all tables are process by default
I did not want to make this the default scenario in case someone accidentally executes the command while trying out the different options provided and they did not want to upgrade all the tables since this is a potentially irreversible change. Upgrading all tables at once can also take a long time.
Maybe we can prompt the user if they wish to proceed?
Note that with this approach, --include and --exclude are mutually compatible (--exclude would override --include).
I think this can be achieved and sounds more plausible. I'll update the options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can prompt the user if they wish to proceed?
Yes, this sounds perfect. Please make sure to include a -I/--non-interactive
option for headless scripts.
tutor/commands/upgrade/common.py
Outdated
|
||
SET FOREIGN_KEY_CHECKS = 0; | ||
Select _table_name; | ||
SET @statement = CONCAT('ALTER TABLE ', _table_name, ' CONVERT TO CHARACTER SET {charset} COLLATE {collation};'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this change columns that currently use utf8mb4_bin
collation to be utf8mb4_unicode_ci
? If so, that's a problem. The relatively small number of columns that use utf8mb4_bin
are usually doing so because they explicitly want the values to be case sensitive, because case insensitivity on things like course keys has given us a lot of grief in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify that last statement, an example was that MongoDB and the Python layer of CourseKeys are case sensitive, but the key stored in our CourseOverview model is not case sensitive. So at one point we had two course runs in production that had keys that differed only by case (because someone had misspelled when doing the initial creation), and updates to one could clobber the values for the other because they were fighting over the same row in CourseOverview for updates. (The inactive course run was still getting automated push updates from an external system.)
I realize that fixing CourseOverview is well outside the scope of this work, and I'm not trying to suggest we tackle that here. I only put it out there as an example of why apps might want to explicitly specify a sensitive field for storing things like identifier keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this change columns that currently use utf8mb4_bin collation to be utf8mb4_unicode_ci?
It seems this is the case. I think we can add a condition that columns with the utf8mb4_bin
collation should not be upgraded. Any other specific charsets or collations that we should look out for here?
Or we can just upgrade all the columns that are using utf8mb3_general_ci
so that if some other column has their charset/collation fixed by the model in the future, we don't have to update the check for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your findings, it seems like the only forced collations that are not utf8mb4_unicode_ci
are:
utf8mb3_bin
: This should be mapped intoutf8mb4_bin
, since that's the logically equivalent case-sensitive collation.utf8mb4_bin
: This should be preserved as-is, as discussed.utf8mb4_general_ci
: This would be mapped toutf8mb4_unicode_ci
, but it looks like it only exists for thebundles
app, and that app is unused in Redwood and removed entirely just after the Redwood cut, so it's fine to ignore or update it, depending on what's more convenient in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the command to perform the following mappings instead:
utf8mb3_general_ci
toutf8mb4_unicode_ci
utf8mb3_*
toutf8mb4_*
meaning that theutf8mb3_bin
collations are converted toutf8mb4_bin
now.
utf8mb4_bin: This should be preserved as-is, as discussed.
We ignore any columns that already have the utf8mb4
charset.
utf8mb4_general_ci: This would be mapped to utf8mb4_unicode_ci, but it looks like it only exists for the bundles app, and that app is unused in Redwood and removed entirely just after the Redwood cut, so it's fine to ignore or update it, depending on what's more convenient in the code.
Yup, trying it out on the nightly release, the bundles app was not there. Therefore, I've ignored it.
tutor/commands/upgrade/common.py
Outdated
LEAVE tables_loop; | ||
END IF; | ||
|
||
SET FOREIGN_KEY_CHECKS = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this cause any issue when trying to do a partial migration that includes the course_overviews_courseoverview
table? That's the only table I'm aware of that has a varchar primary key and foreign keys from other tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this out by upgrading only the course_overviews_courseoverviewtab
table which has a foreign key to course_overviews_courseoverview
to utf8mb4 and keeping the rest at utf8mb3.
The LMS and CMS seemed to be working fine for me. I tested out the following things:
- Create new Course
- Publish changes in new course
- Enroll in new course and verify everything is working fine
- Edit an existing course
- Publish new course and verify changes can be seen by students on LMS
5cadc9e
to
73b3d38
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update PR description and add link to the 2nd PR that contains do command work.
fixes #938.
Changes
utf8mb4
and the default collation toutf8mb4_unicode_ci
.dosqlshell
command to useutf8mb4
as the default character set.Testing
tutor [dev|k8s] launch
Future Work
A new do command
tutor local do change-charset-collation
to allow users to upgrade the charset and collation of their tutor installations will be added in a later release of Tutor. A WIP PR is already up at #1079.