Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: upgrade mysql charset and collation to utf8mb4 #1065

Conversation

Danyal-Faheem
Copy link
Contributor

@Danyal-Faheem Danyal-Faheem commented May 16, 2024

fixes #938.

Changes

  • Changes the default character set of MySQL to utf8mb4 and the default collation to utf8mb4_unicode_ci.
  • Also updated the dosqlshell command to use utf8mb4 as the default character set.

Testing

  • I have tested these changes on tutor dev and tutor K8s. I followed these steps:
    • Ran tutor [dev|k8s] launch
    • Imported the demo course and made sure it was working fine
    • Created admin and student users successfully
    • Created a new course with emojis in the title and in subsection titles and made sure changes were being published and reflected in the LMS

Future Work

A new do command tutor local do change-charset-collation to allow users to upgrade the charset and collation of their tutor installations will be added in a later release of Tutor. A WIP PR is already up at #1079.

@Danyal-Faheem Danyal-Faheem marked this pull request as draft May 22, 2024 15:08
@DawoudSheraz DawoudSheraz force-pushed the redwood branch 2 times, most recently from b203c94 to 938abca Compare May 28, 2024 06:18
@Danyal-Faheem Danyal-Faheem force-pushed the danyalfaheem/upgrade-mysql-utf8mb4 branch from 0e17056 to 80796da Compare May 28, 2024 16:49
@Danyal-Faheem Danyal-Faheem force-pushed the danyalfaheem/upgrade-mysql-utf8mb4 branch from 47f01f9 to 438b312 Compare May 28, 2024 17:17
docs/local.rst Outdated
Comment on lines 146 to 157

Your database's charset and collation might not support specific characters or emojis. To upgrade your charset and collation of all the tables in your database to utf8mb4 and utf8mb4_unicode_ci respectively, run::

tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci

Alternatively, if you only want to upgrade certain tables or exclude certain tables, you can use the `status` option. To upgrade the `courseware_studentmodule` and `courseware_studentmodulehistory` tables, run::

tutor local do change-charset-collation --status=include courseware_studentmodule courseware_studentmodulehistory

Tutor performs pattern matching from the start of the table name so you can just enter the name of the app to include/exclude all the tables under that app. To upgrade all the tables in the database except the ones under the student and wiki apps, run::

tutor local do change-charset-collation --status=exclude student wiki
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should specify that the change is not reversible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a warning note at the top.

@Danyal-Faheem Danyal-Faheem marked this pull request as ready for review June 3, 2024 10:35
docs/local.rst Outdated

Your database's charset and collation might not support specific characters or emojis. To upgrade your charset and collation of all the tables in your database to utf8mb4 and utf8mb4_unicode_ci respectively, run::

tutor local do change-charset-collation --all-tables --charset=utf8mb4 --collation=utf8mb4_unicode_ci
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which situation(s) will users have to run this command? Is it for users upgrading from Quince? Is it optional?

Copy link
Contributor Author

@Danyal-Faheem Danyal-Faheem Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can run it after upgrading from Quince. It is optional as tutor runs fine without it unless you add specific characters such as emojis that will throw DB errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That explanation should be present in the docs :)

docs/local.rst Outdated Show resolved Hide resolved
context_settings={"ignore_unknown_options": True},
)
@click.option("--all-tables", is_flag=True, help="change all tables in the openedx database")
@click.option("-s", "--status", type=click.Choice(['include', 'exclude'], case_sensitive=False), help="Whether to include or exclude certain tables/apps")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combination of options is not very intuitive. It means that the end users needs to specify at least one option -- and that is not a great pattern, as options are considered... optional :)

Instead, I suggest that:

  1. all tables are process by default.
  2. If --include is specified, then only those tables are processed.
  3. If --exclude is specified, then those tables are not processed.

Note that with this approach, --include and --exclude are mutually compatible (--exclude would override --include).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all tables are process by default

I did not want to make this the default scenario in case someone accidentally executes the command while trying out the different options provided and they did not want to upgrade all the tables since this is a potentially irreversible change. Upgrading all tables at once can also take a long time.

Maybe we can prompt the user if they wish to proceed?

Note that with this approach, --include and --exclude are mutually compatible (--exclude would override --include).

I think this can be achieved and sounds more plausible. I'll update the options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can prompt the user if they wish to proceed?

Yes, this sounds perfect. Please make sure to include a -I/--non-interactive option for headless scripts.

tutor/commands/jobs.py Outdated Show resolved Hide resolved

SET FOREIGN_KEY_CHECKS = 0;
Select _table_name;
SET @statement = CONCAT('ALTER TABLE ', _table_name, ' CONVERT TO CHARACTER SET {charset} COLLATE {collation};');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change columns that currently use utf8mb4_bin collation to be utf8mb4_unicode_ci? If so, that's a problem. The relatively small number of columns that use utf8mb4_bin are usually doing so because they explicitly want the values to be case sensitive, because case insensitivity on things like course keys has given us a lot of grief in the past.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify that last statement, an example was that MongoDB and the Python layer of CourseKeys are case sensitive, but the key stored in our CourseOverview model is not case sensitive. So at one point we had two course runs in production that had keys that differed only by case (because someone had misspelled when doing the initial creation), and updates to one could clobber the values for the other because they were fighting over the same row in CourseOverview for updates. (The inactive course run was still getting automated push updates from an external system.)

I realize that fixing CourseOverview is well outside the scope of this work, and I'm not trying to suggest we tackle that here. I only put it out there as an example of why apps might want to explicitly specify a sensitive field for storing things like identifier keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change columns that currently use utf8mb4_bin collation to be utf8mb4_unicode_ci?

It seems this is the case. I think we can add a condition that columns with the utf8mb4_bin collation should not be upgraded. Any other specific charsets or collations that we should look out for here?

Or we can just upgrade all the columns that are using utf8mb3_general_ci so that if some other column has their charset/collation fixed by the model in the future, we don't have to update the check for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your findings, it seems like the only forced collations that are not utf8mb4_unicode_ci are:

  • utf8mb3_bin: This should be mapped into utf8mb4_bin, since that's the logically equivalent case-sensitive collation.
  • utf8mb4_bin: This should be preserved as-is, as discussed.
  • utf8mb4_general_ci: This would be mapped to utf8mb4_unicode_ci, but it looks like it only exists for the bundles app, and that app is unused in Redwood and removed entirely just after the Redwood cut, so it's fine to ignore or update it, depending on what's more convenient in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the command to perform the following mappings instead:

  • utf8mb3_general_ci to utf8mb4_unicode_ci
  • utf8mb3_* to utf8mb4_* meaning that the utf8mb3_bin collations are converted to utf8mb4_bin now.

utf8mb4_bin: This should be preserved as-is, as discussed.

We ignore any columns that already have the utf8mb4 charset.

utf8mb4_general_ci: This would be mapped to utf8mb4_unicode_ci, but it looks like it only exists for the bundles app, and that app is unused in Redwood and removed entirely just after the Redwood cut, so it's fine to ignore or update it, depending on what's more convenient in the code.

Yup, trying it out on the nightly release, the bundles app was not there. Therefore, I've ignored it.

LEAVE tables_loop;
END IF;

SET FOREIGN_KEY_CHECKS = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cause any issue when trying to do a partial migration that includes the course_overviews_courseoverview table? That's the only table I'm aware of that has a varchar primary key and foreign keys from other tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out by upgrading only the course_overviews_courseoverviewtab table which has a foreign key to course_overviews_courseoverview to utf8mb4 and keeping the rest at utf8mb3.

The LMS and CMS seemed to be working fine for me. I tested out the following things:

  • Create new Course
  • Publish changes in new course
  • Enroll in new course and verify everything is working fine
  • Edit an existing course
  • Publish new course and verify changes can be seen by students on LMS

docs/local.rst Outdated Show resolved Hide resolved
@Danyal-Faheem Danyal-Faheem force-pushed the danyalfaheem/upgrade-mysql-utf8mb4 branch from 5cadc9e to 73b3d38 Compare June 14, 2024 13:07
Copy link
Contributor

@DawoudSheraz DawoudSheraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update PR description and add link to the 2nd PR that contains do command work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

5 participants