inclusion of 4-byte unicode character results in plan truncation #247

GoogleCodeExporter · 2015-05-31T18:07:56Z

Username: [baldwint]
Revision: 620
URL: grinnellplans.com/read.php?searchname=baldwint
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) 
AppleWebKit/536.26.17 (KHTML, like Gecko) Version/6.0.2 Safari/536.26.17

How can we reproduce the problem?
1. back up your plan
2. edit plan, inserting a 4-byte unicode character in the middle (I recommend 
💩)
3. submit, observe data loss

What is the expected output? What do you see instead?

In the case of the example above, I expect to either see a turd in the middle 
of my plan, or a blank space indicating that plans does not support turds in 
plans. Instead I see that all content following the turd insertion point has 
been wiped.

This is probably a database thing, since 'utf8' really means '3-byte utf8' in 
MySQL-land. Ref: http://mathiasbynens.be/notes/mysql-utf8mb4 . I suppose we 
could follow the (heinous) upgrade procedure outlined in the article, or else 
implement some filtering on the PHP side to remove any 4-byte characters.

Original issue reported on code.google.com by umhecbaa on 7 Jan 2013 at 4:31

The text was updated successfully, but these errors were encountered:

GoogleCodeExporter · 2015-05-31T18:07:56Z

This sounds like it will be a giant pain to fix. Although (as of last night) 
all database servers are MySQL 5.5.3+, it looks like we will have to alter 
every last CHAR, VARCHAR, and TEXT column in the database, as well as change 
the sizes to compensate? In addition, we need to ensure MySQL always connects 
to its database server using the right connection charset (WTF, MySQL? This 
shouldn't be that hard!)

So, at the very minimum, we need:
1. A migration to move the database, every table, and every [var]char|text 
column to a mb4 encoding and collation. (what's the difference between the 
utf8mb4_unicode_ci, utf8mb4_general_ci utf8mb4_bin collations? 
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html is somewhat 
less than clear)
2. Updates  to the model files (possibly? no collation is listed in them?) at 
http://code.google.com/p/grinnellplans/source/browse/trunk/db/models/generated 
to maybe fix size problems?
3. A new version of the DB schema at 
http://code.google.com/p/grinnellplans/source/browse/trunk/documents/db-schema 
, to aid anyone keeping track at home.
4. Something in the database connection init to enable mb4.

Am I missing something?

Original comment by [email protected] on 7 Jan 2013 at 8:55

GoogleCodeExporter · 2015-05-31T18:07:56Z

Update on my #4: Database connections are created in at least four different 
places in the code: 
1. When Doctrine is initialized, in bootstrap.php line 13
2. in inc/Database.php, which itself is only used in inc/User.php, 
anonymous.php (secrets), and kommand/style-stats.php. Let's refactor these uses 
to Doctrine and delete inc/Database entirely.
3. in dbfunctions.php's db_connect() function, which is used all over the 
place. 
4. in functions-database.php's db_connect() function. As far as I can tell, 
functions-database.php is included absolutely nowhere. I deleted it on my test 
server, and nothing broke. I am pretty sure this is orphaned code.

Original comment by [email protected] on 7 Jan 2013 at 9:39

GoogleCodeExporter · 2015-05-31T18:07:57Z

Really nice find, Tom.

Alex, you don't necessarily need to alter every string column in the DB - only 
columns that accept user input. And I imagine really it would be good enough to 
fix only the `plans.edit_text` and `plans.plan`, and maybe `accounts.pseudo` 
for good measure.

If the upgrade procedure proves too hideous, seems like we could also come up 
with something like 
http://stackoverflow.com/questions/3220031/how-to-filter-or-replace-unicode-char
acters-that-would-take-more-than-3-bytes?lq=1 to filter these characters before 
they hit the DB - easy enough to do through Doctrine. Not quite as nice as 
fully supporting these characters, but it would prevent data loss, which is the 
most important part.

Original comment by [email protected] on 7 Jan 2013 at 9:51

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter · 2015-05-31T18:07:57Z

This issue was updated by revision r621.

Original comment by [email protected] on 16 Jan 2013 at 2:13

GoogleCodeExporter · 2015-05-31T18:07:57Z

This appears to be fixed on the production server (which is awesome), but it 
still doesn't work for me on my local copy. Are the changes in r621 all you 
did, Alex, or did you also run some conversion queries on the production 
database?

Original comment by umhecbaa on 13 Sep 2013 at 5:39

GoogleCodeExporter · 2015-05-31T18:07:57Z

edit_text and plan (along with a couple other columns) are now utf8mb4, not 
utf8, on prod. I'll post a sanitized database dump this PM.

Original comment by [email protected] on 13 Sep 2013 at 5:45

GoogleCodeExporter · 2015-05-31T18:07:57Z

Here's a tables-only database dump (don't restore from this - it's missing 
things like the data for interfaces and system).

(Also, all tables are InnoDB. We're not using transactions or any of the other 
InnoDB cool stuff, but it's purportedly more crashsafe, which on EC2 is a good 
thing)

Original comment by [email protected] on 13 Sep 2013 at 5:55

Attachments:

plans_prod.structureonly.20130913.sql

GoogleCodeExporter · 2015-05-31T18:07:57Z

Thanks for the database dump, Alex. By diffing against a structure-only dump 
from my development database I managed to figure out how to convert it myself. 
Attached are the queries I made. The first script moves to InnoDB, the second 
converts the 3 most important columns to utf8mb4.

I suppose these could be rewritten as doctrine migrations but since the queries 
(or something similar) have already been run manually on production I did not 
want to create confusion.

4-byte unicode characters now work in plans, but not yet for plan names. 
Production plans seems to store the character ok (when I return to the 'change 
name' form, I can retrieve the text I entered) but it displays as mojibake in 
read.php (check my current plan name for an example).

I was hoping to work on this tonight now that my schema matches production, but 
my local server is failing in a different mode. It seems to store the character 
as '????' in the database. Did I miss something while converting?

Original comment by umhecbaa on 23 Sep 2013 at 3:20

Attachments:

GoogleCodeExporter added Type-Defect auto-migrated Priority-High labels May 31, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inclusion of 4-byte unicode character results in plan truncation #247

inclusion of 4-byte unicode character results in plan truncation #247

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

inclusion of 4-byte unicode character results in plan truncation #247

inclusion of 4-byte unicode character results in plan truncation #247

Comments

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015

GoogleCodeExporter commented May 31, 2015