Embed wasm by encoding binary data as UTF-8 code points. #21478
base: main
Conversation
UTF-8? I like the idea, it's just that doesn't UTF-8 take 2 bytes for code points > 127? If I have really bad luck, the file size will be big?
Since the contents of the wasm file will contain wasm opcodes, the byte distribution is going to be very deterministic. So I would expect a large Unity Boat Attack .wasm file to be as representative of a typical opcode distribution as any other Wasm file. Also, since the hypothesis was that gzip/brotli compress byte-aligned data better than non-byte-aligned data, then presumably it won't matter as much that some of the input bytes expand out to two output bytes. (I think they could expand out to e.g. four output bytes as well, and gzip/brotli would likely pick up on such patterns.) Though like I mentioned in the other thread, if the file size on disk before the web server gz/br compression is important, then the expansion to two bytes will show up there.
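The size trade-off discussed above can be sketched directly. This is a hypothetical Python illustration, assuming the byte-to-code-point mapping implied by the `charCodeAt(i) - 1` decoder shown later in this PR (each byte b becomes code point b + 1); the random payload stands in for real wasm opcode data:

```python
import gzip
import random

# Hypothetical sketch: each byte b is embedded as Unicode code point b + 1,
# so bytes >= 0x7F become two bytes once the JS file is saved as UTF-8.
random.seed(0)
payload = bytes(random.randrange(256) for _ in range(100_000))

as_text = ''.join(chr(b + 1) for b in payload)
utf8 = as_text.encode('utf-8')

# On-disk size expands, since roughly half the code points need two UTF-8 bytes...
print(len(payload), len(utf8))
# ...but the post-gzip sizes are what matter behind web server Content-Encoding.
print(len(gzip.compress(payload)), len(gzip.compress(utf8)))
```

Random bytes are a pessimistic stand-in; real wasm byte distributions are more skewed, so the UTF-8 expansion on disk should be smaller in practice.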
Looks like other tests pass, but CircleCI gets a failure in core2:
I am unable to reproduce that failure locally on my Windows system. At first glance this failure looks like something that should not be caused by this PR(?), since this PR only affects things later down the pipe, but I am not 100% sure yet. In any case, since that is the only failure, this PR should now be ready for a first review pass. In particular, everything is now validated to work. The notable thing here is that this scheme requires
This PR still has the feature behind an extra setting. Would we want to remove that setting and always have this on? Or maybe an extra setting might be good, so people have a way to opt out in case odd situations arise at first (such as sublimehq/sublime_text#6320 or google/closure-compiler#4159)
If you've found 2 bugs with tools/IDEs within the first day, then we should expect there to be more. It seems quite risky.
Yeah, having an opt-out setting for a while would be good from that perspective.
@@ -606,6 +606,8 @@ def closure_compiler(filename, advanced=True, extra_closure_args=None):
args += ['--language_out', 'NO_TRANSPILE']
# Tell closure never to inject the 'use strict' directive.
args += ['--emit_use_strict=false']
# Always output UTF-8 files, this helps generate UTF-8 code points instead of escaping code points with \uxxxx inside strings. https://github.com/google/closure-compiler/issues/4158
args += ['--charset=UTF8']
I'd like to see this land separately too if possible, along with a test that closure no longer produces \uxxx
for UTF-8 strings in the input.
This effect is already verified by the existing tests at https://github.com/emscripten-core/emscripten/pull/21478/files#diff-95c5e9a22c0a073731d32eb5c839692fd5851db19b919d30f14c2fde0c7c261bR2-R5, so we should be covered?
I'd still love to see this change (along with the test it affects) land separately, but I won't force you to do it.
Oh wait, the change in file size you are referring to is due to SINGLE_FILE_BINARY_ENCODE, right?
I was wondering if this change would affect other tests. In theory it should affect any JS code which includes a UTF-8 string, right? Regardless of SINGLE_FILE usage
I was wondering if this change would affect other tests. In theory it should affect any JS code which includes a UTF-8 string, right? Regardless of SINGLE_FILE usage
Yeah, this would specify the encoding to be UTF-8 for all output when pushed through Closure.
Is this working? Can I try it?
Yeah, looks like the CI now passes (there were some intermittent CI failures last week), so this should be good to land.
@@ -4851,7 +4853,7 @@ def test_single_file_locate_file(self):
# Tests that SINGLE_FILE works as intended in a Worker in JS output
def test_single_file_worker_js(self):
self.compile_btest('browser_test_hello_world.c', ['-o', 'test.js', '--proxy-to-worker', '-sSINGLE_FILE'])
create_file('test.html', '<script src="test.js"></script>')
create_file('test.html', '<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"></head><body><script src="test.js"></script></body></html>')
Does this test break without this change?
Does the charset in the <head> somehow affect the script tag?
Does this mean that some/all users of SINGLE_FILE would need to add an explicit charset to their HTML?
Our existing shell files have had a <meta charset='utf-8'> directive practically from the dawn of time. That is part of the general "best practices" boilerplate.
So we can quite safely expect that all the custom shell files that users have derived from any of the existing shell files we provide will be UTF-8 (unless they have explicitly removed the meta charset directive, which is a bad idea in general).
About 98.2% of all websites use UTF-8; it is practically the only encoding used on the web. I don't really know of anyone who would use any other encoding on their web pages anymore.
It is just that our test scripts have omitted the meta charset='utf-8' directive.
(If users did not want to specify that meta charset directive for some reason, they can also save the HTML file with a UTF-8 BOM marker, or they can send a Content-Type: text/html; charset=utf-8 HTTP header, but typically the meta charset directive is the simplest way.)
To be clear, without this change, this test is broken?
I wonder if we should write a test to verify that -sSINGLE_FILE output is broken-by-default without UTF-8 encoding but that adding <meta charset='utf-8'> or disabling SINGLE_FILE_BINARY_ENCODE fixes it?
I guess we don't have any other non-ASCII strings in our generated code, otherwise <meta charset='utf-8'> would be required regardless of SINGLE_FILE and/or SINGLE_FILE_BINARY_ENCODE?
To be clear, without this change, this test is broken?

That's correct, the test would not work otherwise.

I wonder if we should write a test to verify that -sSINGLE_FILE output is broken-by-default without UTF-8 encoding but that adding <meta charset='utf-8'> or disabling SINGLE_FILE_BINARY_ENCODE fixes it?

I am not sure what the benefit of such a test would be?
If we wanted to go the extra mile, I think we would want to either generate a UTF-8 BOM into the HTML file, or issue a link time warning if the generated HTML file contains UTF-8 code points but does not contain a <meta charset='utf-8'> directive.

I guess we don't have any other non-ASCII strings in our generated code, otherwise <meta charset='utf-8'> would be required regardless of SINGLE_FILE and/or SINGLE_FILE_BINARY_ENCODE?

That is true. We (Emscripten) don't, but if users have any UTF-8 chars in their JS libraries, then that would require them to specify the charset.
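The BOM idea mentioned above is simple to sketch. This is a hypothetical illustration, not part of this PR; the helper name is mine:

```python
import codecs
import os
import tempfile

def write_html_with_bom(path, html):
    # Prepend the UTF-8 BOM (b'\xef\xbb\xbf') so browsers decode the page as
    # UTF-8 even without a <meta charset> directive or a charset HTTP header.
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF8)
        f.write(html.encode('utf-8'))

# Demo write to a temp file.
demo_path = os.path.join(tempfile.mkdtemp(), 'a.html')
write_html_with_bom(demo_path, '<script src="test.js"></script>')
raw = open(demo_path, 'rb').read()
```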
issue a link time warning if the generated HTML file contains UTF-8 code points, but the HTML file does not contain a <meta charset='utf-8'> directive.

This would be a useful warning for users who generate HTML output, but my understanding is that the vast majority of our users generate JavaScript and then write their own HTML separately. I'm curious, does Unity use emscripten to generate HTML? If so, I guess that would be a big use case for HTML output in production. Otherwise I don't know of any.
@@ -656,7 +658,8 @@ def move_to_safe_7bit_ascii_filename(filename):
# 7-bit ASCII range. Therefore make sure the command line we pass does not contain any such
# input files by passing all input filenames relative to the cwd. (user temp directory might
# be in user's home directory, and user's profile name might contain unicode characters)
proc = run_process(cmd, stderr=PIPE, check=False, env=env, cwd=tempfiles.tmpdir)
# https://github.com/google/closure-compiler/issues/4159: Closure outputs stdout/stderr in iso-8859-1 on Windows.
proc = run_process(cmd, stderr=PIPE, check=False, env=env, cwd=tempfiles.tmpdir, encoding='iso-8859-1' if WINDOWS else 'utf-8')
Is this an existing bug, or does it only show up when --charset=UTF8 is added above? (What is the default charset used by Closure?)
This is a Closure compiler bug. It occurs both with and without --charset=UTF8, so it is not related to that.
I don't know what the default charset used by Closure is.
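The platform-dependent decode in the hunk above can be factored into a small helper. A sketch, with hypothetical names (the real PR inlines the conditional at the `run_process` call site):

```python
import subprocess
import sys

def closure_output_encoding(is_windows):
    # Workaround for google/closure-compiler#4159: Closure emits its
    # stdout/stderr in iso-8859-1 on Windows, regardless of --charset.
    return 'iso-8859-1' if is_windows else 'utf-8'

def run_closure(cmd):
    # Decode subprocess output with the platform-appropriate charset so
    # non-ASCII characters in Closure's diagnostics don't raise UnicodeDecodeError.
    return subprocess.run(cmd, stderr=subprocess.PIPE, check=False,
                          encoding=closure_output_encoding(sys.platform == 'win32'))
```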
data = base64.b64encode(utils.read_binary(path))
return 'data:application/octet-stream;base64,' + data.decode('ascii')
if settings.SINGLE_FILE_BINARY_ENCODE:
return binary_encode(utils.read_binary(path))
Does this not need some kind of data:application/octet-stream; here?
No, these binary embedded strings are not passed to the browser to decode as Data URIs, so specifying a MIME type would be meaningless.
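A Python sketch of the round trip, mirroring the byte+1 scheme implied by this PR's binaryDecode() (the encoder name binary_encode matches the call site shown above; the rationale comment is my guess):

```python
def binary_encode(data):
    # Shift every byte up by one (presumably so the embedded JS string never
    # contains a NUL character), then the result is stored as UTF-8 text.
    return ''.join(chr(b + 1) for b in data)

def binary_decode(text):
    # Python analogue of the JS binaryDecode(): charCodeAt(i) - 1 per character.
    return bytes(ord(c) - 1 for c in text)

blob = bytes(range(256))
encoded = binary_encode(blob)
```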
src/binaryDecode.js
// Closure may analyze through the WASM_BINARY_DATA placeholder string into this
// function, leading into incorrect results.
/** @noinline */
function binaryDecode(bin) { for(var i = 0, l = bin.length, o = new Uint8Array(l); i < l; ++i) o[i] = bin.charCodeAt(i) - 1; return o; }
No need for this to be all on one line is there?
Seems a little unfortunate to create an entirely new file just to hold this tiny function. Perhaps you can add it to runtime_shared.js
instead?
The file proxyClient.js does not include runtime_shared.js, so that would not quite work.
Expanded the function onto multiple lines.
@@ -1817,6 +1817,9 @@ var WASMFS = false;
// [link]
var SINGLE_FILE = false;

// If true, does binary encoding of data instead of base64 encoding.
var SINGLE_FILE_BINARY_ENCODE = true;
If this is true by default, why have a setting for this?
Is it because some folks can't handle UTF-8? If so, it might be worth mentioning there, e.g. "Disable this if you can't handle UTF-8 chars in the generated JS".
The idea is that it would allow developers to pass -sSINGLE_FILE_BINARY_ENCODE=0 to revert back to the previous base64 encoding, should issues arise.
Updated the text doc.
Merge conflicts resolved in: src/preamble.js, test/code_size/random_printf_wasm.json, test/test_browser.py
Updated this PR to latest. Added a new -sSINGLE_FILE code size test in binary-encoded mode. The improvement delta from base64-encoded SINGLE_FILE to this binary-encoded SINGLE_FILE in that code size test is:

size of a.html == 17586, expected 18839, delta=-1253 (-6.65%)
size of a.html.gz == 10152, expected 10973, delta=-821 (-7.48%)
Total output size=17586 bytes, expected total size=18839, delta=-1253 (-6.65%)
Total output size gzipped=10152 bytes, expected total size gzipped=10973, delta=-821 (-7.48%)
Hey amazing, overall generated code size was improved by 1253 bytes!

so the new test shows that the size improvement carries through post-gzip. It's been a while since we discussed this. I still think this would be a good feature to land, be enabled by default, and allow an opt-out that users can use to revert to the old base64 form.
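The quoted percentages are easy to recompute as delta% = (new - old) / old:

```python
# Recompute the deltas quoted above: (new_size, old_size) for a.html and a.html.gz.
for new, old in [(17586, 18839), (10152, 10973)]:
    print(f'delta={new - old} ({(new - old) / old:+.2%})')
# prints delta=-1253 (-6.65%) and delta=-821 (-7.48%), matching the test output
```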
CI looks good and green now. This is good for another round of review.
@@ -2857,6 +2857,18 @@ then you can safely ignore this warning.

Default value: false

.. _single_file_binary_encode:

SINGLE_FILE_BINARY_ENCODE
How about calling this SUPPORT_UTF8_EMBEDDING to match the existing SUPPORT_BASE64_EMBEDDING?
I do wish we could make this change without adding yet another setting, but I guess it's not reasonable to make it unconditional. Do you think we could make it unconditional at some point in the future?
#if SINGLE_FILE && SINGLE_FILE_BINARY_ENCODE
#include "binaryDecode.js"
var workerURL = URL.createObjectURL(new Blob([binaryDecode(filename)], {type: 'application/javascript'}));
#else
var workerURL = filename;
if (SUPPORT_BASE64_EMBEDDING) {
Unrelated to this change, but this line looks very strange. Shouldn't this be #if SUPPORT_BASE64_EMBEDDING?
Quick and dirty test to see what the CI says.
Possible alternative to #21426.