Non-ascii character may trigger "unreachable" parser state #190

ghost · 2023-01-29T19:33:11Z

Environment

toml++ version and/or commit hash:
toml++ v3.3.0 commit c635f21

Compiler:
g++ 10.2.1

C++ standard mode:
-std=c++17

Target arch:
x86_64

Library configuration overrides:
none

Relevant compilation flags:
-std=c++17 -O2 -fsanitize=undefined

Describe the bug

In certain very specific cases, a non-ascii character can force the parser into the TOML_UNREACHABLE state in function is_non_ascii_horizontal_whitespace().

The conditions that cause this seem to be:

TOML file contains a multi-line basic string
and the multi-line string contains a line-ending-backslash
and the first non-whitespace character on the next line is a non-ascii character in a certain codepoint range.

Steps to reproduce (or a small repro code sample)

The following TOML document triggers this issue:

s = """\
§"""

Note that the last character on the first line is a backslash.
The first character on the second line is U+00A7.

The following C++ program also demonstrates this:

#include <string>
#include <iostream>
#include <toml++/toml.h>

int main()
{
	std::string doc = "s = \"\"\"\\\n\xc2\xa7\"\"\"\n";
	try
	{
		auto data = toml::parse(doc, std::string("tricky"));
		std::cout << "great!!" << std::endl;
	}
	catch (const toml::parse_error& err)
	{
		std::cerr << "\n\n" << err << "\n";
		return 1;
	}
	catch (...)
	{
		std::cerr << "\n\nAn unspecified error occurred.\n";
		return 1;
	}

	return 0;
}

If I compile with -fsanitize=undefined, this program prints

include/toml++/impl/unicode_autogenerated.h:44:13: runtime error: execution reached an unreachable program point

Without the -fsanitize flag, the parser runs without errors and appears to parse the document correctly.

Additional information

Apart from the specific issue in the implementation of is_non_ascii_horizontal_whitespace(), I also doubt whether it is in principle even correct to assign special significance to non-ascii whitespace. The TOML 1.0.0 specification talks about "trimming whitespace", which is open for interpretation. However, the ABNF grammar unambiguously specifies whitespace as being ASCII space and ASCII tab and nothing else. And this narrow interpretation of whitespace is applicable to the definition of mlb-escaped-nl in the ABNF.

For example, in the following TOML document, toml++ parses s as an empty string, while I think the correct answer is that s contains a single character U+00A0:

s = """\
<U+00A0>"""

I'm experimenting with a TOML fuzzer which revealed this issue. Out of 1000 randomly generated TOML documents, toml++ now parses 985 documents correctly. The remaining 15 documents seem to fail on this issue. (This is a pretty good score; many other TOML parsers fail on the majority of random files.)

Sorry to bother you again with an obscure case. I understand if this gets low priority. But it may still be worthwhile to fix it eventually.

The text was updated successfully, but these errors were encountered:

marzer · 2023-01-29T19:45:12Z

Nice! I love when people use fuzzers - you're not the first to throw one at toml++. I really should figure out how to set one up at some stage.

I also doubt whether it is in principle even correct to assign special significance to non-ascii whitespace.

Yep. This handling was a leftover from before the TOML spec was more explicit about ascii/unicode whitespace; at one stage it was essentially left as an exercise to the reader what to do with non-ascii whitespace.

ghost · 2023-01-29T20:11:50Z

My random TOML tester is here https://github.com/jorisvr/toml-test-runner/tree/main/random_test
You are of course welcome to use it if you want. However the status of my code is ... let's call it "experimental".

marzer · 2023-01-30T08:45:10Z

However the status of my code is ... let's call it "experimental".

Sounds like all the python I've ever written tbh 😅

ghost added the bug Something isn't working label Jan 29, 2023

ghost assigned marzer Jan 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-ascii character may trigger "unreachable" parser state #190

Non-ascii character may trigger "unreachable" parser state #190

ghost commented Jan 29, 2023

marzer commented Jan 29, 2023 •

edited

Loading

ghost commented Jan 29, 2023

marzer commented Jan 30, 2023

Non-ascii character may trigger "unreachable" parser state #190

Non-ascii character may trigger "unreachable" parser state #190

Comments

ghost commented Jan 29, 2023

Environment

Describe the bug

Steps to reproduce (or a small repro code sample)

Additional information

marzer commented Jan 29, 2023 • edited Loading

ghost commented Jan 29, 2023

marzer commented Jan 30, 2023

marzer commented Jan 29, 2023 •

edited

Loading