open() utf-8 standardization #75

munyanjacob · 2021-12-03T07:36:02Z

Extending the discussion from #74 to a new issue:

I do think that testing to make sure people are using utf-8 makes sense, because I also think it's likely for locale encoder differences to be a future problem if locale encoders aren't ensured to be utf-8 or every open() is not specified to use utf-8.

Because the page linked said it would be best practice to specify, is this something we should update other instances of open() to have & require in the future? It sounds simpler to test for utf-8 once at the start, but best practices are best practices so idk which way to proceed

Originally posted by @munyanjacob in #74 (comment)

The text was updated successfully, but these errors were encountered:

jcushman · 2021-12-03T19:29:25Z

Yeah, it seems to be technically correct for every open in a cross-platform library to have an encoding, so changing other open calls seems fine.

I don't think (unless maybe PEP 597 lands) there's any way to automatically enforce that practice, so it's likely to stay more a fix-it-if-someone-complains thing in this and other projects. Which in turn means, I'm guessing, Windows python users must be mostly setting export PYTHONUTF8=1 to solve the problem, or using linux in WSL, because who wants to chase down a bunch of libraries to fix their open commands after every package update.

I don't think we (or any given library) should be testing the locale and complaining at the start. Maybe Python itself should change the default encoding and warn people about it, but users don't need random libraries complaining about their locale setting.

Oh, actually a larger fix might be to get the CI tests running on Windows -- I think Github Actions supports that.

mlissner · 2021-12-03T19:55:28Z

I agree with everything Jack says, per usual.

The only other thought I have is that semgrep helps a lot with these kinds of things. They don't currently have a rule requiring an encoding parameter to open(), but I bet they'd be happy to have one. They have a bunch of rules categorized as "python/lang/best-practice" (including one to make sure open() commands are closed):

https://github.com/returntocorp/semgrep-rules/tree/develop/python/lang/best-practice

If you got such a rule put together in semgrep, we could turn it on for this repo (we use it elsewhere already), and it would throw an error in PR's that don't include the encoding parameter. That'd fix it forever. Including Windows in our CI would help too, like Jack says.

There's a guide here for creating new rules: https://semgrep.dev/docs/writing-rules/overview/

I contributed one a bit back, it was pretty painless. Yaml and Python.

munyanjacob · 2021-12-03T20:35:38Z

Awesome, adding it to semgrep and including Windows in the CI both sound good.

Although the uncommon few who experience it can just do what Jack mentioned, for now I'll still take a look at adding the rule to semgrep.

I have been learning a lot just with this one issue - thank you both!

mlissner · 2021-12-16T00:54:36Z

How did this go, @munyanjacob ? Any progress?

munyanjacob · 2021-12-17T00:51:32Z

I finished writing the rule but have not been able to commit it (appears to also be because of a Windows issue, this time with Windows marking files as executable and pre-commit requiring they have shebangs).

Currently downloading WSL to see if using that can help. Windows does not feel good.

mlissner · 2021-12-17T00:58:19Z

Windows, as you're learning, may be an uphill fight. :)

Usually, pre-commit will fix most errors for you. I usually run it once (it fails and fixes things), and then I run it again (and it passes).

munyanjacob · 2021-12-17T20:55:48Z

The rule just merged but with severity of warning instead of error. Is a warning going to be enough for requiring an encoding in new PRs?

munyanjacob · 2021-12-17T23:56:22Z

Windows, as you're learning, may be an uphill fight. :)

Do you guys generally run Linux, Mac, or WSL? I knew it was common for people to use linux/mac for dev, but I thought there would be more Windows users running into problems like mine lol

mlissner · 2021-12-18T00:33:21Z

I think the severity should be enough. I just enabled semgrep on this repo, so I think we'll see as soon as somebody does a PR with a bad open in it. Good stuff. I also enabled it for our other repos that use semgrep.

I run Linux. I don't know what Jack's using. Nobody around here uses Windows that I have noticed in awhile. I usually only find out they do because something is broken for them (like this). I like Ubuntu, FWIW!

munyanjacob mentioned this issue Dec 17, 2021

Add rule for specifying encoding on file open (python best practice) semgrep/semgrep-rules#1670

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

open() utf-8 standardization #75

open() utf-8 standardization #75

munyanjacob commented Dec 3, 2021

jcushman commented Dec 3, 2021

mlissner commented Dec 3, 2021

munyanjacob commented Dec 3, 2021

mlissner commented Dec 16, 2021

munyanjacob commented Dec 17, 2021

mlissner commented Dec 17, 2021

munyanjacob commented Dec 17, 2021 •

edited

Loading

munyanjacob commented Dec 17, 2021

mlissner commented Dec 18, 2021

open() utf-8 standardization #75

open() utf-8 standardization #75

Comments

munyanjacob commented Dec 3, 2021

jcushman commented Dec 3, 2021

mlissner commented Dec 3, 2021

munyanjacob commented Dec 3, 2021

mlissner commented Dec 16, 2021

munyanjacob commented Dec 17, 2021

mlissner commented Dec 17, 2021

munyanjacob commented Dec 17, 2021 • edited Loading

munyanjacob commented Dec 17, 2021

mlissner commented Dec 18, 2021

munyanjacob commented Dec 17, 2021 •

edited

Loading