Define syntax and format of REUSE.yaml #81
Comments
When it comes to YAML flavours I think all should be OK – I guess we would use an external parser and linter anyway, right? For files that Regarding traditional copyright statements, I think it is reasonable to expect an SPDX tag, but after it, it should be free text form. Non-SPDX-tag statements were accepted before for legacy reasons. The YAML file is going to be new, so no legacy exists for it. Even if someone has a preferred format, they can just prepend it with SPDX tag. Globbing – no preference, as long as it’s something that is in common practice and coherent. Conflict resolution – I agree with your proposal. |
I think that the syntax should avoid the strings “SPDX-License-Identifier:” and “SPDX-<tagname>:”. Those strings are likely to cause false positives. Tools that aren’t REUSE.yml aware will mistakenly assume that the data applies to REUSE.yml. Here’s my proposal:
Option 1: list
- files: "src/*"
  info:
    - "FileCopyrightText: 2020 Me"
    - "FileCopyrightText: © 2017 You"
    - "License-Identifier: MIT"
Option 2: multi-line string
- files: "src/*"
  info: |
    FileCopyrightText: 2020 Me
    FileCopyrightText: © 2017 You
    License-Identifier: MIT
Option 3: license and copyright as separate keys
- files: "src/*"
  "FileCopyrightText":
    - "2020 Me"
    - "© 2017 You"
  "License-Identifier": MIT
If we do decide to drop the “SPDX-”, then I would recommend option 3. That way, if someone makes a mistake and includes the “SPDX-”, they have to do less to fix it. I would also recommend making the REUSE Tool give a helpful error when this mistake happens. For example, it could say “Found ‘SPDX-License-Identifier’ in REUSE.yml. In REUSE.yml, use ‘License-Identifier’ instead (no ‘SPDX-’).”
|
Great catch, @Jayman2000! What you write makes sense to me. It does provide some extra complication, but seems worth it to me in order to avoid future issues. |
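The helpful error suggested above for a stray “SPDX-” prefix could be produced along these lines. This is only a minimal sketch: it assumes the Option 3 key names, works on an already parsed entry (a plain dict), and `check_annotation_keys` is a hypothetical helper, not part of the actual REUSE tool.

```python
# Hypothetical sketch: key names follow Option 3 from this thread.
ALLOWED_KEYS = {"files", "FileCopyrightText", "License-Identifier"}


def check_annotation_keys(entry: dict) -> list[str]:
    """Return human-readable error messages for one parsed REUSE.yml entry."""
    errors = []
    for key in entry:
        if key in ALLOWED_KEYS:
            continue
        stripped = key.removeprefix("SPDX-")
        if key.startswith("SPDX-") and stripped in ALLOWED_KEYS:
            # The user wrote the file-header tag; point at the short form.
            errors.append(
                f"Found '{key}' in REUSE.yml. In REUSE.yml, use "
                f"'{stripped}' instead (no 'SPDX-')."
            )
        else:
            errors.append(f"Unknown key '{key}' in REUSE.yml.")
    return errors


entry = {"files": "src/*", "SPDX-License-Identifier": "MIT"}
for message in check_annotation_keys(entry):
    print(message)
```

Checking keys after parsing keeps the sketch independent of whichever YAML library ends up being used.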
Why not rename |
We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502). In SPDX, there are multiple "license" fields for instance, e.g. the concluded or declared license. I am afraid that this unclear terminology would not pass SPDX. However, the main goal is to avoid confusion: so either we stick with the tags that are already used in REUSE (except in DEP5) or we make them really simple (as you suggested). |
We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502).
While I can see why you might want that, I'm not sure that's a goal worth pursuing. One of the reasons I keep my usage of SPDX to the minimum is its verbosity. I fear if and when your proposal is merged into SPDX, it's going to become yet another verbose way of specifying licensing information people will avoid.
I'm also unsure why you want to deprecate DEP-5, which in my view is superior to many other similar formats. If something isn't quite right in it, I'd personally try to evolve it into a machine-readable copyright format 2.0 rather than abandon it completely.
|
Most of this looks good to me. I would like to add my two cents in regards to two things:
|
I don’t think YAML vs JSON is an issue with Python: there are multiple YAML libraries for Python (pyyaml, ruamel, strict-yaml), so YAML is quite well-supported. |
JSON is in the standard library and |
The files we intend to use do not have much in common with a full SPDX SBOM, which I agree is impossible for humans to parse. However, making REUSE's labelling compatible with an ISO standard has the great advantage that the likelihood of being compatible with other tools and best practices is much higher. I see the advantage of creating our own specs, but following the practice of "not invented here" even if there are somewhat good alternatives has only seldom advanced technology.
Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html |
Reading the discussion on spdx/spdx-spec#502 one thing stands out to me, the desire to align with the SPDX YAML. I think the current thoughts best align with the Based on that I would consider a file like:
---
spdxVersion: "SPDX-2.3" # mandatory to allow future spec changes
creationInfo: # optional
  comment: "Easily add metadata to image files."
  created: "2022-05-25"
  # and other metadata if desired
# FIXME: perhaps needs information that this is to be considered input, not output
files:
  # In line with SPDX YAML output
  - copyrightText: "Copyright Photographer X"
    fileContributors: ["Photographer X"] # optional
    licenseConcluded: "CC-BY-4.0"
    fileName: "./images/other-author.jpg"
  # My main proposal for simplicity
  - fileGlob: "./images/*.jpg" # or another term, but to differentiate from 'fileName'
    copyrightText: |
      Copyright 2022 Photographer X
      Copyright © 2022 Image editor Y
    fileContributors:
      - "Photographer X"
      - "Image editor Y"
    licenseConcluded: "CC-BY-4.0" # I don't see a reason to change the key, or is there?
I know the format is quite different from earlier proposals:
I step into this discussion quite late, so feel free to point out my false reasoning. |
Apart from having to put the file in Instead of creating a new YAML format, have you considered extending dep5 support so that it is possible to put files at any directory level? Like what you are proposing with Also, from the linked email:
Using
The only required information that's not directly related to REUSE is the On the other hand if this YAML format gets standardized as an official SPDX format and it is not too verbose it would be nice to adopt it instead :) Edit: forgot to mention, but implementation details such as Python's standard library support for YAML, JSON, etc should not be a high priority (I wouldn't consider them at all... one of the points of standardizing a format is the possibility of having different interoperable implementations, regardless of the programming language used) |
@mxmehl to follow up on the issues I identified in rust-lang/rust#99415 (comment), I'm wondering whether Tachi's proposal of a The discussion to define the YAML format seems to have stalled on the SPDX side, and implementing |
Quite the opposite, I’m afraid, @pietroalbini. There are several points where DEP5 (mostly, but not only, due to historical reasons) differs from SPDX and REUSE. To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle. And bending DEP5 to suit REUSE seems to break much more than creating our own SPDX(-derived) YAML format. |
I don't get it. The machine-readable copyright format, which you not quite correctly refer to as DEP5, has been in use in Debian for quite a long time, more than a decade if I remember correctly. So far, as far as I'm aware, we haven't received requests for improvement from REUSE, but if we did, I'm certain they could eventually result in a version 1.1 or even 2.0. After all, the goal of the format was to provide a human- and machine-readable way of documenting license and copyright information, so if it didn't fulfill that goal, improving it was never off the table.
The only real downside of it as opposed to a YAML-based format is a need for a parser, but that's been solved ages ago (and also the format is a composition of well-known standards such as RFC 822, so it's not exactly something odd).
--
Cheers,
Andrej
|
@silverhook I understand your desire for a format compatible with the wider SPDX ecosystem! I don't have a preference for either choice myself, but there are currently issues that I'd like to help fix that are blocked on this. The point I was making was that to adopt Of course I'm an outsider to the project, and I don't have many insights on how hard gathering the consensus within the REUSE project would be 🙂 As I hinted before, I'm working to adopt REUSE in the Rust compiler, and we're facing some blocker issues:
I'm willing to help with some implementation work to solve the two issues I mentioned above, but designing and gathering consensus in SPDX for a suitable format is going to take more time than I can commit. To be clear, I don't want to pressure you into making a choice you don't like just because we want to adopt REUSE in the Rust project. If we can't find a solution in the near term to those issues, we'll just have to create our own bespoke tooling and wait for those issues to be addressed before reconsidering REUSE. |
Citing @silverhook:
As I asked in #81 (comment), could you please explain why DEP5 doesn't currently suit REUSE's needs? Yes, it doesn't support all SPDX's features, but neither does REUSE. As far as I understand, SPDX's scope is far broader than just handling licensing information, while REUSE's goal is to "Make licensing easy for everyone", and DEP5's simple and limited format perfectly aligns with this goal, as I've been able to observe in different open source projects. I don't know your plans for the future of REUSE, so I'm of course missing something. Hence, would you please help us better understand your point? Thanks :) |
|
Thanks for your nice and complete reply!
I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.
Yeah, that's true. If I were a Python guy I would've put some effort into moving the DEP5 parser in a separate, less Debian-specific package. But I'm not :/
Isn't option one in the linked issue independent of the file format? Also, I think that adding support in DEP5 for a glob like the one you mentioned ("all files in docs/* except those with a certain file extension") is something that could be useful to Debian too. Anyway, yes, DEP5 doesn't support, and likely never will, any overriding mechanism, but please keep in mind that adding such a feature could be a double-edged sword - ideally, REUSE.yaml (or REUSE.dep5) should be easily understandable without having to look too much at the documentation.
I'd argue that DEP5 is way more user friendly than YAML, especially if you've never used either of those before (and if you're not used to the concept that indentation really matters) - but as you say, this is subjective. In any case, please keep in mind that Debian really cares about license compliance and copyright attributions (the copyright format was not created by accident!), and I'm sure some Debian folks (including me) would be more than glad to help with REUSE (with regards to evolving DEP5, making the Python parser more portable and reliable, etc.) :) |
Thanks @carmenbianca for explaining the concerns you all have about using DEP5 for the new file format. Having more clarity on that rationale helps. I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.
Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach. |
Thanks for the constructive exchange of opinions and arguments!
I understand. Thanks for what you tried and accomplished!
Understandable. The REUSE team is working on creating a concrete proposal for including this in the next SPDX spec (whenever this will be released...) and will include some stakeholders later in the process to implement feedback early on and reduce friction. No concrete timeline yet and certainly nothing that's done in the next few weeks unfortunately. |
I already did in the REUSE chat, but I hereby publicly volunteer to take on the SPDX side of this. (This is not to contradict @mxmehl , but to support him and perhaps make the public message more clear that people are working on this.) REUSE snippets support just about got into the last SPDX spec version on time, so there’s ample time until the next revision. From what I can tell, the way we set up REUSE so far, it shouldn’t be a huge impact on SPDX anyway. So as long as someone keeps an eye that we’re using the right SPDX tags and not misusing them (again, I volunteer for that part), we should be able to draft a full |
I’m not happy with this discovery, esp. this late in the development of https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell Perhaps TOML would be a better choice? (which itself is not free of criticism either, of course) 😨 Ultimately, there’s – surprise! ;) – no perfect format:
|
In the Rust community we use TOML extensively and... it's fine. In my experience TOML is fairly nice and concise if the schema is designed around the TOML structure and limitations, and painful if you just uplift the schema you used in YAML into TOML. The suggestion I can make if you want to go with TOML is to start designing the REUSE schema from scratch with it rather than just port the YAML work and serialize it in TOML. |
I’ve been toying with TOML (in a different and very limited use case) a bit and so far my biggest issues were in practice just two:
I think REUSE could definitely be done simply in TOML, if we decide for that instead. Neither of the two issues I ran into should come up in REUSE really. A very good point, @pietroalbini, thanks for the tip! |
Yeah, I recall that we talked about the issues of YAML already when we talked about whether it should rather be JSON. We didn't make a decision as both have problems - spec-wise or user-friendliness-wise. We also had a short look at StrictYAML, but as this post suggests it's far from perfect. I waver between YAML and TOML.
For reference, here's the current format we came up with in internal exchanges:
version: 1
annotations:
  - path: src/*
    SPDX-FileCopyrightText:
      - 2020 Me
      - © 2017 You
    SPDX-License-Identifier: MIT
  - path: test.md
    SPDX-FileCopyrightText:
      - "(c) containing a : for some reason must be quoted"
    SPDX-License-Identifier: 0BSD
|
Just as an exercise, I think a TOML version could look as such:
version = 1
[[annotations]]
path = "src/*"
SPDX-FileCopyrightText = [
  "2020 Me",
  "© 2017 You",
  "(c) whitespace/indenting is optional gGmbH"
]
SPDX-License-Identifier = "MIT"
[[annotations]]
path = [ "test.md", "README.md" ]
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"
I’m sure @pietroalbini can come up with a more elegant way than I.
|
That actually looks fairly good and idiomatic @silverhook! The only change I'd make is replacing the |
That’s what I suggested some time ago, and it was rejected 🙂 |
We discussed that but decided to stick with the known tags to make it easy for users and scanners. For instance, some people also use other SPDX tags in comment headers, e.g. Regarding scanners, it was mentioned that SPDX tags would trigger false-positives. This would happen anyway with all the IDs and copyright statements. |
LGTM, except one line:
Do we want Generally, I feel that the lists using |
I don’t have strong feelings either way on the “string vs list of strings” question. I leave that to people who use that more often than I do. (I’ll only add that it feels a bit odd that SPDX-FileCopyrightText can be a list, but path and/or SPDX-License-Identifier can’t) If it turns out it’s preferable to keep it simple, while more verbose, we could just say that In that case my example would be then:
version = 1
[[annotations]]
path = "src/*"
SPDX-FileCopyrightText = [
  "2020 Me",
  "© 2017 You",
  "(c) whitespace/indenting is optional gGmbH"
]
SPDX-License-Identifier = "MIT"
[[annotations]]
path = "test.md"
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"
[[annotations]]
path = "README.md"
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"
|
On Mon, 16 Jan 2023 at 12:14:41 -08:00:00, Matija Šuklje
***@***.***> wrote:
(I’ll only add that it feels a bit odd that SPDX-FileCopyrightText
can be a list, but path and/or SPDX-License-Identifier can’t)
Letting SPDX-License-Identifier be an array can be ambiguous. The Meson
build system allows this in their `license` field, but then you cannot
tell if [ "GPL-3.0-or-later", "ISC" ] means "GPL-3.0-or-later AND ISC"
or "GPL-3.0-or-later OR ISC". Yes, you could say that "both means AND",
but why introduce yet another idiom when SPDX license expressions work
fine?
Here's the Meson PR for completeness:
mesonbuild/meson#9940
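A tiny illustration of the ambiguity, as a sketch only: the same list can be read two ways, whereas a single SPDX license expression carries its own operator. `license_expression` is a hypothetical helper that simply refuses arrays; it does not validate the expression itself.

```python
def license_expression(value) -> str:
    """Return the SPDX license expression, refusing ambiguous arrays."""
    if isinstance(value, str):
        return value
    raise TypeError(
        "SPDX-License-Identifier must be a single SPDX license expression; "
        f"a list such as {value!r} could mean AND or OR."
    )


licenses = ["GPL-3.0-or-later", "ISC"]
# Two equally plausible readings of the same list:
print(" AND ".join(licenses))
print(" OR ".join(licenses))
# A single expression string leaves no room for interpretation:
print(license_expression("GPL-3.0-or-later OR ISC"))
```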
|
Have you considered https://nestedtext.org/ in the list of potential file formats? |
@Tachi107, I absolutely agree! I am not saying we should let @eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp. |
I would allow both |
An example might look like this:
The official implementation is python, https://nestedtext.org/en/stable/related_projects.html lists e.g. golang and ruby implementations. |
How is it with this line then? Does the
|
To answer my own question, it seems it avoids that pitfall (and quoting is not needed). To cite the documentation:
|
IMHO both TOML and NestedText would work. At this stage, perhaps the best would be to test all these formats with a larger and more complex example to see how they fare in real life examples. |
As discussed in spdx/spdx-spec#502, the SPDX project plans to support a "metadata, pre-document file" that contains specific information about files relative to its position. This follows a request to implement something called REUSE.yaml, first discussed here. This issue is to discuss the exact format and syntax of the file.
Proposed YAML options
In the original discussion, we proposed four different syntaxes. One of them (also disliked by the REUSE team) has been turned down in a SPDX call. I removed two others as they are rather unintuitive and clumsy. Also, I changed the format a bit to comply with the YAML syntax (using * as key name is invalid), and added another option.
Option 1: list
Each list item is a SPDX tag as used in file headers. Easy to read thanks to the -, but all items must be wrapped in " to escape the : which would separate a key from a value – we cannot have multiple keys!
Option 2: multi-line string
SPDX tags are just separated by new lines. No - or escaping of : are required. However, indentation must be preserved for all lines!
Option 3: license and copyright as separate keys
We could also separate the two information items. Downside: the keys must be wrapped in " to escape the - in the key name.
Background on the YAML keys
Unlike the SPDX YAML format, we would like to avoid copyrightText and licenseDeclared as key names. In REUSE, the SPDX-License-Identifier and SPDX-FileCopyrightText (or alternatively traditional, varying copyright statements) are common and understood by the users.
This was also accepted in the SPDX call.
Possible targets
REUSE.yaml is intended to target files that are relative to its position, and only those that are "below".
Statements like files: "../../src/*" should not be possible.
Supporting traditional copyright statements?
A related question is whether we should only support SPDX-FileCopyrightText as indicator for files' copyright, or also "traditional" statements like "Copyright © 2021 Jane Doe".
REUSE recommends the SPDX tag, but also supports the traditional statements. My suggestion would be to do the same in REUSE.yaml to reduce friction, but in SPDX this could lead to conflicts. Happy to collect opinions here!
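Coming back to the "Possible targets" rule (only files at or below the position of REUSE.yaml), a conservative check could look roughly like this. It is a sketch under assumptions: patterns use forward slashes, `is_allowed_target` is a hypothetical name, and any ".." segment is rejected outright, even one that would resolve back inside the directory.

```python
from pathlib import PurePosixPath


def is_allowed_target(pattern: str) -> bool:
    """True if the pattern stays at or below the directory of REUSE.yaml."""
    path = PurePosixPath(pattern)
    # Reject absolute paths and anything that climbs out via "..".
    return not path.is_absolute() and ".." not in path.parts


print(is_allowed_target("src/*"))
print(is_allowed_target("../../src/*"))
```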
Globbing
DEP-5 uses a simple glob syntax. In this, */Makefile would include any Makefile in all paths below. I am not sure whether this globbing is represented in any native Python module. The benefit of sticking with the DEP-5 glob is that we could more easily convert existing DEP-5 files to REUSE.yaml.
Another possibility would be using the Python-native glob. */Makefile would only match a Makefile in one level below, while **/Makefile would match all Makefiles.
We could also use pathspec, supporting the same globbing as gitignore.
Conflict resolution
As in DEP-5, I would suggest that the last match of a file wins. So if the file foo.txt is first matched by * and then *.txt, the last statement would count.
The dependency resolution within REUSE and its different options – including REUSE.yaml – is discussed in #70.
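The two globbing styles and the "last match wins" rule can be compared with a small sketch. Assumptions: `fnmatchcase` from the standard library stands in for DEP-5-style matching (its * also matches /), `glob_match` is a deliberately simplified hand-written translation that only understands * and **/, and all function names are hypothetical. pathspec/gitignore semantics are not covered.

```python
import re
from fnmatch import fnmatchcase


def dep5_match(path: str, pattern: str) -> bool:
    """DEP-5-like matching: '*' also matches '/' (as fnmatch does)."""
    return fnmatchcase(path, pattern)


def glob_match(path: str, pattern: str) -> bool:
    """Simplified glob-like matching: '*' stays within one path segment,
    '**/' spans any number of directories (including none)."""
    regex = ""
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            regex += "(?:.*/)?"
            i += 3
        elif pattern[i] == "*":
            regex += "[^/]*"
            i += 1
        else:
            regex += re.escape(pattern[i])
            i += 1
    return re.fullmatch(regex, path) is not None


def resolve(path, annotations, match=dep5_match):
    """Return the info of the last annotation whose pattern matches."""
    winner = None
    for pattern, info in annotations:
        if match(path, pattern):
            winner = info  # later entries override earlier ones
    return winner


print(dep5_match("a/b/Makefile", "*/Makefile"))
print(glob_match("a/b/Makefile", "*/Makefile"))
print(glob_match("a/b/Makefile", "**/Makefile"))
print(resolve("foo.txt", [("*", "MIT"), ("*.txt", "CC0-1.0")]))
```

With the annotations listed in file order, `resolve` needs no priority logic: the loop simply keeps overwriting the result, which mirrors the DEP-5 behaviour described above.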