Reference text #1

Open
Seb35 opened this issue Jul 18, 2018 · 13 comments

Comments

@Seb35
Member

Seb35 commented Jul 18, 2018

@RouxRC @mdamien: I am testing with the French constitutional bill (see also Legilibre/DuraLex#7). Article 3 of the bill modifies articles 41 (in section I) and 45 (in section II) of the Constitution. In section I, the 1° adds an alinea, and the 2° changes some words in the second alinea (which becomes the third once the 1° is applied). Given that the changed words appear only once, there is no doubt that the source text in I-2° is the original text.

Do you know whether there are rules somewhere about "when to restart the cursor"? That is, are there cases where you modify text already modified in an earlier section, or is the source text always the original text (in force)?

@Seb35
Member Author

Seb35 commented Jul 18, 2018

If you want to test, it should be something like:

  • git clone https://github.com/Legilibre/SedLex && cd SedLex
  • put the attached JSON in a file pl0911-article3.json (it comes from DuraLex, with article 3 extracted)
  • mkdir textes && git clone -b fichiers https://github.com/Assemblee-Citoyenne/constitution-francaise-4-octobre-1958 textes/constitution
  • cat pl0911-article3.json|./bin/sedlex --diff --repository=textes|jq -r '..|.diff?|strings'

You will notice that the diffs corresponding to I-1° and II are correct, but the one for I-2° is not. I guess there is some issue with the cursor, but if the source text is always the original text, then the fix will be different.

{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[{"order":1,"type":"alinea-reference"}],"id":"41","type":"article-reference"}],"id":"constitution","type":"code-reference"},{"children":[{"type":"quote","words":"Les propositions de loi ou les amendements qui ne sont pas du domaine de la loi ou qui, hors le cas des lois de programmation, sont dépourvus de portée normative, et les amendements qui sont sans lien direct avec le texte déposé ou transmis en première lecture ne sont pas recevables.\nS'il apparaît au cours de la procédure législative qu'une proposition de loi ou un amendement est contraire à une habilitation accordée en vertu de l'article 38, le Gouvernement ou le président de l'assemblée saisie peut opposer l'irrecevabilité."}],"type":"word-definition"}],"editType":"replace","type":"edit"}],"order":1,"type":"header2"},{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[{"type":"quote","words":"intéressée"}],"position":"after","type":"word-reference"}],"order":2,"type":"alinea-reference"}],"id":"41","type":"article-reference"}],"id":"constitution","type":"code-reference"},{"children":[{"type":"quote","words":"sur une irrecevabilité au titre de l'un des cas prévus aux deux alinéas précédents"}],"type":"word-definition"}],"editType":"add","type":"edit"}],"order":1,"type":"header3"},{"children":[{"children":[{"children":[{"children":[{"children":[{"children":[{"type":"quote","words":"huit jours"}],"type":"word-reference"}],"order":2,"type":"alinea-reference"}],"id":"41","type":"article-reference"}],"id":"constitution","type":"code-reference"},{"children":[{"type":"quote","words":"trois jours pour les amendements et de huit jours pour les propositions de loi, dans les conditions fixées par la loi organique"}],"type":"word-definition"}],"editType":"replace","type":"edit"}],"order":2,"type":"header3"}],"order":2,"type":"header2"}],"order":1,"type":"header1"},{"children":[{"children":[{"children":[{"children":[{"children":[{"order":2,"type":"sentence-reference"}],"order":1,"type":"alinea-reference"}],"id":"45","type":"article-reference"}],"id":"constitution","type":"code-reference"}],"editType":"delete","type":"edit"}],"order":2,"type":"header1"}],"content":"I. - L'article 41 de la Constitution est ainsi modifié :\n1° Le premier alinéa est remplacé par les dispositions suivantes :\n\"Les propositions de loi ou les amendements qui ne sont pas du domaine de la loi ou qui, hors le cas des lois de programmation, sont dépourvus de portée normative, et les amendements qui sont sans lien direct avec le texte déposé ou transmis en première lecture ne sont pas recevables.\n\"S'il apparaît au cours de la procédure législative qu'une proposition de loi ou un amendement est contraire à une habilitation accordée en vertu de l'article 38, le Gouvernement ou le président de l'assemblée saisie peut opposer l'irrecevabilité.\" ;\n2° Le deuxième alinéa est ainsi modifié :\na) Après le mot : \"intéressée\" sont insérés les mots : \"sur une irrecevabilité au titre de l'un des cas prévus aux deux alinéas précédents\" ;\nb) Les mots : \"huit jours\" sont remplacés par les mots : \"trois jours pour les amendements et de huit jours pour les propositions de loi, dans les conditions fixées par la loi organique\".\nII. - La seconde phrase du premier alinéa de l'article 45 est supprimée.","isNew":false,"order":3,"type":"bill-article"}],"date":"2018-5-9","id":911,"legislature":15,"place":"assemblée nationale","type":"projet de loi","url":null}

Seb35 added a commit that referenced this issue Jul 18, 2018
At least in some texts (XVe-911, constitutional law), the source text is
always the original text.

Issue: #1
@Seb35
Member Author

Seb35 commented Jul 19, 2018

@promethe42: this might interest you: interesting/difficult issues in sight!

With commit eb0aa26 (always reset the source text to the original text), article 3 of project 911 works entirely (and I am preparing some changes to improve typography, such as orphan spaces at the end of sentences).

In the longer term this poses a difficult issue: we have to merge each individual change, and hence we need a robust merge tool. Instead of a text-based merge we could merge DuraLex trees. (I just tried git-merge on this article 3, and I had to resolve the conflict manually :-/)

The source text of the DuraLex tree is always the same (more or less the text in force), but after each individual change applied by SedLex (each verb) the DuraLex tree needs to be "rebased": e.g. when an alinea is added at the beginning of an article, we must increment the alinea counter [of this specific article] in the subsequent changes; the diffs generated by SedLex could then differ from the original ones. If we do something like this, we need a loop (FOR each change: 1/ SedLex-diff; 2/ SedLex-rebase; ROF) to apply a set of changes to a text.
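A minimal Python sketch of the rebase step on a DuraLex-like tree, to make the idea concrete. The node types (`article-reference`, `alinea-reference`) and keys (`id`, `order`, `children`) are taken from the JSON posted earlier in this thread; the names `rebase_alineas`, `make_edit` and `alinea_order` are hypothetical and are not part of SedLex or DuraLex.

```python
def rebase_alineas(node, article_id, from_order, shift=1, in_article=False):
    """Shift alinea orders >= from_order in references to one article.

    Assumption: alinea-reference nodes are descendants of the
    article-reference node they belong to, as in the posted JSON.
    """
    if node.get("type") == "article-reference":
        in_article = node.get("id") == article_id
    elif (in_article and node.get("type") == "alinea-reference"
            and "order" in node and node["order"] >= from_order):
        node["order"] += shift
    for child in node.get("children", []):
        rebase_alineas(child, article_id, from_order, shift, in_article)
    return node


def make_edit(article_id, alinea):
    """Build a tiny edit node shaped like the ones in the posted JSON."""
    return {"type": "edit", "children": [
        {"type": "code-reference", "id": "constitution", "children": [
            {"type": "article-reference", "id": article_id, "children": [
                {"type": "alinea-reference", "order": alinea}]}]}]}


def alinea_order(edit):
    """Read back the alinea order of an edit built by make_edit."""
    return edit["children"][0]["children"][0]["children"][0]["order"]


# Edits not yet processed: one on article 41, alinea 2; one on article 45.
pending = [make_edit("41", 2), make_edit("45", 1)]

# An alinea was just added at the beginning of article 41: rebase the rest.
for edit in pending:
    rebase_alineas(edit, "41", from_order=1)

print([alinea_order(e) for e in pending])  # [3, 1]
```

Only the reference to article 41 is shifted; the reference to article 45 is left alone, which is the "[of this specific article]" restriction mentioned above.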

PS: perhaps I’m a bit enthusiastic, but in the very long term, such an infrastructure should enable a git-blame (or pijul-credit) at character level, even tracing back to the amendment if we have enough good-quality data. E.g. we would be able to see that the last "e" in "menacées" in article 16 of the Constitution has been introduced (or not) by amendment 1924 of this constitutional law (wisdom of Commission and Government on this change ;-).

@mdamien

mdamien commented Jul 21, 2018

About your cursor question, I guess it's like the pastilles system: at first all the alineas are marked with their number, and they keep it until the end, even after other modifications have been applied. This makes it easier to write amendments and bills.

+1 for pijul character-based credit, would love to see it everywhere !

I applied the changes (with cat plconstit.json |./bin/sedlex --diff --repository=textes|jq -r '..|.diff?|strings' | patch -p0) and the output is the same as the PR here.
You would also need to apply Article 5 (which targets the fourth alinea) to get the same result. By the way, it's funny to see that it adds the word "paritaire", which was missing. This bill is full of typo fixes.

@Seb35
Member Author

Seb35 commented Jul 22, 2018

Ok, thanks for the clarification about the pastilles.

Even if individual changes in articles 1 to 10 work (with a small fix in DuraLex trees of articles 6 and 7), the merges could need a rebase operation depending on the articles of the Constitution.

@RouxRC

RouxRC commented Jul 23, 2018

Ola,
My understanding is that a bill does not assume all of it will be adopted, so the cursor should not change, and the alinea indices referred to should always be those of the original text in its version just before the bill.
But indeed, if we want one commit per amendment or per article, this makes it hard to handle things properly: I guess all diffs from the same amendment or article should always go into the same commit.

@Seb35
Member Author

Seb35 commented Aug 9, 2018

While reviewing the adopted amendments of the constitutional law, I see another issue related to this one: the two amendments 773 and 1047 add two different expressions (« , des mers et des océans. » and « , de la biodiversité. ») after the same word (« environnement »). Here the two additions can be combined as a list of expressions without any difficulty, but that will not always be the case; also, I’m not sure whether there is a canonical order in which to apply these conflicting patches or whether this is a case where SedLex should trigger a warning. If automatic merging is done, typography should be handled in this case (the full stop is wrong here in both amendments).

Do you have any experience with the state of the art in this case? And do you have an opinion about how SedLex should behave?

@Seb35
Member Author

Seb35 commented Aug 9, 2018

In fact the issue mentioned above arises twice: in the projet de loi, two new articles will be added after article 2 in some (defined or undefined?) order, and when the projet de loi is applied with these two new articles, the issue mentioned above will occur. Or possibly these two amendments will be merged into a single article by the services de l’Assemblée, who will resolve the conflict manually.

This is a subtle issue, which only occurs when merging two amendments; it is not important in the short to medium term.

@RouxRC

RouxRC commented Aug 15, 2018

Interesting case indeed! I don't think the National assembly services will merge the two articles, they should remain as two extra articles after article 2 and I guess in the end (if the text goes any further...) it would be the SGG that would decide how to implement the two additions into the constitution.
I'm totally guessing here, but considering they like to follow strict rules, I'd imagine they will apply the modifications mechanically in the order they arrive, which in the present case would place the second insertion between « environnement » and the first insertion?
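To illustrate what "mechanically, in order of arrival" would produce, here is a small Python sketch. It is only an assumption about the procedure, as guessed above; the helper names `insert_after` and `apply_in_order` and the sample phrase are made up for the example, and the arrival order of the two amendments is arbitrary here.

```python
def insert_after(text, anchor, addition):
    """Insert `addition` right after the first occurrence of `anchor`."""
    index = text.find(anchor)
    if index < 0:
        raise ValueError(f"anchor not found: {anchor!r}")
    end = index + len(anchor)
    return text[:end] + addition + text[end:]


def apply_in_order(text, insertions):
    """Apply (anchor, addition) pairs in arrival order.

    Also returns the anchors used more than once, so that a caller could
    emit a warning instead of silently merging.
    """
    seen, conflicts = set(), []
    for anchor, addition in insertions:
        if anchor in seen:
            conflicts.append(anchor)
        seen.add(anchor)
        text = insert_after(text, anchor, addition)
    return text, conflicts


# Made-up sample phrase; the additions are the ones quoted in this thread.
sample = "de l'environnement"
result, conflicts = apply_in_order(sample, [
    ("environnement", ", des mers et des océans."),
    ("environnement", ", de la biodiversité."),
])
print(result)     # de l'environnement, de la biodiversité., des mers et des océans.
print(conflicts)  # ['environnement']
```

The second insertion lands between « environnement » and the first one, and the misplaced full stops become visible, so a warning on repeated anchors may be more useful than a silent merge.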

@Seb35
Member Author

Seb35 commented Oct 7, 2018

I just thought about this issue again. Currently the diffs generated by SedLex are only at the level of the verbs, roughly each sentence in each article of a pjl/ppl or of an amendment. It might be better to create diffs at the level of each article and globally for the pjl/ppl, but this requires "rebasing" the relevant later references.

For instance, suppose you add an alinea at the beginning, add a sentence after the second sentence of the second alinea, and add some words at the end of the third sentence of the second alinea (everything referenced from the initial text). To apply all these operations in a standalone manner (without relying on an external text-merging program):

  1. operation: you add your initial alinea; rebase: in the non-processed patches (= edit verbs), you increment the alinea orders greater than or equal to 1 (that is, all alineas) for this specific article;
  2. operation: you add your sentence as the third sentence of the third (initially second) alinea; rebase: in the non-processed patches, you increment the sentence orders greater than or equal to 3 for this specific alinea in this specific article;
  3. operation: you add your words at the end of the fourth (initially third) sentence of the third (initially second) alinea; rebase: nothing (*)

(*) or, to be extra careful, some kind of mark could be devised to better manage word-level insertions, e.g. if your amendment adds some words after the word "intéressé" but a previous amendment added that same word before the initial occurrence…
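The three operations listed above can be traced with a flat model of the references, which also checks the numbers given in the list. This is a sketch under simplifying assumptions: a reference is reduced to (article, alinea, sentence), and `Ref`, `after_alinea_insert` and `after_sentence_insert` are hypothetical names, not SedLex code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Ref:
    """Simplified reference, numbered against the initial text."""
    article: str
    alinea: int
    sentence: Optional[int] = None


def after_alinea_insert(refs, article, at):
    """An alinea was inserted at position `at`: shift alineas >= at."""
    return [replace(r, alinea=r.alinea + 1)
            if r.article == article and r.alinea >= at else r
            for r in refs]


def after_sentence_insert(refs, article, alinea, at):
    """A sentence was inserted at position `at`: shift sentences >= at."""
    return [replace(r, sentence=r.sentence + 1)
            if (r.article == article and r.alinea == alinea
                and r.sentence is not None and r.sentence >= at) else r
            for r in refs]


# Targets of operations 2 and 3, as written against the initial text.
op2 = Ref("41", alinea=2, sentence=2)  # add a sentence after this one
op3 = Ref("41", alinea=2, sentence=3)  # add words at the end of this one

# Operation 1: new alinea at the beginning, then rebase what is pending.
op2, op3 = after_alinea_insert([op2, op3], "41", at=1)

# Operation 2: the new sentence becomes number op2.sentence + 1 in that alinea.
new_sentence = op2.sentence + 1
(op3,) = after_sentence_insert([op3], "41", op2.alinea, at=new_sentence)

print(op2)  # Ref(article='41', alinea=3, sentence=2)
print(op3)  # Ref(article='41', alinea=3, sentence=4)
```

Operation 3 ends up targeting the fourth sentence of the third alinea, as in the list; the word-level case from the footnote is not covered by this model.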

Thinking about this, the current logic in AddDiffVisitor could be extended to larger scales, but you need to manage the states of all non-processed patches for all scales, which would either create headaches if you want an arbitrary-scale algorithm or require some hardcoded scales. And in either case you need to recompute everything from the start, because it would be difficult to store these states.

Or probably a better way would be to solve this issue #1 and the exact diffs #3 together:

  • you apply the operations sequentially by tagging them (e.g. added text is wrapped in a tag around YOUR NEW TEXT, removed text is wrapped in a tag around YOUR OLD TEXT),
  • you store each patch in a hierarchical summary (pjl = article 1 + article 2; article 1 = patches with uuid XYZ1, XYZ2; article 2 = patches with uuid XYZ3),
  • before applying an operation, you reverse the previous changes to obtain the initial text, keeping the mapping between the tagged text and the original text.

In this way you can easily store half-computed projected texts, you can apply (almost) any combination of patches (assuming the patches are independent), you can easily construct the dependencies between patches by searching for tags within tags, you can obtain the exact diff of a combination of patches, you get a (git-)blame at word level, and your word-level git-blame is the exact diff (not created externally by an independent program).
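One possible way to model the tagging idea in Python, as a sketch rather than the actual design: the text is kept as a list of segments, each marked as original, inserted or deleted and carrying a patch identifier. All names (`Segment`, `insert_after`, `replace_words`, the patch ids) and the shortened sample sentence are assumptions for the example, and an anchor is assumed to lie entirely inside one untouched original segment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    kind: str = "orig"           # "orig", "ins" or "del"
    patch: Optional[str] = None  # identifier of the patch that made the change


def original(segments):
    """Initial text: everything except inserted segments."""
    return "".join(s.text for s in segments if s.kind != "ins")


def projected(segments):
    """Text after the patches: everything except deleted segments."""
    return "".join(s.text for s in segments if s.kind != "del")


def _locate(segments, needle):
    """Find `needle` in the original text only, never in tagged text."""
    for i, seg in enumerate(segments):
        if seg.kind == "orig":
            pos = seg.text.find(needle)
            if pos >= 0:
                return i, pos
    raise ValueError(f"not found in original text: {needle!r}")


def insert_after(segments, anchor, addition, patch):
    i, pos = _locate(segments, anchor)
    seg, cut = segments[i], pos + len(anchor)
    pieces = [Segment(seg.text[:cut]), Segment(addition, "ins", patch),
              Segment(seg.text[cut:])]
    return segments[:i] + [p for p in pieces if p.text] + segments[i + 1:]


def replace_words(segments, old, new, patch):
    i, pos = _locate(segments, old)
    seg, end = segments[i], pos + len(old)
    pieces = [Segment(seg.text[:pos]), Segment(old, "del", patch),
              Segment(new, "ins", patch), Segment(seg.text[end:])]
    return segments[:i] + [p for p in pieces if p.text] + segments[i + 1:]


# Shortened, made-up sentence; the anchors come from the bill quoted above.
source = "l'assemblée intéressée statue dans un délai de huit jours."
segs = [Segment(source)]
segs = insert_after(segs, "intéressée", " sur une irrecevabilité", "I-2-a")
segs = replace_words(segs, "huit jours", "trois jours", "I-2-b")

print(projected(segs))
print(original(segs) == source)  # True
print([(s.text, s.patch) for s in segs if s.kind != "orig"])  # word-level blame
```

Because anchors are searched only in original segments, every patch is resolved against the initial text, and the list of non-original segments doubles as both the exact diff and the word-level blame.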

@Seb35
Member Author

Seb35 commented Oct 7, 2018

Also, a git-blame-like feature requires uniquely identifying a patch, roughly the amendment which last changed the text, but it could also be the initial pjl/ppl. I’m a bit lost in the numbering of the reference texts on the AN website, but either the URLs or the identifiers could serve as a first identifier, although I’m not sure the URLs are really permanent.

For an amendment I see http://www.assemblee-nationale.fr/15/amendements/0857/AN/98.asp for instance, with the législature, the reference text, the category (AN/commission), and the amendment identifier.
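A small Python sketch of how such a URL could be split into an identifier. The path pattern is inferred from the single example URL above and may not hold for other amendments or legislatures; `parse_amendment_url` and the field names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Pattern inferred from one example URL only (an assumption).
AMENDMENT_PATH = re.compile(
    r"/(?P<legislature>\d+)/amendements/(?P<text>[^/]+)"
    r"/(?P<category>[^/]+)/(?P<amendment>[^/]+)\.asp"
)


def parse_amendment_url(url):
    """Return the URL components as a dict, or None if the path differs."""
    match = AMENDMENT_PATH.fullmatch(urlparse(url).path)
    return match.groupdict() if match else None


parts = parse_amendment_url(
    "http://www.assemblee-nationale.fr/15/amendements/0857/AN/98.asp")
print(parts)
# {'legislature': '15', 'text': '0857', 'category': 'AN', 'amendment': '98'}
```

The values are kept as strings so that leading zeros such as in 0857 are preserved.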

For the reference text, I see http://www.assemblee-nationale.fr/15/textes/0857.asp and http://www.assemblee-nationale.fr/15/ta-commission/r0857-a0.asp; I’m not sure what the role of each text is (I haven’t compared these two texts (and others?) exactly yet).

@mdamien

mdamien commented Oct 8, 2018

The main difference is, IMHO, that /textes/ URLs display the alinea numbers. Side note: the HTML of those texts is easier to parse, but it looks like they run their own parser over the original Word document to produce those pages.

@RouxRC

RouxRC commented Oct 8, 2018

I agree: /textes/ are probably the best (and their URLs are easier to generate generically).
However, they only started appearing for texts in the middle of the 14th legislature, and I'm not sure it works with all texts yet.

@Seb35
Member Author

Seb35 commented Dec 13, 2018

This is implemented, not in SedLex but in Durafront for now. It could possibly be moved to SedLex, to be discussed. Compared to the discussion above, Durafront is currently not able to apply an amendment to an arbitrary text (e.g. an article of a pjl/ppl) but only to a code; this will be implemented soon, and I will probably use amendment identifiers as discussed above.

Seb35 referenced this issue in Legilibre/Durafront Jan 3, 2019
As suggested by @promethe42