Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compress pptxt report of adjacent spaces/dashes #641

Merged

Conversation

windymilla
Copy link
Collaborator

Instead of reporting every occurrence of multiple spaces or multiple dashes separately, report once per line, with a count of how many occurrences, e.g.
23.4: (x3) a ---- a ---- a ---- a

Fixes #640

Instead of reporting every occurrence of multiple spaces
or multiple dashes separately, report once per line,
with a count of how many occurrences, e.g.
`23.4: (x3) a ---- a ---- a ---- a`
@windymilla windymilla requested review from rtonsing and srjfoo January 4, 2025 20:52
@windymilla
Copy link
Collaborator Author

Testing notes:
This file from the linked issue has both spaces and dashes.
Run Tools->PPtext and scroll down results to see compressed output. In master each occurrence is reported on a separate line.

Copy link

@rtonsing rtonsing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! I like the count shown in red.

There are still repetitions in "Character checks" for '+' and '|' if you want to handle those.

@windymilla
Copy link
Collaborator Author

There are still repetitions in "Character checks" for '+' and '|' if you want to handle those.

OK - those are now done.

Prior to this, PPtxt was very much based on the online version, which was based on quite an old tool. IMHO, we should consider a few more changes, for example, several of the characters in "Character checks" are by no means suspicious these days, now that we're not restricted to ASCII. Fractions would be one example, but I'm sure there are others that often appear in that check - if too many characters are reported, the check becomes useless because the user can't be bothered to check all of them.

@rtonsing
Copy link

rtonsing commented Jan 4, 2025

Verified.

I agree about fractions and some other characters, super/subscripts for example. but I'm sure we will need multiple opinions. I'm not sure about the em-dash check either.

@srjfoo
Copy link
Member

srjfoo commented Jan 5, 2025

Okay, I've got a question for y'all. The adjacent space check picks up on lines 360, 361 and 370. What about lines 361-366 and 371-375? They have a bunch of adjacent spaces, too. The only difference I see is that the lines that were identified have something in the first column, and the lines that weren't don't have anything in the first column. Is that because of the "(poetry, block-quotes, etc., are ignored)" comment.

I'm assuming that's correct behavior, but since I'm not used to it, it doesn't quite make sense to me, so I thought I'd ask -- it's certainly the same as what's in master.

The other thing that jumped out at me was the HTML tags getting highlighted for the / in the closing tag. Oh. Looking closer, there must be logic to abandon the check after a certain number of hits, after concluding that they're HTML tags? And since the <i> and <sc> elements are closer to the top of the file, the check is abandoned before it gets to the <b>s?

Anyway, since y'all didn't mention the table rows that have multiple spaces, but nothing in the first column, I'm assuming that you're okay with that. I'll go ahead and approve, because it certainly seems to work well otherwise, and matches the way the old version selects rows to show.

@windymilla
Copy link
Collaborator Author

Okay, I've got a question for y'all. The adjacent space check picks up on lines 360, 361 and 370. What about lines 361-366 and 371-375? They have a bunch of adjacent spaces, too. The only difference I see is that the lines that were identified have something in the first column, and the lines that weren't don't have anything in the first column. Is that because of the "(poetry, block-quotes, etc., are ignored)" comment.

I'm assuming that's correct behavior, but since I'm not used to it, it doesn't quite make sense to me, so I thought I'd ask -- it's certainly the same as what's in master.

Whether it's "correct" or not is a matter of opinion, I guess, but it is "intended", and AFAIK the same as other versions of pptext.

The other thing that jumped out at me was the HTML tags getting highlighted for the / in the closing tag. Oh. Looking closer, there must be logic to abandon the check after a certain number of hits, after concluding that they're HTML tags? And since the <i> and <sc> elements are closer to the top of the file, the check is abandoned before it gets to the <b>s?

Quite a few of the checks in pptext have limits on the number of similar warnings they output. In fact, I think pptext is/was supposed to be run on a text file, so there should be no <i>, <sc>, etc., by that stage. It would also mean that the whole table would be indented (rather than left-aligned but enclosed in /X markup) so it wouldn't report multiple spaces, etc.

At some point, I think we should consider how much we want to allow pptext to move away from the historical behavior of the old GG1 version and the online version.

There are several changes that could potentially make it more useful:

  • the way it lists the first 5 errors of one type, and then just says "there are more of these"
  • several more characters could be removed from the ones it flags as unusual
  • floating double quotes with 2 spaces either side should be ignored (since they are probably ditto marks)
  • remove good em-dashes from the em-dash report
  • mixed case words which are actually Capitalized-Hyphenated shouldn't be reported
  • don't report .123 as a spaced period since it has digits after it
  • add more exceptions to double punctuation report, e.g. pp., and maybe single capital letter before period, e.g. N. Y.,

@windymilla windymilla merged commit 81af1d6 into DistributedProofreaders:master Jan 5, 2025
1 check passed
@windymilla windymilla deleted the pptxt-multiples branch January 5, 2025 13:04
@rtonsing
Copy link

rtonsing commented Jan 5, 2025

Okay, I've got a question for y'all. The adjacent space check picks up on lines 360, 361 and 370. What about lines 361-366 and 371-375? They have a bunch of adjacent spaces, too. The only difference I see is that the lines that were identified have something in the first column, and the lines that weren't don't have anything in the first column. Is that because of the "(poetry, block-quotes, etc., are ignored)" comment.

I consider that a problem, although different from this original issue. I don't see any reason not to also report adjacent spaces on lines with leading spaces. Guiguts 1.x also does not report them, but online pptext does:

362: 57 | 52 | | 470 | 110 | 33 | 18 | 99 | 2972 | 704 | 340 | 2⅝ | 6⅜ | 3¼ | 8¾ | 14 | 12

@rtonsing
Copy link

rtonsing commented Jan 5, 2025

  • the way it lists the first 5 errors of one type, and then just says "there are more of these"
    

But online pptext has a 'verbose' option to list all of them

  • several more characters could be removed from the ones it flags as unusual
    

Agree

  • floating double quotes with 2 spaces either side should be ignored (since they are probably ditto marks)
    

Agree

  • remove good em-dashes from the em-dash report
    

Agree

  • mixed case words which are actually Capitalized-Hyphenated shouldn't be reported
    

Agree

  • don't report  .123 as a spaced period since it has digits after it
    

Agree

  • add more exceptions to double punctuation report, e.g. pp., and maybe single capital letter before period, e.g. N. Y.,
    

Agree

I just ignore the issues reported for HTML and block markup, but I do want to fix other issues before generating the HTML. An option to ignore standard PGDP markup would be nice, I agree they should be reported for the final text version. But of course, providing options for PPtxt would be a whole new thing.

@windymilla
Copy link
Collaborator Author

I consider that a problem, although different from this original issue. I don't see any reason not to also report adjacent spaces on lines with leading spaces. Guiguts 1.x also does not report them, but online pptext does:

362: 57 | 52 | | 470 | 110 | 33 | 18 | 99 | 2972 | 704 | 340 | 2⅝ | 6⅜ | 3¼ | 8¾ | 14 | 12

Thanks for this and your other comments @rtonsing - I'll link to this discussion from a new issue I'm going to put in as a future feature to review pptext's error output.

@rtonsing
Copy link

rtonsing commented Jan 5, 2025

Thanks, I was wondering if I should add a new issue.

It is about time I looked into making updates to online pptext, myself.

@windymilla
Copy link
Collaborator Author

It is about time I looked into making updates to online pptext, myself.

In an ideal world, they would both behave the same, but I don't think that's practical given their very different environments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Minor enhancement to PPtxt output: suppress repeated adjacent dash and character checks reporting on same line
3 participants