Ignore accented letters #120

juanfal · 2023-08-10T17:08:19Z

I love bfs. I have been using it since I discovered it, and I keep telling others to do the same. The only thing I really miss is a way to ignore accented characters; for that I sometimes resort to Spotlight's terminal search, which is much slower and harder to control. That is, when mecánica.pdf is there but you don't know whether the name has the accent or not,

bfs /Volumes/Libros/BOOKS -iregex ".*$*.*" 2>/dev/null

would fail. What I miss is something like an -iuregex option.

tavianator · 2023-08-10T18:54:36Z

tavianator
Aug 10, 2023
Maintainer

I can definitely see why that would be useful! It's also probably hard to implement.

Related: sharkdp/fd#638

juanfal · 2023-08-11T08:46:31Z

juanfal
Aug 11, 2023
Author

I totally agree. IMHO the problem lies at the root of how regular expressions are implemented: they completely lack this concept. Using

-E -iregex 'corazo.?n'

would correctly match corazon as well as corazón, which is the closest I can get to it.

So

-E -iregex 'c.?o.?r.?a.?z.?o.?n.?'

could serve as a poor man's -iuregex.
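The hand-written pattern above can be generated mechanically. This is a minimal sketch in Python, assuming a hypothetical helper named loosen (not part of bfs) and using Python's re module as a stand-in for whatever regex engine bfs is built with:

```python
import re
import unicodedata


def loosen(word: str) -> str:
    """Insert an optional wildcard (.?) after every character of `word`.

    Hypothetical helper for illustration only; bfs has no such feature.
    """
    return "".join(re.escape(ch) + ".?" for ch in word)


pattern = loosen("corazon")
print(pattern)

# In a decomposed (NFD) name the accent is a separate code point,
# so one of the optional wildcards can absorb it.
decomposed = unicodedata.normalize("NFD", "coraz\u00f3n")
print(re.fullmatch(pattern, decomposed, re.IGNORECASE) is not None)

# The wildcards are not accent-aware: unrelated characters match too.
print(re.fullmatch(pattern, "coXrazon", re.IGNORECASE) is not None)
```

Because every .? accepts any character, this kind of pattern should be expected to produce false positives.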

2 replies

tavianator Aug 11, 2023
Maintainer

That will only work if the ó is decomposed into o + a combining accent. This happens naturally on macOS due to NFD normalization, but on other platforms it's more likely a precomposed ó will be used, which won't be matched by o.?. -iregex 'coraz[[=o=]]n' should work in both cases, if the regex engine supports it.
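The difference between the two forms can be checked outside of bfs. This is a small sketch in Python, assuming its re module as a stand-in for the regex engine. Python's re has no [[=o=]] equivalence-class syntax, so only the o.? behaviour is shown:

```python
import re
import unicodedata

# The same visible name in both Unicode normalization forms.
nfc = unicodedata.normalize("NFC", "coraz\u00f3n")  # precomposed U+00F3
nfd = unicodedata.normalize("NFD", "coraz\u00f3n")  # "o" followed by U+0301

pattern = re.compile(r"corazo.?n", re.IGNORECASE)

# NFD: the literal "o" matches the base letter and ".?" absorbs the
# combining acute accent.
matches_nfd = pattern.fullmatch(nfd) is not None

# NFC: there is no plain "o" at that position, only the single code
# point U+00F3, so the literal "o" in the pattern cannot match.
matches_nfc = pattern.fullmatch(nfc) is not None

print(len(nfd), matches_nfd)
print(len(nfc), matches_nfc)
```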

tavianator Aug 11, 2023
Maintainer

Having just tested it, it appears Oniguruma does not support that syntax: kkos/oniguruma#288

tavianator · 2023-08-12T15:26:41Z

tavianator
Aug 12, 2023
Maintainer

One potential idea would be to use https://github.com/laurikari/tre (agrep's regex implementation) to do approximate matching. E.g.

$ echo corazón | agrep -0 'corazon'  # Exact match
$ echo corazón | agrep -1 'corazon'  # Approximate match, within 1 edit
corazón

It could be something like:

$ bfs -aregex 1 '.*/corazon'
./corazon
./corazón
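To make "within 1 edit" concrete, here is a rough sketch in Python of plain Levenshtein distance between whole names. This is an assumption-laden simplification: TRE performs approximate matching of regular expressions, which this does not attempt, and the function names are made up for illustration.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def within(name: str, query: str, max_edits: int) -> bool:
    """True if `name` is at most `max_edits` edits away from `query`."""
    return edit_distance(name, query) <= max_edits


names = [
    "corazon",
    "coraz\u00f3n",                                # precomposed accent
    unicodedata.normalize("NFD", "coraz\u00f3n"),  # decomposed accent
    "cabeza",
]
for name in names:
    print(name, within(name, "corazon", 1))
```

Under this measure a precomposed accent costs one substitution and a decomposed accent costs one insertion, so both forms fall within one edit of the unaccented query.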

juanfal · 2023-08-15T19:22:21Z

juanfal
Aug 15, 2023
Author

It is a great approach! It also accepts different letters as matches, but it is a good compromise:

$ echo corazén | agrep -1 'corazon'
corazén

The only (heuristic) solution I see, based on what I have found (iconv, uregex <https://www.unicode.org/reports/tr18/#character_ranges>, equivalent non-accented chars <https://www.drillio.com/en/2011/java-remove-accent-diacritic/>, another <https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net>), is to take a heuristic list like the following into account when implementing this -uregex:

[ { "äæǽ", "ae" }, { "öœ", "oe" }, { "ü", "ue" }, ...
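One way such a heuristic could look is sketched below in Python, using the standard unicodedata module. The function name and the EXTRA table are assumptions made for illustration. Unlike the list above, this sketch simply drops accents (ü becomes u, not ue) and uses a table only for letters that have no canonical decomposition:

```python
import unicodedata

# Hypothetical table for letters that NFD cannot split into
# a base letter plus combining marks.
EXTRA = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ø": "o", "Ø": "O", "ß": "ss"}


def fold_accents(text: str) -> str:
    """Return `text` with combining marks removed and EXTRA applied."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks such as U+0301.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA.get(ch, ch) for ch in stripped)


def same_ignoring_accents(name: str, query: str) -> bool:
    """Compare two names after folding accents and case on both sides."""
    return fold_accents(name).casefold() == fold_accents(query).casefold()


print(fold_accents("mec\u00e1nica.pdf"))
print(same_ignoring_accents("Coraz\u00f3n", "corazon"))
```

Folding both the file name and the query before comparing them means the result does not depend on whether the file system stores names precomposed or decomposed.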
