Whitespace between headers #6

bradenneal1 · 2020-08-31T04:11:47Z

The regular expression MESSAGE_REGEX does not allow whitespace (including newlines) between headers. For example, if the test message MESSAGE1 is defined as:

MESSAGE1 = """{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{4: :20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
-}"""

It does not parse:

>>> import mt103
>>> message = mt103.MT103(MESSAGE1)
>>> message.text
>>>

Redefining the regex to accept whitespace characters between headers:

MESSAGE_REGEX = re.compile(
    r"^"
    r"({1:(?P<basic_header>[^}]+)})?\s*"
    r"({2:(?P<application_header>(I|O)[^}]+)})?\s*"
    r"({3:"
        r"(?P<user_header>"
            r"({113:[A-Z]{4}})?"
            r"({108:[A-Z 0-9]{0,16}})?"
            r"({111:[0-9]{3}})?"
            r"({121:[a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-4[a-zA-Z0-9]{3}-[89ab][a-zA-Z0-9]{3}-[a-zA-Z0-9]{12}})?\s*"  # NOQA: E501
        r")"
    r"})?"
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"({5:(?P<trailer>.+)})?"
    r"$",
    re.DOTALL
)

solves the issue:

>>> import mt103
>>> message = mt103.MT103(MESSAGE1)
>>> message.text
:20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
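
The effect can be reproduced without the library. The sketch below uses a reduced, hypothetical pair of patterns covering only blocks 1 and 2 (STRICT, LENIENT, and HEADERS are illustrative names, not part of mt103, and neither pattern is the real MESSAGE_REGEX). It only illustrates how a single \s* between two optional groups changes the result:

```python
import re

# Reduced, hypothetical patterns: blocks 1 and 2 only, not the real MESSAGE_REGEX.
STRICT = re.compile(
    r"^"
    r"({1:(?P<basic_header>[^}]+)})?"
    r"({2:(?P<application_header>(I|O)[^}]+)})?"
    r"$",
    re.DOTALL,
)
LENIENT = re.compile(
    r"^"
    r"({1:(?P<basic_header>[^}]+)})?\s*"  # the only difference: \s* after block 1
    r"({2:(?P<application_header>(I|O)[^}]+)})?"
    r"$",
    re.DOTALL,
)

# The first two lines of MESSAGE1, separated by a newline.
HEADERS = "{1:F01ASDFJK20AXXX0987654321}\n{2:I103ASDFJK22XXXXN}"

strict_match = STRICT.match(HEADERS)    # no match: the newline sits between the blocks
lenient_match = LENIENT.match(HEADERS)  # matches: \s* consumes the newline

print(strict_match)
print(lenient_match.group("application_header"))
```

Because every block in the pattern is optional and the pattern is anchored with ^ and $, a single unexpected newline is enough to make the whole match fail rather than just the block after it.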


danielquinn · 2020-11-04T13:32:55Z

I'm not sure about this one. Is there somewhere in the spec that says it's OK to have newlines in these locations and not others? Your suggested changes are simple enough, and applying them does indeed mean that a message with newlines in it can be parsed, but I'm not clear on whether the MT103 message in question is valid with newlines in it, or whether your suggested placements cover all the cases where this would be a problem. Do you have a spec I can reference for confirmation?

I ask because the placement of the \s* bits seems strangely arbitrary. You've got one after every section except 5, and they only appear after a header but not between sections.

If this is valid:

{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{4: :20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
-}

Is this not?

{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{
4: :20:20180101-ABCDEF :23B:GHIJ :32A:180117CAD5432,1 :33B:EUR9999,0 :50K:/123456-75901 SOMEWHERE New York 999999 GR :53B:/20100213012345 :57C://SC200123 :59:/201001020 First Name Last Name a12345bc6d789ef01a23 Nowhere NL :70:test reference test reason payment group: 1234567-ABCDEF :71A:SHA :77B:Test this
-}

Might it be better to just call message.replace("\n", "") before parsing, or is that likely to break things elsewhere? Until I'm certain, I'm not keen on making this change. If you have something I can reference to be sure, that'd go a long way toward helping me figure this out.

bradenneal · 2020-11-04T20:08:40Z

I don't have a specification to provide unfortunately.

I was initially using message.replace("\n", ""), but came unstuck when parsing tags that contain more than one component. For example, if the above message were formatted as:

{1:F01ASDFJK20AXXX0987654321}
{2:I103ASDFJK22XXXXN}
{4:
:20:20180101-ABCDEF
:23B:GHIJ
:32A:180117CAD5432,1
:33B:EUR9999,0
:50K:/123456-75901
SOMEWHERE
New York
999999
GR
:53B:/20100213012345
:57C://SC200123
:59:/201001020
First Name Last Name
a12345bc6d789ef01a23
Nowhere
NL
:70:test reference
test reason
payment group:
1234567-ABCDEF
:71A:SHA
:77B:Test this
-}

Both the 50K and 59 tags follow the format Account, Name1, Name2, Address, City/Postal Code. With the newline characters removed, there is no way to determine where "Account" ends and "Name1" starts, and so on. Keeping the newlines (and making the parser insensitive to newlines between headers) allows message.ordering_customer.split('\n') to identify the individual components.
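
To make the difference concrete, here is a minimal sketch using a plain string in place of message.ordering_customer (the variable names are illustrative, and the value is the 50K tag content from the message above):

```python
# Content of tag 50K as it appears in the multi-line message above.
ordering_customer = "/123456-75901\nSOMEWHERE\nNew York\n999999\nGR"

# Newlines kept: each component sits on its own line and can be split out.
components = ordering_customer.split("\n")

# Newlines stripped before parsing: the component boundaries are lost.
flattened = ordering_customer.replace("\n", "")

print(components)  # ['/123456-75901', 'SOMEWHERE', 'New York', '999999', 'GR']
print(flattened)   # /123456-75901SOMEWHERENew York999999GR
```

Once flattened, nothing distinguishes the end of the account number from the start of the first name line, so the split cannot be recovered afterwards.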

bradenneal · 2020-11-04T20:09:52Z

> You've got one after every section except 5

That's an oversight on my part. I would consider a message with trailing whitespace still valid (but have simply never seen one).
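
For reference, a more uniform placement might look like the sketch below. It is a hypothetical variant of the pattern proposed earlier (MESSAGE_REGEX_SKETCH and SAMPLE are illustrative names, not part of mt103), with \s* allowed before the first block, between every pair of blocks, and after block 5. It has not been checked against any specification:

```python
import re

# Hypothetical variant: optional whitespace before, between, and after all blocks.
MESSAGE_REGEX_SKETCH = re.compile(
    r"^\s*"
    r"({1:(?P<basic_header>[^}]+)})?\s*"
    r"({2:(?P<application_header>(I|O)[^}]+)})?\s*"
    r"({3:"
        r"(?P<user_header>"
            r"({113:[A-Z]{4}})?"
            r"({108:[A-Z 0-9]{0,16}})?"
            r"({111:[0-9]{3}})?"
            r"({121:[a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-4[a-zA-Z0-9]{3}-[89ab][a-zA-Z0-9]{3}-[a-zA-Z0-9]{12}})?"  # NOQA: E501
        r")"
    r"})?\s*"  # whitespace after block 3 sits outside the user_header group
    r"({4:\s*(?P<text>.+?)\s*-})?\s*"
    r"({5:(?P<trailer>.+)})?\s*"  # trailing whitespace after block 5 is accepted
    r"$",
    re.DOTALL,
)

# Shortened sample with one block per line and a trailing newline.
SAMPLE = (
    "{1:F01ASDFJK20AXXX0987654321}\n"
    "{2:I103ASDFJK22XXXXN}\n"
    "{4:\n"
    ":20:20180101-ABCDEF\n"
    ":23B:GHIJ\n"
    "-}\n"
)

match = MESSAGE_REGEX_SKETCH.match(SAMPLE)
print(repr(match.group("text")))  # ':20:20180101-ABCDEF\n:23B:GHIJ'
```

This still rejects whitespace inside a block opener (such as a newline between "{" and "4:"), which seems the safer default until a specification says otherwise.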

danielquinn · 2020-11-09T15:41:39Z

Alright, I've had a conversation with some more financially minded people (as opposed to software-minded people like me), and it looks like line breaks are common in a message, so I'm going to make this change.

Do you perhaps have a few test messages I can use to ensure that everything works as expected? None of the messages I have access to contain line breaks.

Repository owner deleted a comment from studentforcode Nov 19, 2023
