YYYY-MM-DD interpreted as YYYY-DD-MM #1199
Comments
#790 by @Gallaecio might be related, but not sure if it's enough, because the date we do get is formatted in a more weird way, as |
According to Wikipedia, YDM is used in just a few countries: https://en.wikipedia.org/wiki/Calendar_date#Gregorian,_year–day–month_(YDM), but it looks like we're inferring that DMY (very popular) implies YDM (very rare) |
Came here to confirm this affects German as well, another language which uses DMY for local date formats, but also just found issue #765, which already reported this problem back in 2020... |
Interestingly, when it comes to ISO 8601 dates, DMY-related settings also seem to partially disable the built-in mechanism which swaps date components on impossible combinations, even though that mechanism could theoretically "save" 2/3 of the dates being misinterpreted (those with a day number > 12).
Examples parsing the correctly formatted ISO date "1960-12-23":
>>> dateparser.parse("1960-12-23") # default
datetime.datetime(1960, 12, 23, 0, 0)
>>> dateparser.parse("1960-12-23", languages=["en"]) # languages set to MDY language
datetime.datetime(1960, 12, 23, 0, 0)
>>> dateparser.parse("1960-12-23", languages=["de"]) # languages set to DMY language
# None (implicit)
>>> dateparser.parse("1960-12-23", languages=["en"], settings={"DATE_ORDER": "DMY"}) # languages set to MDY language, DATE_ORDER set to DMY
# None (implicit)
>>> dateparser.parse("1960-12-23", languages=["de"], settings={"DATE_ORDER": "MDY"}) # languages set to DMY language, DATE_ORDER set to MDY
datetime.datetime(1960, 12, 23, 0, 0)
...
The last two examples have the same result even with
Examples parsing the jumbled ISO date "1960-23-12":
>>> dateparser.parse("1960-23-12") # default
datetime.datetime(1960, 12, 23, 0, 0)
>>> dateparser.parse("1960-23-12", languages=["en"]) # languages set to MDY language
datetime.datetime(1960, 12, 23, 0, 0)
>>> dateparser.parse("1960-23-12", languages=["de"]) # languages set to DMY language
datetime.datetime(1960, 12, 23, 0, 0)
>>> dateparser.parse("1960-23-12", settings={"DATE_ORDER": "DMY"}) # DATE_ORDER set to DMY
datetime.datetime(1960, 12, 23, 0, 0)
...
The jumbled date is parsed correctly for all possible combinations of the above |
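To put a number on the "2/3" estimate above, here is a rough stdlib-only sketch that counts, for one calendar year, what happens to each YYYY-MM-DD date if it is read as YYYY-DD-MM. The function name `classify_ydm_misreads` and the three bucket names are made up for this illustration and have nothing to do with dateparser internals.

```python
from datetime import date, timedelta


def classify_ydm_misreads(year: int) -> dict:
    """Classify every date of `year` by the effect of reading YYYY-MM-DD as YYYY-DD-MM."""
    counts = {"detectable": 0, "harmless": 0, "silent": 0}
    day = date(year, 1, 1)
    while day.year == year:
        if day.day > 12:
            # The day lands in the month slot, which is impossible,
            # so a component swap could recover the date.
            counts["detectable"] += 1
        elif day.day == day.month:
            # Swapping day and month yields the same date.
            counts["harmless"] += 1
        else:
            # Both readings are valid dates, so the wrong one goes unnoticed.
            counts["silent"] += 1
        day += timedelta(days=1)
    return counts


print(classify_ydm_misreads(2023))
# {'detectable': 221, 'harmless': 12, 'silent': 132}
```

For 2023 that is 221 of 365 dates, roughly 60%, that an impossible-combination swap could in principle rescue, while 132 dates would be silently misread.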
@lopuhin It looks like the problem can be worked around by including the format codes for
>>> dateparser.parse("2023-11-08", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)
Other dates will continue to be interpreted as DMY:
>>> dateparser.parse("8.11.23", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)
>>> dateparser.parse("8/11/23", languages=["ar"], date_formats=["%Y-%m-%d"])
datetime.datetime(2023, 11, 8, 0, 0)
Same for other hyphenated dates, e.g. |
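A related workaround that avoids depending on locale settings at all is to try a strict ISO 8601 parse first and only hand everything else to dateparser. The sketch below uses only the standard library; `parse_iso_first` and its `fallback` parameter are hypothetical names for this example, and in practice one might pass something like `dateparser.parse` (or a lambda wrapping it with the desired languages) as the fallback.

```python
import re
from datetime import datetime
from typing import Callable, Optional

# Matches only the exact shape NNNN-NN-NN; anything else goes to the fallback.
ISO_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_iso_first(
    text: str,
    fallback: Optional[Callable[[str], Optional[datetime]]] = None,
) -> Optional[datetime]:
    """Parse strict YYYY-MM-DD with strptime, otherwise defer to `fallback`."""
    candidate = text.strip()
    if ISO_DATE_RE.fullmatch(candidate):
        try:
            return datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            # Right shape but not a valid YMD date (e.g. "1960-23-12");
            # let the fallback decide what to do with it.
            pass
    return fallback(candidate) if fallback is not None else None


print(parse_iso_first("2023-11-08"))  # 2023-11-08 00:00:00
print(parse_iso_first("1960-23-12"))  # None (no fallback given)
```

The trade-off is that the strict branch never consults the language or `DATE_ORDER`, which is the point here, but it also means a genuinely YDM-formatted input with a day of 12 or less would be read as YMD.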
Interesting, thanks for the suggestion, @keikoro. I wonder if passing |
I have a variation of this problem. I'm trying to parse invoice lines from a lot of different subcontractors. Most of them use some variation of DMY or YMD, but never YDM, which is what I'm currently fighting. The code example below has some tests. I was wondering if it'd be easy to amend the
The code below requires
import datetime
import dateparser.search
import dateparser.conf
import rich

def extract_dates(sample, debug=True):
    # Returns a list of (matched_text, datetime, language) tuples,
    # or None when no date is found in the sample.
    return dateparser.search.search_dates(
        sample,
        languages=['da', 'en'],
        settings={
            'PREFER_LOCALE_DATE_ORDER': True,
            'DATE_ORDER': 'DMY',
            'PREFER_DATES_FROM': 'past',
            'STRICT_PARSING': True,  # There must be a day, month, year
            'PARSERS': ['absolute-time']
        },
        add_detected_language=True,
    )

# Each test is (invoice line, expected date).
tests = [
    ("sdds 07/09/2024 1. kons", datetime.datetime(2024, 9, 7, 0, 0)),
    ("sdds 30. september 2024 første kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 30. september 2024 1. kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 30. september 2024 sdw 1. kons", datetime.datetime(2024, 9, 30, 0, 0)),
    ("sdds 2024-11-02 1. kons", datetime.datetime(2024, 11, 2, 0, 0)),
    ("sdds 4. kons 3. februar 2023", datetime.datetime(2023, 2, 3, 0, 0)),
    ("sdds 4. kons 2. marts 2023", datetime.datetime(2023, 3, 2, 0, 0)),
]

for sample, correct in tests:
    found = extract_dates(sample)
    if not found:
        # search_dates returns None when nothing matches.
        rich.print(f"[red]{sample}[/red] -- no date found")
        continue
    # Only the first match is compared against the expected date.
    date_part_of_string, result, language = found[0]
    if result == correct:
        rich.print(f"{date_part_of_string} -- [green]{sample}[/green] -- {result}")
    else:
        rich.print(f"{date_part_of_string} -- [red]{sample}[/red] -- {result}")
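For the purely numeric dates in invoice lines like these, one possible stopgap is to pull them out with explicit patterns before involving dateparser, so that YDM can never be chosen. This is only a sketch using the standard library: `find_numeric_dates` and the `ALLOWED_PATTERNS` table are hypothetical, they cover just the two numeric shapes that appear in the tests above (YYYY-MM-DD and DD/MM/YYYY), and dates with month names such as "30. september 2024" would still need dateparser.

```python
import re
from datetime import datetime

# Each entry pairs a regex for one numeric shape with the only strptime
# format allowed for it. YDM is deliberately absent.
ALLOWED_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),  # YMD
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "%d/%m/%Y"),  # DMY
]


def find_numeric_dates(line: str) -> list:
    """Return (matched_text, datetime) pairs in order of appearance."""
    hits = []
    for pattern, fmt in ALLOWED_PATTERNS:
        for match in pattern.finditer(line):
            try:
                parsed = datetime.strptime(match.group(), fmt)
            except ValueError:
                # Right shape, impossible date (e.g. month 23): skip it
                # rather than guess a different component order.
                continue
            hits.append((match.start(), match.group(), parsed))
    hits.sort(key=lambda hit: hit[0])
    return [(text, parsed) for _, text, parsed in hits]


print(find_numeric_dates("sdds 2024-11-02 1. kons"))
# [('2024-11-02', datetime.datetime(2024, 11, 2, 0, 0))]
```

Lines for which this returns an empty list could then be passed on to `search_dates` as before.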
YYYY-MM-DD is interpreted as YYYY-DD-MM for Arabic, and apparently also for other languages which prefer DMY order. This looks strange: if the year comes first, shouldn't we ignore DMY / MDY and just use YMD for all locales?
Examples:
Side note: in reality the US also uses MDY date order, so if we interpreted en as en-US and it had MDY set, we'd parse a lot more dates incorrectly.
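The rule proposed above (year first means YMD, regardless of locale) could be sketched roughly as follows. This is not how dateparser currently decides the order; `effective_date_order` and `YEAR_FIRST_RE` are hypothetical names, and the regex is an assumption about what "year first" should mean (a four-digit number followed by two one- or two-digit groups).

```python
import re

# A leading four-digit group followed by two shorter numeric groups,
# separated by "-", "/" or ".".
YEAR_FIRST_RE = re.compile(r"\s*\d{4}[-/.]\d{1,2}[-/.]\d{1,2}\b")


def effective_date_order(text: str, locale_order: str) -> str:
    """Pick YMD when the string starts with a 4-digit year, else the locale's order."""
    if YEAR_FIRST_RE.match(text):
        return "YMD"
    return locale_order


print(effective_date_order("2023-11-08", "DMY"))  # YMD
print(effective_date_order("8.11.23", "DMY"))     # DMY
```

With a rule like this, the locale's DMY or MDY preference would only apply to strings that do not lead with the year.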