Archive for category Python
With the pandas library, if you have read in a csv file with a date field that is sometimes empty and are getting the error:
TypeError: can't compare offset-naive and offset-aware datetimes
It may be caused by dateutil.parser.parse which is the default function for parsing dates when the csv file is read. This function returns the current non-timezone aware date when given an empty string as input. According to the dateutil documentation “the default value is the current date, at 00:00:00am.” This causes confusion in the context of pandas for three reasons:
- the data in the DataFrame is not derived from the source CSV file
- the expected value of an empty field is numpy.NaN.
- the returned datetime object is not time zone aware.
Fortunately pandas.read_csv has the date_parser argument which allows you to include your own parsing function. One might thing that the following function is the right fix…
def date_or_nan_parse(str): if not(str): return numpy.NaN return dateutil.parser.parse(str)
BUT NO. If you then try to compare a date against the pandas.Series of the dates the same “TypeError: can’t compare offset-naive and offset-aware datetimes” may get thrown. In fact, both of these will fail:
df['datefield'] > dateutil.parser.parse("January 1 1901 00:00 UTC") df['datefield'] > dateutil.parser.parse("January 1 1901 00:00")
This time the problem is that the comparison fails on the numpy.NaN values.
My fix is:
OLD_DATE = dateutil.parser.parse("January 1, 1901 00:00 UTC") def date_or_nan_parse(str): if not(str): return OLD_DATE return dateutil.parser.parse(str)
In this case I introduce a new value rather than (the more correct) numpy.NaN. The new value is a date so it doesn’t fail during comparison operations (assuming that OLD_DATE has the same time zone awareness as the rest of your data). And at least it is a date that I’ve explicitly chosen, rather than just the current date.
I hope this saves you some confusion.