One of the most difficult tasks in data processing and cleansing is to handle dates. The main problem is that they appear in many different formats which adds extra complexity. Today, we will give an example of how we can easily detect different date formats in a document and convert all of them in a consistent unique format, our preference.
Detect different Date format in a text document
Let’s consider the following document:
“Today is 29th of May, 2020. In UK is written as 05/29/2020, or 5/29/2020 or you can even find it as 5/29/20. Instead of /
you can either use -
or .
like 5-29-2020 or 5.29.2020. In Greece, the date format is dd/mm/yy, which means that today is 29/05/2020 and again we can use dots or hyphen between date parts like 29-05-20 or 29.05.2020. We can also add time, like 29/05/2020 19:30. Personally, my favorite date format is Y/mm/dd so, I would be happy to convert easily all these different dates to 2020/05/29“
We will use the datefinder library to detect the different date formats.
import re import datefinder my_txt="""Today is 29th of May, 2020. In UK is written as 05/29/2020, or 5/29/2020 or you can even find it as 5/29/20. Instead of / you can either use - or . like 5-29-2020 or 5.29.2020. In Greece, the date format is dd/mm/yy, which means that today is 29/05/2020 and again we can use dots or hyphen between date parts like 29-05-20 or 29.05.2020. We can also add time, like 29/05/2020 19:30. Personally, my favorite date format is Y/mm/dd so, I would be happy to convert easily all these different dates to 2020/05/29""" matches = datefinder.find_dates(my_txt, source=True) detect_dates=[] for match in matches: detect_dates.append(match) detect_dates
[(datetime.datetime(2020, 5, 29, 0, 0), '29th of May, 2020'),
(datetime.datetime(2020, 5, 29, 0, 0), '05/29/2020'),
(datetime.datetime(2020, 5, 29, 0, 0), '5/29/2020'),
(datetime.datetime(2020, 5, 29, 0, 0), '5/29/20'),
(datetime.datetime(2020, 5, 29, 0, 0), '5-29-2020'),
(datetime.datetime(2020, 5, 29, 0, 0), '5.29.2020'),
(datetime.datetime(2020, 5, 29, 0, 0), '29/05/2020'),
(datetime.datetime(2020, 5, 29, 0, 0), '29-05-20'),
(datetime.datetime(2020, 5, 29, 0, 0), '29.05.2020'),
(datetime.datetime(2020, 5, 29, 19, 30), '29/05/2020 19:30'),
(datetime.datetime(2020, 5, 29, 0, 0), '2020/05/29')]
As we can see we managed to detect all the different date formats of the document and we have their corresponding datetime format.
Replace the different dates in a unique format
Let’s assume that we want to convert all the different date formats to a unique one such as yyyy/mm/dd
. Below we provide the example:
# First we will need to sort (reverse order) the dates that we want to replace sorted_detect_dates = sorted(detect_dates, key=lambda tup: len(tup[1]), reverse=True) clean_txt = my_txt for i in range(len(detect_dates)): clean_txt = re.sub(detect_dates[i][1], "{:%Y/%m/%d}".format(detect_dates[i][0]), clean_txt) clean_txt
'Today is 2020/05/29. In UK is written as 2020/05/29, or 2020/05/29 or you can even find it as 2020/05/29. Instead of / you can either use - or . like 2020/05/29 or 2020/05/29. In Greece, the date format is dd/mm/yy, which means that today is 2020/05/29 and again we can use dots or hyphen between date parts like 2020/05/29 or 2020/05/29. We can also add time, like 2020/05/29 19:30. Personally, my favorite date format is Y/mm/dd so, I would be happy to convert easily all these different dates to 2020/05/29'
Voila! we managed to convert the different date format to a unique one (yyyy/mm/dd) as we can see better below:
“Today is 2020/05/29. In UK is written as 2020/05/29, or 2020/05/29 or you can even find it as 2020/05/29. Instead of / you can either use – or . like 2020/05/29 or 2020/05/29. In Greece, the date format is dd/mm/yy, which means that today is 2020/05/29 and again we can use dots or hyphen between date parts like 2020/05/29 or 2020/05/29. We can also add time, like 2020/05/29 19:30. Personally, my favorite date format is Y/mm/dd so, I would be happy to convert easily all these different dates to 2020/05/29“