Finding all timestamps in a page using regular expressions

I'm trying to find the publication date of newspaper articles published online using Python, but each website structures its HTML differently, and the publication time in the page metadata isn't consistent between domains.
I've tried using the dateparser package, but it includes a relative dating system that incorrectly reads some words (like the string: 'a day') as relative times.
Is there a good list of regular expressions that someone knows of or can share, covering as many timestamp formats as possible, including support for reading timezones?
1 Answer
1
In general, no - this task is not possible because humans infer context that you are probably not taking into account.
Consider if your code encountered a string like 01/05/13. What date is that? Is it January 5th 2013? Or maybe it's May 1st 2013? Or May 13th 1801? A human reader might pick up on the contexts of localization and century of publication, but unless you supply them separately - computer code will not.
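To make the ambiguity concrete, here is a minimal sketch using only the standard library. The three format strings and the helper name `interpretations` are illustrative assumptions, not an exhaustive or recommended list; the point is only that one string parses cleanly under several conventions.

```python
from datetime import datetime

# Illustrative conventions only (an assumption); real pages may use others.
CANDIDATE_FORMATS = {
    "month/day/year": "%m/%d/%y",
    "day/month/year": "%d/%m/%y",
    "year/month/day": "%y/%m/%d",
}

def interpretations(raw):
    """Return every candidate reading of raw as an ISO date string."""
    results = {}
    for label, fmt in CANDIDATE_FORMATS.items():
        try:
            results[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            # This convention cannot produce a valid date for the string.
            continue
    return results

for label, iso in interpretations("01/05/13").items():
    print(f"{label}: {iso}")
```

All three formats succeed and give three different dates. Note also that `%y` silently resolves the two-digit year 01 to 2001, so the century is yet another guess the code makes on your behalf.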
Likewise, consider if your code encountered a string like 3.14. Is it March 14th? Or is it an approximation of the mathematical constant π? Without understanding the surrounding text, it's impossible to know.
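A short sketch of the same problem from the regex side. The pattern `NAIVE_MONTH_DAY` and both sample sentences are made up for illustration; a pattern broad enough to catch a dotted month.day form cannot tell the two uses apart.

```python
import re

# Hypothetical naive pattern: one or two digits, a dot, one or two digits.
NAIVE_MONTH_DAY = re.compile(r"\b\d{1,2}\.\d{1,2}\b")

samples = [
    "The festival was moved to 3.14 this year.",  # plausibly a date
    "Pi is approximately 3.14.",                  # not a date at all
]

# The regex sees only characters, so it matches both sentences identically.
matches = [NAIVE_MONTH_DAY.findall(text) for text in samples]
for text, found in zip(samples, matches):
    print(found, "<-", text)
```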
Just a suggestion: you might be better off maintaining separate, simple patterns for each website you want to handle. There are myriad ways to format timestamps even without considering internationalization; the more you broaden your search, the more false positives you’ll have, which brings you back to where you are already with dateparser. – Aankhen Jul 2 at 23:29
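One way the per-site suggestion could look, as a rough sketch. The domains, the regular expressions, the `SITE_RULES` table, and the `publication_date` helper are all hypothetical; each real site would need a rule written after inspecting its actual markup.

```python
import re
from datetime import datetime

# Hypothetical per-domain rules: (regex with one capture group, strptime format).
SITE_RULES = {
    "example-news.com": (
        re.compile(r'"datePublished":\s*"(\d{4}-\d{2}-\d{2})'),
        "%Y-%m-%d",
    ),
    "example-daily.org": (
        re.compile(r"Published:\s*(\d{2}/\d{2}/\d{4})"),
        "%d/%m/%Y",
    ),
}

def publication_date(domain, html):
    """Return the publication date for a known domain, or None."""
    rule = SITE_RULES.get(domain)
    if rule is None:
        return None  # unknown site: refuse to guess
    pattern, fmt = rule
    match = pattern.search(html)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), fmt).date()
    except ValueError:
        return None  # matched text is not a valid date in this format

print(publication_date("example-daily.org", "<p>Published: 01/05/2013</p>"))
```

Because each rule carries its own format string, the day/month order is decided once per site instead of being guessed per string, and unknown domains return None instead of a false positive.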