Publication Date Transformation

As of version 2.8.0, multiFEED allows you to transform invalid date strings from feeds into valid RFC3339 formatted dates so multiFEED can read them correctly. Doing this requires some skill writing regular expressions.


There are three parts to an RFC3339 date string: the date, the time, and either 'Z' or a UTC offset.


A properly formatted RFC3339 date looks like this...


<date>T<time>Z


or...


<date>T<time><offset>


The date should include either 'Z', or an offset, not both. Here are a few real examples...



In that last example the offset is +0000, which is the same as if 'Z' were used instead. Either one is valid. Also note that there is no 'T' between the date and time portions of the date string. Although it is traditional to place the 'T' there, it is valid to use a space instead.
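The equivalence of 'Z' and a +0000 offset, and of 'T' and a space, can be checked outside multiFEED. The following is a minimal sketch in Python, not multiFEED's own parser; the helper name `parse` and the 'Z' sample string are assumptions made for illustration, and it relies on Python 3.7 or later, where `%z` accepts both 'Z' and numeric offsets.

```python
from datetime import datetime


def parse(text: str) -> datetime:
    """Parse a date string that uses either 'T' or a space as the separator."""
    # Hypothetical helper: normalize a space separator to 'T' before parsing.
    normalized = text.replace(" ", "T", 1)
    # %z accepts 'Z' as well as offsets such as +0000 (Python 3.7+).
    return datetime.strptime(normalized, "%Y-%m-%dT%H:%M:%S%z")


with_z = parse("2018-04-04T16:00:00Z")
with_offset = parse("2018-04-04 16:00:00+0000")

# Both strings describe the same instant.
print(with_z == with_offset)  # True
```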


To transform an invalid date you must write a regular expression which captures the various parts of an RFC3339 date from the date string in the feed, and then paste them all back together into a valid RFC3339 date string. Here's a real-life example from an actual published feed...


For some reason, the Let's Encrypt RSS feed suddenly changed the format of its article dates from a valid RFC822 format to one that is neither RFC822 nor RFC3339, causing multiFEED to start reading the article dates incorrectly. A sample of this invalid date format is...


2018-04-04 16:00:00 +0000 UTC


As you can see, it is similar to RFC3339, but there is an extra space between the time and offset fields and the extra characters ' UTC' tacked on the end. Unfortunately this is enough of a difference that the date parser in multiFEED is unable to figure out what date and time it is describing. The first task is writing the regular expression to extract the required parts for the transformation. Regular expressions are very powerful and there are often many ways to accomplish the same task, but this one will work...


(\d{4}-\d\d-\d\d)\s+(\d\d:\d\d:\d\d)\s+([-\+]\d{4})\s+UTC


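This is not a tutorial on regular expressions, but here is a simplified explanation of what this expression does...


\d matches a single digit, and \d{4} matches exactly four digits.
\s+ matches one or more whitespace characters.
[-\+] matches either a minus sign or a plus sign.
Parentheses mark a group of characters to be captured.
UTC matches those three literal characters.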



Running this regular expression against the invalid date grabs three substrings: the date, the time, and the offset (with sign). The rest of the invalid date string is discarded. Now let's write the transformation string to recombine all three parts into a valid RFC3339 date...
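To see the three captures for yourself, the expression can be tried in any regular expression tool before entering it into multiFEED. The sketch below uses Python's `re` module, whose syntax may differ in places from the dialect multiFEED understands; the time group is written to match colon-separated digits, as in the sample date.

```python
import re

# Pattern for the Let's Encrypt sample; the time group matches HH:MM:SS.
PATTERN = re.compile(
    r"(\d{4}-\d\d-\d\d)\s+(\d\d:\d\d:\d\d)\s+([-\+]\d{4})\s+UTC"
)

match = PATTERN.search("2018-04-04 16:00:00 +0000 UTC")
date_part, time_part, offset_part = match.groups()

print(date_part)    # 2018-04-04
print(time_part)    # 16:00:00
print(offset_part)  # +0000
```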


\1T\2\3


That's it! This transformation string outputs the '2018-04-04' followed by a 'T' and then the '16:00:00' immediately followed by the '+0000' offset...


2018-04-04T16:00:00+0000


... which is a valid RFC3339 date format. Entering this regular expression and transformation string into multiFEED allows it to successfully read the date and time of articles in the Let's Encrypt feed. This example was fairly simple since the invalid date was quite close to a valid format, but regular expressions are powerful enough to correctly transform even radically different date/time formats.
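The whole transformation can be rehearsed as a single substitution. This is a sketch in Python under the assumption that its `re.sub` behaves like multiFEED's transformation for this case; the function name `transform_date` is hypothetical, and leaving non-matching input unchanged is Python's behavior, not a documented multiFEED behavior.

```python
import re

# Expression and transformation string for the Let's Encrypt sample.
EXPRESSION = r"(\d{4}-\d\d-\d\d)\s+(\d\d:\d\d:\d\d)\s+([-\+]\d{4})\s+UTC"
TRANSFORMATION = r"\1T\2\3"


def transform_date(raw: str) -> str:
    """Rewrite a matching date string; return other input unchanged."""
    return re.sub(EXPRESSION, TRANSFORMATION, raw)


print(transform_date("2018-04-04 16:00:00 +0000 UTC"))
# 2018-04-04T16:00:00+0000
```

Testing the pair this way before entering it into multiFEED makes it easier to spot a group that fails to match, because the input then comes back unmodified.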


The substrings captured by the regular expression are referenced in the transformation string with a backslash and the ordinal number of the capture, counting from left to right. For instance, '\5' refers to the fifth substring captured by the regular expression. As noted, parentheses are used to indicate the groups of characters to be captured. For more information on the syntax of regular expressions understood by multiFEED, see here.
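Because captures are numbered by position, the transformation string can also reorder them. The sketch below uses a made-up day-first format (not taken from any real feed) and assumes the feed's times are in UTC, so a literal 'Z' is appended; it is written in Python, whose backreference syntax matches the '\1' style described here.

```python
import re

# Hypothetical feed format: DD.MM.YYYY HH:MM:SS, assumed to be UTC.
EXPRESSION = r"(\d\d)\.(\d\d)\.(\d{4})\s+(\d\d:\d\d:\d\d)"
# \3 is the year, \2 the month, \1 the day, and \4 the time.
TRANSFORMATION = r"\3-\2-\1T\4Z"

result = re.sub(EXPRESSION, TRANSFORMATION, "25.12.2018 09:30:00")
print(result)  # 2018-12-25T09:30:00Z
```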