Data Preparation provides several components for easy management, transformation and normalization of dates.
The Date meaning is the main meaning for representing dates. A valid Date is a non-ambiguous representation of a given moment.
As such, the Date meaning only recognizes as valid two standard formats:
- ISO-8601 with timezone indicator : for example 2013-05-30T15:16:13.764Z or 2013-05-30T15:16:13.764+0200
- RFC 822 : for example Tue, 22 Jan 2013 12:14:33 Z
Both these formats include timezone information and are therefore non-ambiguous. Internally, the Date type handles all dates in UTC.
All other formats need to be parsed into the Date type
When it is recognized during the preparation that a given column probably contains dates, which can be parsed to create new columns in the Date meaning, it uses the “Date (needs parsing)” meaning.
Only the Date meaning is mapped to the date storage type.
Parsing is the operation of converting a column with strings that contains a date written using a non-standard or non-ambiguous format into a column in the Date meaning.
When a column has been recognized as a Date (needs parsing), a “Parse date” operation will be automatically suggested.
Clicking it brings up the “Smart Date” feature. Smart Date automatically detects the probable format in which the Date (needs parsing) column is written, and generates the proper parsing processor.
Once Smart date shows the possible formats :
- Select the proper format
- Verify using the automatically generated examples that the parsing corresponds to what you expected.
- Click OK
- A new Date parser processor is generated and your new Date column appears.
Sometimes, Smart Date cannot automatically recognize your format. In that case, you need to use the “Custom format” tab of the Smart Date box.
As you type your format, using the Java DateFormat syntax (LINK LINK), Data Science Studio checks your syntax in real time and indicates whether your pattern properly parses the source data, whether it parses it partially, and it displays the result of the parsing in real time
Some formats of dates include internationalizable elements like month names or day names.
By default, the Date Parser automatically parses these elements in both french and english. You can also force a specific locale for these internationalized elements.
The « Date » format generated by the Date parser processor always includes the complete timezone information.
Some date formats natively include timezone information (like +0200, CEST, UTC, …). These are denoted by the Z character in the date parsing pattern. For these formats, the date parser will use the timezone information to generate the proper « Date » column.
However, most formats (for example : 2013/04/17-13:23:32) do not include timezone information. For these formats, the Date parser needs to know in which timezone this value is. The date parser has three ways to determine the time zone for a given row :
Using a static value (like « UTC » or « Europe/Paris ») : you indicate that all your rows are at a given timezone, and Date parser will use this. This is useful for example, for timestamps in server log files, where all servers are at the same timezone.
Using a timezone column. If your row contains a column with timezone information, Date parser can use it directly. This allows you to have a different timezone per row. You need to configure which column contains the timezone information.
If for a given record, the timezone information is invalid, the Date parser does not output a date for this row
Using an IP address column. If your row contains an IP address, and you know that the timezone of the row is the timezone of the IP address (for example, a client-generated timestamp in a web browsing log). The Date parser will automatically geolocate this IP address and use the timezone of the detected location. You need to configure which column contains the IP address.
If the Date Parser cannot geolocate the IP, it does not output a date for this row
Columns that contain a UNIX timestamp are handled separately. You do not parse them using the Date Parser / Smart Date combination. Instead, use the dedicated processor.
UNIX timestamps can come into two variants : in seconds since Epoch or in milliseconds since Epoch.
You need to indicate which format your column is in.
UNIX timestamps are always expressed in UTC.
Once you have a column in proper non ambiguous format, with the « Date » type, you can perform various operations on this column:
- Extract some components fo the date (year, month, day, week, day of week, ..) into separate columns
- Reformat the date in another format
- Flag rows based on configurable calendars.
- View facets based on time divisions
- Use it in Visualize and break down by various time ranges
This processor allows you to easily extract components from the date into separate columns. For example, you could create a column with the day of the week for each row. The day of the week is generally a very good feature for machine learning.
The components that can be extracted from a Date column are :
- Month (01 = January, 12 = December)
- Week of year
- Day of week (1 = Monday, 7 = Sunday)
In the Date column you are extracting from, Data Science Studio performs all computations in UTC. However, it is often useful to extract information in a different timezone.
For example, imagine that you are processing web log files containing page events coming from all around the world. The Date column that indicates the timestamp of the event is always aligned on the UTC timezone. What we want to know is at what time of the day the most events happen. However, we want this information in the local timezone of the client that generated this event. For example, we might want to know the proportion of events that happen on the morning for the client rather than for the server
To help you with that, while extracting components, Data Science Studio can «realign » them on a different timezone.
Like for the Date Parser, this timezone can be specified using 3 different ways:
- Using a static value (like « UTC » or « Europe/Paris »). All components for all rows will be output on this timezone.
- Using a timezone column. If your row contains a column with timezone information, the extractor can use it directly. This allows you to have a different timezone per row. You need to configure which column contains the timezone information.
- Using an IP address column. If your row contains an IP address, and you know that the timezone of the row is the timezone of the IP address (for example, a client-generated timestamp in a web browsing log). The extractor will automatically geolocate this IP address and use the timezone of the detected location. You need to configure which column contains the IP address.i
We have a web log file with a Date, in UTC, which is the date of the hit on the server, and with the IP of the client
If we extract the day and the hour using UTC timezone, we get them for the server UTC timezone. This does not tell us at what time of their days customers come on the website
If we use the IP column as timezone source, Data Science Studio geolocates each IP, and uses the timezone of the IP to automatically translate the date components in the local timezone.
- The hits at 11pm UTC in France are actually at 1am local time (GMT + 2 due to DST)
- The hits at 12am UTC in the US are actually at 8pm local time (GMT - 4 due to DST)
The Date formatter processor allows you to recreate a Date as a « human-readable » string. Like the components extractor, the date formatter allows you to realign dates on a local timezone.
The format of the Date Formatter must be specified using the Java DateFormat specification.
The formatter can output dates in french or english.