Delimiter-separated values (CSV / TSV)

“CSV” in DSS format covers a wide range of traditional formats, including comma-separated values (CSV) and tab-separated values (TSV).

Despite its apparent simplicity, there are subtleties in the DSV format. Data Science Studio reads the DSV according to the specifications you give, so if the file doesn’t conform to the spec, you may end up with misaligned, misplaced fields, or fields consisting of almost all the file (and out of memory issues).

Quoting and escaping styles

The main problem with DSV is the handling of special characters within the fields.

Several methods exist, and Data Science Studio supports all of the following.

Excel-style

Also known as RFC4180, the most common variant of DSV.

  • The quoting char is a double quote (")

  • If the separator char appears in a field, the field is enclosed in the quoting char

  • If a \n or \r char appears in a field, the field is also enclosed in the quoting char

  • If the quoting char appears in the field the quoting char itself is doubled

Example

With separator: ,

Input

a normal,line
"the,delimiter","in,a field"
"but ""also""","double ""quotes"", in a field"

Output:

a normal            line
the, delimiter      in, a field
but "also"          double "quotes", in a field

Unix-style

  • If the separator char appears in a field, the field is enclosed in a quoting char (generally a double quote)

  • If a \n or \r char appears in a field, the field is also enclosed in the quoting char

  • If the quoting char appears in the field, it is prefixed with the escape char (generally \)

  • If the escape char appears in a field, it is prefixed with itself

Example

With separator: ,, quoteChar: ", escapeChar: \

Input:

a normal,line
"the,delimiter","in,a field"
"but \"also\"","double \"quote\", in a field"
and \\ in a field, "is \"possible\" too

Output:

a normal            line
the, delimiter      in, a field
but "also"          double "quotes", in a field
and \ in a field    is "possible" too

Escaping only

  • If the separator char appears in a field, it is prefixed with the escape char (generally )

  • If \n or \r appears in a field, it is prefixed with the escape char

  • If the escape char appears in a field, it is prefixed with itself

In this format, the double quote is not a special character

Example

With separator: ,, escapeChar: \

Input:

a normal,line
the \,delimiter, in \,a field
but "also", double "quotes"\,in a field
and \\ in a field, is "possible" too

Output:

a normal            line
the, delimiter      in, a field
but "also"          double "quotes", in a field
and \ in a field    is "possible" too

No escaping, no quoting

  • The separator may not appear in a field

  • \n or \r may not appear in a field

In this format, the double quote is not a special character.

Input:

a normal, line
is the, only "possible" thing

Output:

a normal            line
is the              only "possible" thing

Usage in datasets

Data Science Studio tries to autodetect the quoting style when you create a new external dataset.

This will not always be able to detect the correct quoting style, as quoting/escaping chars might not appear in the first lines.

Make sure to select the correct quoting style for your dataset. Using wrong quoting styles can lead to huge number of columns being generated or values of arbitrarily large size. This may cause some memory issues.

Note

When using “no quoting, no escaping” style, it is highly recommended to have a very exotic separator like u0001

Note also that this style is not able to represent all data.

Usage in recipes

Recipe type

Restrictions

Hive

Escaping only and “no escaping no quoting” are the only supported styles

When creating a new managed dataset from the Hive recipe editor, it automatically gets “Escaping only ” style

Additionally, due to Hadoop limitations, files with newlines in fields cannot be used

Pig

The only supported style is “No escaping, no quoting”.

When creating a new managed dataset from the Pig recipe editor, it will automatically get the correct style.

Additionally, due to Hadoop limitations, files with newlines in fields cannot be used

Other

If “no escaping, no quoting” is used as output and either the delimiter, \n or \r appears in a field, the offending character is automatically replaced by a space