Supported connections

Data Science Studio (DSS) can read and write data from a variety of sources.

Connectors

Here is a list of the available connectors in DSS.

Type | Read | Write
Filesystem | yes (see supported formats) | yes (see supported formats)
Hadoop HDFS | yes (see supported formats) | yes (see supported formats)
Amazon S3 | yes (see supported formats) | yes (see supported formats)
HTTP | yes (see supported formats) (data copied locally) | no
FTP | yes (see supported formats) (data copied locally when using Remote Datasets) | yes
SSH (SCP) | yes (see supported formats) (data copied locally) | no
MySQL | yes | yes
PostgreSQL | yes | yes
HP Vertica | yes | yes
Amazon Redshift | yes | yes
EMC Greenplum | yes | yes
Teradata | yes | yes
Other SQL databases (JDBC driver) | yes | not all data types
MongoDB | yes | yes
Cassandra | yes | yes
ElasticSearch | no | yes
Twitter (Streaming API) | yes | no
Generic APIs | Custom Python or R code | Custom Python or R code

File formats

Here is a list of the file formats that DSS can read and write for file-based connections (filesystem, HDFS, Amazon S3, HTTP, FTP, SSH).

Standard formats

Format | Read | Write
Delimited values (CSV, TSV, ...) | yes | yes
Fixed-width fields | yes | no
Excel (97-2007) | yes | only via export
Avro | yes | yes
Custom format using regular expression | yes | no
JSON | yes | no
MySQL Dump | yes | no
Apache Combined log format | yes | no
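
To give a sense of what a regular-expression-based format involves, here is a minimal Python sketch that splits one Apache combined log line into named fields. It is an illustration only and does not show how DSS parses these files; the pattern, the `parse_line` helper, and the field names are assumptions chosen for this example.

```python
import re

# Illustrative pattern for the Apache "combined" log format. The group names
# are hypothetical column names, not the ones DSS assigns.
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
row = parse_line(sample)
print(row["host"], row["status"], row["request"])
```

Each named capture group becomes one column of the resulting record, which is the general idea behind defining a custom format with a regular expression.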

Note that files in these formats can also be read when compressed with ZIP, GZIP, or BZ2.
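
As a rough illustration of what reading a compressed delimited file involves, the sketch below uses only the Python standard library. It is not DSS code: `read_delimited` is a hypothetical helper, and choosing the decompressor from the file extension and reading only the first member of a ZIP archive are assumptions made for this example.

```python
import bz2
import csv
import gzip
import io
import os
import tempfile
import zipfile


def read_delimited(path, delimiter=","):
    """Yield rows from a delimited file, decompressing based on its extension."""
    if path.endswith(".zip"):
        # Assumption: the archive holds a single data file, so read the first member.
        with zipfile.ZipFile(path) as archive:
            first = archive.namelist()[0]
            with archive.open(first) as member:
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                yield from csv.reader(text, delimiter=delimiter)
        return
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".bz2"):
        opener = bz2.open
    else:
        opener = open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter=delimiter)


# Small demonstration: write a gzip-compressed CSV, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8", newline="") as out:
        csv.writer(out).writerows([["id", "name"], ["1", "alice"]])
    rows = list(read_delimited(sample_path))

print(rows)
```

The caller sees the same rows whether or not the file is compressed, which is the behaviour the note describes for file-based datasets.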

Hadoop-specific formats

The following formats can be read and written on HDFS only.

Format | Read | Write
TextFile | yes | yes
Hive Sequence File | yes | yes
Hive RC File | yes | yes
Hive ORC File | yes | yes
Parquet | yes | yes