Avro

Avro is an efficient file format. Its main points are:

  • Compact, fast, binary data format.
  • Rich data structures (map, union, array, record and enum).
  • Very well suited for data exchange since the schema is stored along with the data (unlike CSV).

Applicability

Reading Avro files

Data Science Studio is fully capable of reading most Avro files. However, there are some fundamental difference between the Avro object model and ours. For instance, Data Science Studio cannot natively represent some Avro types, and some conversions are automatically applied.

  • Unions
    • In general, the union type is replaced by the most specific type which is able to represent all the unified types. For instance ["int","float","null"] will be converted to double. When no conversion is possible, DSS always falls back to string.
    • DSS perfectly recognizes the common idiom for nullable types in Avro ["null", "the_other_type"].
  • Enums
    • They are converted to string.

Reading Avro files / multiple versions

Schema evolution is unavoidable, and you may be in a situation where you have a large partitioned dataset composed by many Avro files of different version. In that case, you need to manually define the schema you want to read in the dataset’s settings.

Writing Avro files

Data Science Studio also supports writing Avro files. The output Avro schema is deduced from the dataset’s schema, and cannot be enforced. As a direct consequence of this, all types are always wrapped in a nullable union.

Limitations and issues

  • Date type is not (yet) supported by Avro. This feature is planned for Avro 1.8 (see the corresponding issue on the Avro bugtracker).
  • Avro support in Pig is based on the AvroStorage UDF which is distributed in the PiggyBank library. Unfortunately, it has some issues: for instance, a schema containing only one column isn’t correctly handled.
  • Avro 1.7 support in Hive is built-in since version 0.10. However, in Hive < 0.14, the Avro schema must be stored in a separate file on HDFS. Data Science Studio stores Avro schema files in the directory /user/$USER/dss_avro_schemas/.