Pig is a tool in the Hadoop ecosystem for writing data processing scripts that harness the power of MapReduce over large amounts of HDFS data. For more information, see http://pig.apache.org
Data Science Studio provides an advanced integration with Pig. You can write Pig recipes to compute new HDFS datasets.
Prior to writing Pig recipes, you need to ensure that DSS and Hadoop are properly configured together. See Setting up Hadoop integration.
You should have at least a basic knowledge of Pig Latin, the scripting language that Pig uses.
Constraints on the datasets
Pig recipes can only use HDFS datasets as inputs and outputs. If the data that you want to process does not currently reside in an HDFS dataset, use the Sync recipe to copy the data to HDFS.
Not all file formats are usable in Pig recipes. The following formats are supported:
- CSV: “Quoting style” must be set to “no escaping nor quoting”.
- Avro
- Parquet: the “Pig flavor” option must be set in the input dataset, especially if complex types are used. For more information about the interaction between Parquet and DSS, see Parquet.
For CSV datasets, this quoting style cannot represent all data: it does not support having the separator as part of a field. For more information on CSV quoting styles, see Delimiter-separated values (CSV / TSV).
If your fields need to contain the separator character, consider using another format or another separator.
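The limitation can be illustrated with a small sketch in plain Pig Latin. The file name, the comma separator, and the column names are hypothetical; the point is only that an unquoted format has no way to tell a separator inside a value from a separator between fields.

```pig
-- people.csv is a hypothetical file written with ',' as separator
-- and no quoting or escaping. One of its lines is:
--   1,Smith, John,42
-- The intended fields are: id = 1, name = "Smith, John", age = 42.
people = LOAD 'people.csv' USING PigStorage(',')
         AS (id:int, name:chararray, age:int);
-- PigStorage splits on every ',' so this line is cut into four pieces.
-- The name is split in two and the columns after it are shifted.
```

Switching to a separator that never occurs in the data (a tab, for instance) or to Avro or Parquet avoids the problem.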
If your input datasets are not already using “no escaping nor quoting”, you can use a Sync recipe to copy them to another dataset that Pig can process. Mind the above warning.
Both Avro and Parquet are advanced formats that support complex types (map, array, and object). However, the Pig object model cannot be fully mapped to all those types. In particular, only arrays of objects are supported, since Pig only supports bags of tuples.
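As a rough sketch of what “bag of tuples” means on the Pig side, the following plain Pig Latin declares a relation whose second field is a bag of tuples and then un-nests it. The relation name, the path, and the field names are hypothetical, and the schema is written out by hand here only to make the shape visible.

```pig
-- Hypothetical relation: each order carries a bag of (sku, qty) tuples.
-- An array of objects is the shape that corresponds to such a bag.
orders = LOAD 'orders' AS (
    customer_id:chararray,
    items:bag{t:tuple(sku:chararray, qty:int)}
);

-- FLATTEN un-nests the bag: one output row per tuple in items.
order_lines = FOREACH orders GENERATE customer_id, FLATTEN(items) AS (sku, qty);
```

An array of plain values (strings or numbers rather than objects) has no direct equivalent in this model, which is why it is not supported.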
Note that if you create output datasets directly in the Pig recipe editor using the “New managed dataset” option, they are automatically created with the proper format and CSV quoting style.
The datasets must have a schema. Column names must only contain alphanumerical characters and underscores. If your column names are more complex, consider using a data preparation recipe first to normalize them.
Creating your first Pig recipe
- Create a new Pig recipe
- Select the input datasets. Only HDFS datasets that have a compatible set of parameters will be proposed.
- Select the datasets that will store the results of the Pig script. You can use existing not-yet-connected HDFS datasets or create new managed datasets (which can only be stored on HDFS). If you create a new managed dataset and your input is partitioned, it is recommended to use the “Copy partitioning” option.
- Do not set a schema for your output datasets. The Pig recipe editor can automatically fill the schema by analyzing the Pig script.
You can now write your Pig script.
Data Science Studio provides an advanced IDE for writing Pig scripts. It gives you:
- Syntax highlighting
- Automatic syntax and consistency validation
- The “Pig” relations explorer tab, which automatically gives you the detailed Pig schema of all relations defined in your script.
DKULOAD and DKUSTORE
The Pig script that you write should read data from the recipe’s input datasets and write data to the recipe’s output datasets.
When writing Pig scripts by hand, you would use the LOAD and STORE Pig Latin commands and take care of the following yourself:
- Writing the absolute HDFS paths to data
- Selecting the proper input partitions and writing in the proper output partitions
- Defining the storage parameters
- Defining the whole schema of all datasets
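To make these four points concrete, here is a minimal sketch of what the manual approach can look like in plain Pig Latin. The HDFS paths, the date-based partition folder, the tab separator, and the column names are all hypothetical and chosen only for illustration.

```pig
-- Manual approach: path, partition, storage and schema are all spelled out.
-- '/data/d1/2015-01-15' is a hypothetical absolute path that includes
-- the partition folder to read.
raw = LOAD '/data/d1/2015-01-15'
      USING PigStorage('\t')
      AS (user_id:chararray, amount:double);

grouped = GROUP raw BY user_id;
totals = FOREACH grouped GENERATE group AS user_id, SUM(raw.amount) AS total;

-- The output path and partition must be kept in sync with the input by hand.
STORE totals INTO '/data/d2/2015-01-15' USING PigStorage('\t');
```

Every one of these details has to be updated manually whenever the dataset location, partitioning, format, or schema changes.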
Data Science Studio greatly simplifies these steps using the DKULOAD and DKUSTORE macros.
If you have a Pig recipe that takes dataset “d1” as input and writes dataset “d2” as output, you can write in your script:

    myrelation = DKULOAD 'd1';
    [...]
    result = FOREACH ...
    DKUSTORE result INTO 'd2';
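As a fuller illustration of the same pattern, the sketch below aggregates one input dataset into one output dataset. It assumes, purely for the example, that “d1” has columns named user_id and amount; the relation names are arbitrary, and the DKULOAD and DKUSTORE syntax is the one shown above.

```pig
-- No path, storage clause or schema is written in the script:
-- the datasets are referred to by name only.
raw = DKULOAD 'd1';

-- Assumed columns of d1 (hypothetical): user_id, amount
grouped = GROUP raw BY user_id;
totals = FOREACH grouped GENERATE group AS user_id, SUM(raw.amount) AS total;

DKUSTORE totals INTO 'd2';
```

Relation names such as `input` and `output` are best avoided, since they are reserved keywords in Pig Latin.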
Validation, relation explorer and automatic schema
At any time while writing your Pig script, you can click the “Validate” button to perform a comprehensive validation of your script. The validate button performs all checks that Pig normally performs, like:
- Erroneous column names
- Wrong types
- Impossible projections
The errors will be shown in context (on the failing line), with a detailed error message.
The validation also updates the “Pig” tab that shows the schema of each relation. Use it to better understand what you currently have and which fields you should project.
When you validate, if the script contains STORE or DKUSTORE statements, Data Science Studio automatically checks the schema of each output dataset against the schema of the relation that is being stored.
If the schemas don’t match (which will always be the case when you validate for the first time after adding a STORE or DKUSTORE), Data Science Studio will explain the incompatibility and offer to automatically adjust the schemas of the output datasets.
For quicker validation, you can use the Ctrl + Enter keyboard shortcut.
Auto-completion can be requested any time by pressing Ctrl+Space. Note that auto-completion on relations and fields can only be performed after validating the script.