NLP using AWS APIs

Dataiku can leverage multiple AWS APIs to provide various NLP capabilities.

AWS Transcribe

The AWS Transcribe integration provides speech-to-text extraction in 40 languages.

This capability is provided by the “Amazon Transcribe” plugin, which you need to install. Please see Installing plugins.

Setup

You need to create AWS credentials with the necessary permissions using AWS Identity and Access Management (IAM). If you don’t have an IAM user yet, create one first.

Next, grant the user access to Amazon Transcribe by giving them privileges directly, or by assigning them to a group.

Make sure to take note of the Access key ID & Secret access key which will appear after creation. Once you have an IAM user with the necessary privileges, you just need to provide Dataiku with the credentials.
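If you prefer to script this part of the setup, the same steps can be done with boto3. Below is a minimal sketch, assuming a pre-existing IAM user named dataiku-transcribe and the AmazonTranscribeFullAccess managed policy; adapt both names to your account.

    import boto3

    iam = boto3.client("iam")

    # Grant the user access to Amazon Transcribe via a managed policy
    # (assumption: AmazonTranscribeFullAccess fits your security requirements).
    iam.attach_user_policy(
        UserName="dataiku-transcribe",  # hypothetical user name
        PolicyArn="arn:aws:iam::aws:policy/AmazonTranscribeFullAccess",
    )

    # Create the credentials to enter in the Dataiku preset.
    key = iam.create_access_key(UserName="dataiku-transcribe")["AccessKey"]
    print("Access key ID:    ", key["AccessKeyId"])
    print("Secret access key:", key["SecretAccessKey"])  # shown only once, store it safely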

In Dataiku, navigate to Plugins > Settings > API and create a preset with the credentials.

The default Concurrency parameter means that 4 calls to the API happen in parallel. This parallelization operates within the API Quota settings defined above. We do not recommend changing this default value.

The default Maximum Attempts means that if an API request fails, it will be retried up to 3 more times. The request is retried regardless of why it failed (e.g. an access error with your AWS account or a throttling exception due to too many concurrent requests). Note that, depending on the nature of the error, AWS may charge you for each additional attempt.
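As an illustration of this behaviour, a retry loop equivalent to 3 retries could look like the sketch below. The exponential backoff is an assumption made for the example; it is not necessarily the plugin’s exact wait strategy.

    import time

    MAX_ATTEMPTS = 4  # one initial try plus the 3 retries described above

    def call_with_retries(api_call, *args, **kwargs):
        """Retry an API call up to MAX_ATTEMPTS times, whatever the error."""
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                return api_call(*args, **kwargs)
            except Exception:  # access errors and throttling alike are retried
                if attempt == MAX_ATTEMPTS:
                    raise
                time.sleep(2 ** attempt)  # backoff strategy assumed for illustration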

Usage

Let’s assume that you have installed this plugin and that you have a Dataiku project with a folder hosted on an S3 bucket containing the audio files to transcribe.

  • The remote folder containing the audio files to process must be hosted on an S3 bucket in the same AWS account as Amazon Transcribe.

  • Only files in FLAC, MP3, MP4, Ogg, WebM, AMR, or WAV format are taken into account.

  • Each file must be less than 4 hours long and less than 2 GB in size (500 MB for call analytics jobs); see the validation sketch below.
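If you want a quick sanity check before running the recipe, the format and size constraints above can be validated with a few lines of Python. This is a minimal sketch, assuming the audio files are accessible on a local filesystem; the duration limit is not checked here.

    import os

    SUPPORTED = {".flac", ".mp3", ".mp4", ".ogg", ".webm", ".amr", ".wav"}
    MAX_BYTES = 2 * 1024**3  # 2 GB (use 500 * 1024**2 for call analytics jobs)

    def is_valid_audio(path: str) -> bool:
        """Check the format and size constraints listed above."""
        ext = os.path.splitext(path)[1].lower()
        return ext in SUPPORTED and os.path.getsize(path) < MAX_BYTES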

To create your transcribe recipe:

  1. Navigate to the Flow, click on the + Recipe button.

  2. Access the Natural Language Processing menu.

  3. If your folder is selected, you can directly find the recipe in the right panel.

Settings

  • Review INPUT parameters

    • The Language parameter is the original language of the audio files. If you would like the Transcribe API to infer the original language, select the Auto-detect option.

    Tip

    Find the available languages here.

    • Check the Display JSON checkbox if you want a column with the raw JSON results of the transcription.

    • The Timeout parameter is the maximum time to wait for an audio file to be transcribed. If a job takes longer than that, its result will not be shown in the dataset; however, the JSON file will still appear in the output folder. Leave it empty if you don’t want a timeout (see the polling sketch after this list).

  • Review CONFIGURATION parameters.

The Preset parameter is automatically filled by the default one made available by your Dataiku admin. You may select another one if multiple presets have been created.
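Under the hood, these settings map onto Amazon Transcribe’s job API. The sketch below shows the equivalent boto3 calls, assuming a bucket named my-audio-bucket; the job name and timeout value are illustrative, not the plugin’s actual code.

    import time
    import boto3

    transcribe = boto3.client("transcribe")

    # Start a job on one S3 audio file; IdentifyLanguage mirrors the Auto-detect option.
    transcribe.start_transcription_job(
        TranscriptionJobName="demo-job-001",  # hypothetical job name
        Media={"MediaFileUri": "s3://my-audio-bucket/interview.mp3"},
        MediaFormat="mp3",
        IdentifyLanguage=True,  # or pass LanguageCode="en-US" instead
    )

    # Poll until completion, mirroring the Timeout parameter described above.
    deadline = time.time() + 600  # 10-minute timeout, an arbitrary example value
    while time.time() < deadline:
        job = transcribe.get_transcription_job(TranscriptionJobName="demo-job-001")
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(10)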

Output

  • Dataset with text transcribed from the audio files

The columns of the output dataset are as follows:

  • path: Path to the audio file in the S3 bucket

  • job_name: Name identifying the job in Amazon Transcribe

  • transcript: Transcript of the audio file

  • language: Language detected or set by the user

  • language_code: Language code detected or set by the user

  • json (optional): Raw API response in JSON form

  • output_error_type: The error type, in case an error occurs

  • output_error_message: The error message, in case an error occurs

  • (Optional) Output folder to put the JSON results from Amazon Transcribe

    • Remote folder hosted in an AWS S3 bucket. Amazon Transcribe writes the JSON results to this folder when the jobs are done; the plugin then reads them from this folder to build the output dataset.

    • This output folder is optional: if you do not give the plugin an output folder, the results are written to the input folder. Make sure the folder has write permissions.
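If you need to post-process the JSON files written to this folder, the transcript sits under results.transcripts in Amazon Transcribe’s response format. Below is a minimal sketch using Dataiku’s Python API, assuming a managed folder with id output_folder and a result file named demo-job-001.json.

    import json
    import dataiku

    folder = dataiku.Folder("output_folder")  # hypothetical folder id

    # Read one JSON result written by Amazon Transcribe and extract the transcript.
    with folder.get_download_stream("demo-job-001.json") as stream:
        result = json.load(stream)

    print(result["results"]["transcripts"][0]["transcript"])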

AWS Comprehend

The AWS Comprehend integration provides:

  • Language detection

  • Sentiment analysis

  • Named entity recognition

  • Key phrase extraction

This capability is provided by the “Amazon Comprehend” plugin, which you need to install. Please see Installing plugins.

Setup

Create an IAM user with the Amazon Comprehend policy – in AWS

Let’s assume that your AWS account has already been created and that you have full admin access. If not, please follow this guide. Start by creating a dedicated IAM user to centralize access to the Comprehend API, or select an existing one. Next, you will need to attach a policy to this user following this documentation.

We recommend using the “ComprehendFullAccess” managed policy.

Alternatively, you can create a custom IAM policy to allow “comprehend:*” actions.
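If you go the custom-policy route, the sketch below shows the equivalent boto3 calls; the policy and user names are illustrative.

    import json
    import boto3

    iam = boto3.client("iam")

    # Custom policy allowing all Comprehend actions, as described above.
    policy = iam.create_policy(
        PolicyName="DataikuComprehendAccess",  # hypothetical policy name
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow", "Action": "comprehend:*", "Resource": "*"}],
        }),
    )

    iam.attach_user_policy(
        UserName="dataiku-comprehend",  # hypothetical user name
        PolicyArn=policy["Policy"]["Arn"],
    )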

Create an API configuration preset – in Dataiku DSS

In Dataiku, navigate to Plugins > Settings > API and create a preset with the credentials.

(Optional) Review the API QUOTA and PARALLELIZATION settings

  • The default API Quota settings ensure that one recipe calling the API will be throttled at 25 requests (Rate limit parameter) per second (Period parameter). In other words, after sending 25 requests, it will wait for 1 second, then send another 25, etc.

  • By default, each request to the API contains a batch of 10 documents (Batch size parameter). Combined with the previous settings, it means that it will send 25 * 10 = 250 rows to the API every second.

  • This default quota is defined by Amazon. You can request a quota increase, as documented on this page.

  • You may need to decrease the Rate limit parameter if you envision that multiple recipes will run concurrently to call the API. For instance, if you want to allow 5 concurrent DSS activities, you can set this parameter at 25/5 = 5 requests per second.

  • The default Concurrency parameter means that 4 calls to the API happen in parallel. This parallelization operates within the API Quota settings defined above. We do not recommend changing this default value.
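To make the quota arithmetic concrete, the sketch below throttles batched requests at 25 per second, 10 documents each, i.e. 250 rows per second. It illustrates the math above; it is not the plugin’s implementation.

    import time

    RATE_LIMIT = 25  # requests per period (Rate limit parameter)
    PERIOD = 1.0     # seconds (Period parameter)
    BATCH_SIZE = 10  # documents per request (Batch size parameter)

    def send_in_batches(rows, call_api):
        """Send rows in batches, capped at RATE_LIMIT requests per PERIOD."""
        sent, window_start = 0, time.time()
        for i in range(0, len(rows), BATCH_SIZE):
            if sent == RATE_LIMIT:
                time.sleep(max(0.0, PERIOD - (time.time() - window_start)))
                sent, window_start = 0, time.time()
            call_api(rows[i:i + BATCH_SIZE])
            sent += 1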

Usage

Let’s assume that you have a Dataiku DSS project with a dataset containing text data. This text data must be stored in a dataset, inside a text column, with one row for each document.

As an example, we will use the Amazon Review dataset for instant videos. You can follow the same steps with your own data.

To create your first recipe, navigate to the Flow, click on the + RECIPE button and access the Natural Language Processing menu. If your dataset is selected, you can directly find the recipe on the right panel.

Language Detection

Input

Dataset with a text column

Output

Dataset with 5 additional columns

  • Language code from the API in ISO 639 format.

  • Confidence score of the API from 0 to 1.

  • Raw response from the API in JSON format.

  • Error message from the API if any.

  • Error type (module and class name) if any.

Settings

  • Fill INPUT PARAMETERS

    • Specify the Text column parameter for your column containing text data.

  • Review CONFIGURATION parameters

    • The API configuration preset parameter is automatically filled by the default one made available by your Dataiku admin.

    • You may select another one if multiple presets have been created.

  • (Optional) Review ADVANCED parameters

    • You can activate the Expert mode to access advanced parameters.

    • The Error handling parameter determines how the recipe will behave if the API returns an error.

      • In “Log” error handling, this error will be logged to the output but it will not cause the recipe to fail.

      • We do not recommend changing this parameter to “Fail” mode unless that is the desired behaviour.
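Behind this recipe is Amazon Comprehend’s dominant-language detection. The sketch below shows one batched boto3 call and how the response maps onto the output columns above; the sample texts are illustrative.

    import boto3

    comprehend = boto3.client("comprehend")
    texts = ["This video was great!", "Ce film était excellent."]  # one row per document

    # One batched request (the plugin sends up to 10 documents per request).
    response = comprehend.batch_detect_dominant_language(TextList=texts)

    for item in response["ResultList"]:
        best = max(item["Languages"], key=lambda lang: lang["Score"])
        print(texts[item["Index"]], "->", best["LanguageCode"], round(best["Score"], 3))

    # Per-document failures come back in ErrorList, matching the error columns above.
    for err in response["ErrorList"]:
        print("Row", err["Index"], err["ErrorCode"], err["ErrorMessage"])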

Sentiment Analysis

Input

Dataset with a text column

Output

Dataset with 8 additional columns

  • Sentiment prediction from the API (POSITIVE/NEUTRAL/NEGATIVE/MIXED).

  • Confidence score in the POSITIVE prediction from 0 to 1.

  • Confidence score in the NEUTRAL prediction from 0 to 1.

  • Confidence score in the NEGATIVE prediction from 0 to 1.

  • Confidence score in the MIXED prediction from 0 to 1.

  • Raw response from the API in JSON format.

  • Error message from the API if any.

  • Error type (module and class name) if any.

Settings

The parameters are almost exactly the same as the Language Detection recipe (see above).

The only change is the addition of Language parameters. By default, we assume the Text column is in English. You can change it to any of the supported languages listed here or choose “Detected language column” if you have multiple languages. In this case, you will need to reuse the language code column computed by the Language Detection recipe.
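The underlying call is Amazon Comprehend’s batched sentiment detection. A minimal boto3 sketch, with illustrative texts and English as the language:

    import boto3

    comprehend = boto3.client("comprehend")
    texts = ["I loved this instant video.", "Terrible quality, would not recommend."]

    # LanguageCode mirrors the Language parameter (English here).
    response = comprehend.batch_detect_sentiment(TextList=texts, LanguageCode="en")

    for item in response["ResultList"]:
        scores = item["SentimentScore"]  # Positive/Neutral/Negative/Mixed, each 0 to 1
        print(texts[item["Index"]], "->", item["Sentiment"], round(scores["Positive"], 3))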

Named Entity Recognition

Input

  • Dataset with a text column

Output

Dataset with additional columns

  • One column for each selected entity type, with a list of entities.

  • Raw response from the API in JSON format.

  • Error message from the API if any.

  • Error type (module and class name) if any.

Settings

The parameters under INPUT PARAMETERS and CONFIGURATION are almost the same as the Sentiment Analysis recipe.

The one addition is the Entity types parameter: select multiple entity types from this list. Under ADVANCED, with Expert mode activated, you have access to an additional Minimum score parameter: increase it from 0 to 1 to filter out results which are not relevant. The default is 0, so that no filtering is applied.
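The recipe wraps Amazon Comprehend’s batched entity detection; the Minimum score parameter corresponds to filtering on each entity’s confidence score. A minimal boto3 sketch, with an illustrative text and threshold:

    import boto3

    comprehend = boto3.client("comprehend")
    texts = ["Jeff Bezos founded Amazon in Seattle in 1994."]
    MIN_SCORE = 0.5  # mirrors the Minimum score parameter (0 = no filtering)

    response = comprehend.batch_detect_entities(TextList=texts, LanguageCode="en")

    for item in response["ResultList"]:
        for entity in item["Entities"]:
            if entity["Score"] >= MIN_SCORE:  # drop low-confidence results
                print(entity["Type"], "->", entity["Text"], round(entity["Score"], 3))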

Key Phrase Extraction

Input

Dataset with a text column

Output

Dataset with additional columns

  • Two columns for each key phrase ordered by confidence (see Number of key phrases parameter).

    • Key phrase (1-4 words from the input text).

    • Confidence score in the key phrase.

  • Raw response from the API in JSON format.

  • Error message from the API if any.

  • Error type (module and class name) if any.

Settings

The parameters under INPUT PARAMETERS and CONFIGURATION are almost the same as the Sentiment Analysis recipe (see above). The one addition is the Number of key phrases parameter: how many key phrases to extract, in decreasing order of confidence score. The default value extracts the top 3 key phrases.
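The recipe wraps Amazon Comprehend’s batched key phrase detection and keeps the highest-scoring phrases. A minimal boto3 sketch, with an illustrative text and the default top 3:

    import boto3

    comprehend = boto3.client("comprehend")
    texts = ["The battery life is excellent but the screen scratches easily."]
    TOP_N = 3  # mirrors the Number of key phrases parameter (default: top 3)

    response = comprehend.batch_detect_key_phrases(TextList=texts, LanguageCode="en")

    for item in response["ResultList"]:
        phrases = sorted(item["KeyPhrases"], key=lambda p: p["Score"], reverse=True)
        for phrase in phrases[:TOP_N]:
            print(phrase["Text"], round(phrase["Score"], 3))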

Visualization

Thanks to the output datasets produced by the plugin, you can create Charts to analyze results from the API. For instance, you can:

  • Filter documents to focus on one language.

  • Analyze the distribution of sentiment scores.

  • Identify which entities are mentioned.

  • Understand which key phrases reviewers use.

After crafting these charts, you can share them with business users in a Dashboard.

AWS Translation

The AWS Translation integration provides translation in 71 languages.

This capability is provided by the “Amazon Translation” plugin, which you need to install.

Please see our Plugin documentation page for more details.
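For reference, the underlying service is Amazon Translate. A minimal boto3 sketch, with an illustrative sentence; passing "auto" as the source language lets the service detect the input language.

    import boto3

    translate = boto3.client("translate")

    result = translate.translate_text(
        Text="Ce produit est formidable.",
        SourceLanguageCode="auto",  # let the service detect the input language
        TargetLanguageCode="en",
    )
    print(result["SourceLanguageCode"], "->", result["TranslatedText"])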

AWS Comprehend Medical

The AWS Comprehend Medical integration provides Protected Health Information extraction and medical entity recognition in English.

Note

This capability is provided by the “Amazon Comprehend Medical” plugin, which you need to install. Please see Installing plugins.

This plugin is not supported.

Warning

Costs for the AWS Comprehend Medical API are significantly higher than for other APIs.

Please see our Plugin documentation page for more details.
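For reference, the underlying service is Amazon Comprehend Medical. A minimal boto3 sketch of the two capabilities, with an illustrative clinical note; keep the cost warning above in mind before running anything like this at scale.

    import boto3

    medical = boto3.client("comprehendmedical")
    text = "Patient John Doe, 45, was prescribed 20mg of lisinopril for hypertension."

    # Protected Health Information (PHI) extraction.
    for entity in medical.detect_phi(Text=text)["Entities"]:
        print("PHI:", entity["Type"], "->", entity["Text"])

    # Medical entity recognition.
    for entity in medical.detect_entities_v2(Text=text)["Entities"]:
        print(entity["Category"], "/", entity["Type"], "->", entity["Text"])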