OpenAI GPT Text Completion

Dataiku can leverage the OpenAI GPT and ChatGPT APIs to perform natural language tasks based on prompts.

This uses the expressive power of GPTs to allow users to perform specialized or customized NLP tasks, by directly providing examples of what they want, and letting the GPT models perform the task.

Use case examples for this capability include:

  • Generating customized outreach campaigns based on customer characteristics and predicted risk profile

  • Extracting key concepts from text reviews without having to generate costly ontologies

  • Identifying common problems and root causes from field technician reports

  • Extracting information from unstructured data without developing costly custom extractors

  • Matching a city to the country it is in

In order to leverage this capability, you need to provide an OpenAI API key. You can then select the OpenAI model to use, among:

  • GPT 3

  • GPT 3.5 / ChatGPT

  • GPT 4

Note that the OpenAI API is a paid service. Check their API pricing page for more information.

You then provide a prompt, as well as examples.
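
For reference, the sketch below shows roughly what a single call to the OpenAI chat API looks like with such a prompt. This is not the plugin's internal code; the model name and the environment variable are assumptions based on the standard openai Python package (v1 and later).

    # Minimal sketch of a single OpenAI chat completion call.
    # Not the plugin's internal code; shown only to illustrate the underlying API.
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the default model used by the plugin preset
        messages=[
            {"role": "user", "content": "For each city in the document, give the name of "
                                        "the country it is in, as a comma-separated list.\n"
                                        "Document: Paris, Tokyo"},
        ],
    )
    print(response.choices[0].message.content)  # e.g. "France, Japan"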

Setup

This capability is provided by the “OpenAI GPT Text Completion” plugin, which you need to install. Please see Installing plugins. Note that this plugin is not supported.

Once the plugin is installed, an administrator needs to create a preset in order to use it.

First of all, you need to create an OpenAI account and retrieve an API key.

In DSS, go to the plugin page > settings > OpenAI API connection, click “Add preset”, and give the preset a name.

Enter your API key, and optionally, select the OpenAI model to use. By default, GPT 3.5 Turbo (aka ChatGPT) is used. This is the recommended model, as it provides excellent performance at a low price.

Make the preset available to everybody, or to specific groups of users.

Usage

Select the dataset containing the inputs, and click on the “OpenAI GPT Text completion” icon in the right panel. Create the output dataset as you usually do.

You then need to fill in at least the following fields; a sketch showing how these fields might be combined into a prompt follows the list.

  • Task: This is the task you want the model to perform. Refer to OpenAI’s API reference for example tasks the models can perform. For example “for each city in the document, give me the name of the country it is in. Give the answer as a comma-separated list”

  • Input text column: the column in the input dataset containing the input

  • Input description: a short description of what the input “is”. This helps the model. For example, “Text containing city names”

  • Output description: a short description of what the output “is”. This helps the model. For example, “Comma-separated country names”

  • Examples: Fill this with several examples of the kind of outputs you would like the model to generate. The more examples, the better the performance. However, more examples mean more tokens sent per request, incurring higher charges from OpenAI.
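
To make the role of these fields concrete, here is one plausible way they could be combined into a few-shot prompt. The plugin's actual prompt template is not documented here, so the build_prompt helper below is purely hypothetical.

    # Hypothetical sketch of how the recipe fields (task, input description,
    # output description, examples) might be assembled into a few-shot prompt.
    def build_prompt(task, input_desc, output_desc, examples, input_text):
        lines = [task, ""]
        for example_input, example_output in examples:
            lines.append(f"{input_desc}: {example_input}")
            lines.append(f"{output_desc}: {example_output}")
            lines.append("")
        lines.append(f"{input_desc}: {input_text}")
        lines.append(f"{output_desc}:")
        return "\n".join(lines)

    prompt = build_prompt(
        task=("For each city in the document, give me the name of the country it is in. "
              "Give the answer as a comma-separated list."),
        input_desc="Text containing city names",
        output_desc="Comma-separated country names",
        examples=[("Paris, Tokyo", "France, Japan"), ("Berlin", "Germany")],
        input_text="Madrid, Cairo",
    )
    print(prompt)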

You can now run the recipe.

The output dataset contains the generated text in the following columns; a sketch for inspecting them follows the list.

  • A column named after the “output description” contains the GPT-generated data

  • The “gpt_response” column contains the raw API response

  • The “gpt_error_message” and “gpt_error_type” columns contain errors, if any
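
The error columns make it easy to check how many rows failed. The sketch below can be run from a Python notebook or recipe in DSS; the dataset name “gpt_output” is a placeholder, not something created by the plugin.

    # Inspect the recipe output from a DSS Python notebook or recipe.
    # "gpt_output" is a placeholder; replace it with your output dataset name.
    import dataiku

    df = dataiku.Dataset("gpt_output").get_dataframe()

    # Rows for which the API returned an error, if any
    failed = df[df["gpt_error_message"].notna()]
    print(f"{len(failed)} failed rows out of {len(df)}")
    print(failed[["gpt_error_type", "gpt_error_message"]].head())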

Advanced usage and settings

Advanced options at preset level

  • The default Concurrency parameter means that 4 threads will call the API in parallel. We do not recommend changing this default parameter unless your server has a much higher number of CPU cores.

  • The Maximum Attempts setting means that if an API request fails, it will be reattempted (3 reattempts by default). The request is retried regardless of why it failed, whether because of, for example, an access error with your OpenAI account or a throttling exception due to too many concurrent requests. A simplified sketch of this retry behavior follows the list.

  • The Waiting Interval specifies how long to wait before retrying a failed attempt (default 5 seconds). In case of a throttling exception due to too many requests, increasing the Waiting Interval may help; however, we recommend first decreasing the Concurrency setting.
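
As a rough illustration of how the Maximum Attempts and Waiting Interval settings interact, here is a simplified retry loop. It is not the plugin's implementation; as described above, any exception is treated the same way.

    # Simplified sketch of the retry behavior described by the preset options.
    # Not the plugin's actual code.
    import time

    MAX_ATTEMPTS = 3       # "Maximum Attempts" preset setting
    WAITING_INTERVAL = 5   # "Waiting Interval" preset setting, in seconds

    def call_with_retries(api_call):
        last_error = None
        for attempt in range(MAX_ATTEMPTS):
            try:
                return api_call()
            except Exception as error:  # access errors and throttling are retried alike
                last_error = error
                if attempt < MAX_ATTEMPTS - 1:
                    time.sleep(WAITING_INTERVAL)
        raise last_error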

Advanced options at recipe level

  • The Temperature parameter controls the imagination of the model. It can be between 0 and 1. Try a lower value for factual generations or a higher one for more creative outputs.

  • The Max tokens parameter determines the maximum tokens the model will generate per row. As OpenAI charges based on the number of tokens, lowering this value can help control costs.

See the OpenAI Documentation for more details.
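
The sketch below shows how these two recipe options map onto the corresponding parameters of the OpenAI chat API. The values are illustrative, and this is not the plugin's code.

    # Illustrative mapping of the recipe options onto OpenAI API parameters.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Which country is Madrid in?"}],
        temperature=0.2,  # Temperature: lower = more factual, higher = more creative
        max_tokens=64,    # Max tokens: caps the tokens generated per row, and thus the cost
    )
    print(response.choices[0].message.content)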

Generating without input

The recipe can also be used without input data. For example “Give me the list of cities with more than 5 million inhabitants”.

In order to use that mode, you must create the recipe from the “+ Recipe” button in the Flow, by selecting “Natural Language Processing”, then “OpenAI GPT Text completion”. Create an output dataset.

Enable “output-only” mode.

You then only set the task, the output description, some examples, and how many rows of output you want.
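
Conceptually, output-only generation amounts to a completion call with no per-row input, only the task. The sketch below is an assumption about that behavior, not the plugin's code; it simply splits the answer into one value per line.

    # Illustrative sketch of "output-only" generation: a task with no input column.
    # Not the plugin's actual code.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Give me the list of cities with more than 5 million inhabitants. "
                        "Return one city per line, at most 10 cities."),
        }],
    )
    rows = response.choices[0].message.content.splitlines()
    print(rows)  # one generated value per output row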