PII detection

PII detection in the LLM Mesh can detect various forms of PII in your prompts and queries, and either block or redact the queries.

Setup

You will need a setup with full outgoing Internet connectivity for downloading the models. Air-gapped setups are not supported.

Install and enable the PII detection code env

To run PII detection, you need a dedicated code environment (see Code environments) with the required packages.

On self-managed DSS

  • In “Administration > Settings > Misc”, in the “PII detection code environment” section, select a Python interpreter in the list and click “Create code environment”

  • In “Administration > Settings > LLM Mesh”, in the “PII Detection” section, select “Use internal code env”

On Dataiku Cloud

  • Create a new Python 3.9 code env

  • In “Packages to install”, add the following packages:

presidio_anonymizer
presidio_analyzer
langdetect

  • In “Resources”, enter the following:

import spacy
spacy.cli.download("en_core_web_md")
spacy.cli.download("fr_core_news_sm")
spacy.cli.download("de_core_news_sm")
spacy.cli.download("de_core_news_md")
spacy.cli.download("it_core_news_sm")
spacy.cli.download("ja_core_news_md")
spacy.cli.download("nl_core_news_sm")
spacy.cli.download("es_core_news_sm")

  • Click “Save and update”

  • In the launchpad, go to the code env tab and set the code env you just created as default for “PII Detection”

  • (Legacy) If you are using this setup but are not on Dataiku Cloud, do the following instead: in “Administration > Settings > LLM Mesh”, in the “PII Detection” section, select the code env you just created
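
Once the code env has been built, one way to sanity-check it is to confirm that the spaCy models from the “Resources” script are importable. The sketch below assumes it is run inside the PII detection code env (for example, from a notebook that uses it). The helper name missing_models is hypothetical and not part of DSS; it relies only on the fact that spaCy models are installed as regular Python packages.

```python
import importlib.util
from typing import Iterable, List

# Model names taken from the "Resources" script above.
EXPECTED_MODELS = [
    "en_core_web_md",
    "fr_core_news_sm",
    "de_core_news_sm",
    "de_core_news_md",
    "it_core_news_sm",
    "ja_core_news_md",
    "nl_core_news_sm",
    "es_core_news_sm",
]


def missing_models(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found as importable packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_models(EXPECTED_MODELS)
    if missing:
        print("Missing spaCy models:", ", ".join(missing))
    else:
        print("All expected spaCy models are installed.")
```

An empty result suggests the downloads succeeded; any listed name points to a model that the “Resources” script did not manage to install.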

Enable PII detection in the connection

In the LLM connection that you wish to protect, click “PII detection (queries)” > “Add detector”. You can select whether to:

  • Reject queries where PII is detected

  • Replace PII with a placeholder, such as “John Smith” -> “<PERSON>”

  • Replace PII with a hash value, such as “John Smith” -> “0aa12bc86bd123bd”

  • Remove PII, such as “I said hello to John Smith” -> “I said hello to”

  • Replace parts of PII with stars, such as “His phone number was (570) 123-4567” -> “His phone number was ********567”
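
To make the difference between these actions concrete, here is a minimal standard-library sketch of how each one could transform a query once PII spans have been located. It is an illustration only, not the DSS or Presidio implementation: the function apply_action, the exception PIIRejected, the span format, the SHA-256 hash truncated to 16 characters and the “keep the last 3 characters” masking rule are all assumptions made for the example.

```python
import hashlib
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


class PIIRejected(Exception):
    """Raised by the 'reject' action when at least one PII span is present."""


def apply_action(text: str, spans: List[Span], action: str) -> str:
    """Apply one of the five actions to every detected span in the text."""
    if action == "reject":
        if spans:
            raise PIIRejected(f"{len(spans)} PII span(s) detected")
        return text

    # Work from right to left so that earlier offsets remain valid.
    for start, end, entity_type in sorted(spans, reverse=True):
        value = text[start:end]
        if action == "placeholder":
            new = f"<{entity_type}>"
        elif action == "hash":
            # Assumption: truncated SHA-256; the real hash scheme may differ.
            new = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
        elif action == "remove":
            new = ""
        elif action == "mask":
            # Assumption: star out everything except the last 3 characters.
            keep = 3
            new = "*" * max(len(value) - keep, 0) + value[-keep:]
        else:
            raise ValueError(f"Unknown action: {action}")
        text = text[:start] + new + text[end:]
    return text


query = "I said hello to John Smith"
start = query.index("John Smith")
spans = [(start, start + len("John Smith"), "PERSON")]
print(apply_action(query, spans, "placeholder"))  # I said hello to <PERSON>
```

Rejecting is the only action that stops the query from reaching the LLM; the other four let a modified query through.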

Detected PII types

The following entity types are recognized:

Generic entities:

  • CREDIT_CARD

  • DATE_TIME

  • EMAIL_ADDRESS

  • IBAN_CODE

  • IP_ADDRESS

  • LOCATION

  • PERSON

  • PHONE_NUMBER

  • MEDICAL_LICENSE

  • URL

Country-specific entities:

  • US_BANK_NUMBER

  • US_DRIVER_LICENSE

  • US_ITIN

  • US_PASSPORT

  • US_SSN

  • UK_NHS

  • ES_NIF

  • IT_FISCAL_CODE

  • IT_DRIVER_LICENSE

  • IT_VAT_CODE

  • IT_PASSPORT

  • IT_IDENTITY_CARD

  • SG_NRIC_FIN

  • AU_ABN

  • AU_ACN

  • AU_TFN

  • AU_MEDICARE
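
Some of these entity types, such as EMAIL_ADDRESS and IP_ADDRESS, have a regular textual shape and lend themselves to pattern matching, whereas others, such as PERSON or LOCATION, generally need a language model. The sketch below illustrates the pattern-based idea with two deliberately simplified regular expressions. The patterns and the detect function are assumptions made for this example; they are not the recognizers used by DSS or Presidio and will miss or over-match real-world cases (the IP pattern covers IPv4 only and does not check octet ranges).

```python
import re
from typing import List, Tuple

# Simplified, illustrative patterns (assumptions, not production recognizers).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, entity_type) for every pattern match, in text order."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), entity_type))
    return sorted(found)


sample = "Mail bob@example.com from 10.0.0.1"
for start, end, entity_type in detect(sample):
    print(entity_type, sample[start:end])
```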

Details

PII detection is based on the Microsoft Presidio library: https://microsoft.github.io/presidio