PII detection¶
PII detection in the LLM Mesh detects various forms of personally identifiable information (PII) in your prompts and queries, and can either block or redact the affected queries.
Setup¶
You will need a setup with full outgoing Internet connectivity for downloading the models. Air-gapped setups are not supported.
Install and enable the PII detection code env¶
In order to run PII detection, you need a dedicated code environment (see Code environments) with the appropriate packages.
On self-managed DSS¶
In “Administration > Code envs > Internal envs setup”, in the “PII detection code environment” section, select a Python interpreter from the list and click “Create code environment”
In “Administration > Settings > LLM Mesh”, in the “PII Detection” section, select “Use internal code env”
On Dataiku Cloud¶
Create a new Python 3.9 code env
In “Packages to install”, add the following packages:
presidio_anonymizer
presidio_analyzer
langdetect
In “Resources”, enter the following:
import spacy
spacy.cli.download("en_core_web_md")
spacy.cli.download("fr_core_news_sm")
spacy.cli.download("de_core_news_sm")
spacy.cli.download("de_core_news_md")
spacy.cli.download("it_core_news_sm")
spacy.cli.download("ja_core_news_md")
spacy.cli.download("nl_core_news_sm")
spacy.cli.download("es_core_news_sm")
Click “Save and update”
In the launchpad, go to the code env tab and set the code env you just created as default for “PII Detection”
(Legacy) If you are using this manual setup outside Dataiku Cloud, replace the previous step with the following: in “Administration > Settings > LLM Mesh”, in the “PII Detection” section, select the code env you just created
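To check that the resources script installed every language model, a small sanity check along the following lines can be run from a notebook or recipe that uses the new code env. This is only a sketch: it assumes that each spaCy model is installed as an importable Python package with the same name as the one passed to `spacy.cli.download`, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util
from typing import Iterable, List

# Model names copied from the "Resources" script above.
EXPECTED_MODELS = [
    "en_core_web_md",
    "fr_core_news_sm",
    "de_core_news_sm",
    "de_core_news_md",
    "it_core_news_sm",
    "ja_core_news_md",
    "nl_core_news_sm",
    "es_core_news_sm",
]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found as importable top-level packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: this runs inside the code env created above.
missing = missing_packages(EXPECTED_MODELS)
print("Missing spaCy models:", ", ".join(missing) if missing else "none")
```

If any model is reported as missing, updating the code env again re-runs the resources script.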
Enable PII detection in the connection¶
In the LLM connection that you wish to protect, click “PII detection (queries)” > “Add detector”. You can select whether to:
Reject queries where PII is detected
Replace PII with a placeholder, such as “John Smith” -> “<PERSON>”
Replace PII with a hash value, such as “John Smith” -> “0aa12bc86bd123bd”
Remove PII, such as “I said hello to John Smith” -> “I said hello to”
Replace parts of PII with stars, such as “His phone number was (570) 123-4567” -> “His phone number was ********567”
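The five actions above can be illustrated with a short standalone sketch. It is not how DSS or Presidio implements them: the detected span is hard-coded rather than produced by a detector, and the `apply_action` helper, the `PIIRejected` exception, the action names, the SHA-256 hash truncated to 16 characters, and the choice to keep the last three characters when masking are all assumptions made for illustration.

```python
import hashlib
from typing import List, Tuple

# A detected entity: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


class PIIRejected(Exception):
    """Raised when a query is rejected because it contains PII."""


def apply_action(text: str, spans: List[Span], action: str) -> str:
    """Apply one of the five actions to every detected span in text."""
    if action == "reject":
        if spans:
            raise PIIRejected(f"{len(spans)} PII entity(ies) detected")
        return text

    result = text
    # Work from right to left so earlier offsets stay valid after each edit.
    # Assumes the spans do not overlap.
    for start, end, entity_type in sorted(spans, reverse=True):
        value = text[start:end]
        if action == "placeholder":
            replacement = f"<{entity_type}>"
        elif action == "hash":
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            replacement = digest[:16]
        elif action == "remove":
            replacement = ""
        elif action == "mask":
            keep = 3  # hypothetical: leave the last three characters visible
            replacement = "*" * max(len(value) - keep, 0) + value[-keep:]
        else:
            raise ValueError(f"Unknown action: {action}")
        result = result[:start] + replacement + result[end:]
    return result


text = "I said hello to John Smith"
start = text.index("John Smith")
spans = [(start, start + len("John Smith"), "PERSON")]

print(apply_action(text, spans, "placeholder"))  # I said hello to <PERSON>
print(apply_action(text, spans, "mask"))         # I said hello to *******ith
```

Rejecting is the only action that stops the query; the other four let a modified query through to the LLM.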
Detected PII types¶
The following entity types are recognized:
Generic entities:
CREDIT_CARD
DATE_TIME
EMAIL_ADDRESS
IBAN_CODE
IP_ADDRESS
LOCATION
PERSON
PHONE_NUMBER
MEDICAL_LICENSE
URL
Country-specific entities:
US_BANK_NUMBER
US_DRIVER_LICENSE
US_ITIN
US_PASSPORT
US_SSN
UK_NHS
ES_NIF
IT_FISCAL_CODE
IT_DRIVER_LICENSE
IT_VAT_CODE
IT_PASSPORT
IT_IDENTITY_CARD
SG_NRIC_FIN
AU_ABN
AU_ACN
AU_TFN
AU_MEDICARE
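Some of these entity types can be recognized from their format alone, often combined with a validity check. As an illustration for CREDIT_CARD, the sketch below pairs a digit pattern with a Luhn checksum, which to our understanding is also the approach taken by Presidio's credit card recognizer. It is a simplified standalone version, not Presidio's code: the regular expression and the function names `luhn_valid` and `find_credit_cards` are assumptions.

```python
import re
from typing import List

# Simplified pattern (an assumption): 13 to 19 digits,
# optionally separated by single spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Starting from the rightmost digit, double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_credit_cards(text: str) -> List[str]:
    """Return substrings that look like card numbers and pass the checksum."""
    return [
        m.group(0)
        for m in CARD_PATTERN.finditer(text)
        if luhn_valid(m.group(0))
    ]


print(find_credit_cards("Card 4111 1111 1111 1111 was charged"))
```

The checksum filters out most random digit sequences that merely have the right length, which reduces false positives compared with the pattern alone.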
Details¶
PII detection is based on the Microsoft Presidio library: https://microsoft.github.io/presidio