Entity Extraction

Extract structured entities (prices, emails, phones, PII) from text for policy evaluation.

The veto-sdk/extractors module provides deterministic, regex-based entity extraction from arbitrary text. It detects prices, emails, phone numbers, salary figures, equity percentages, government IDs, credit cards, and API keys.

This is the same extraction engine used by the Veto browser extension to populate arguments.extracted_entities in browser agent contexts.

Installation

import { extractEntities } from 'veto-sdk/extractors';
// or
import { extractEntities } from 'veto-sdk';

extractEntities(text, options?)

function extractEntities(
  text: string,
  options?: ExtractEntitiesOptions,
): ExtractedEntities

Returns an ExtractedEntities object. Returns empty defaults if text is shorter than 3 characters. Text is capped at textCap characters before processing (default 200,000).
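
The two input guards described above can be pictured with a small sketch. capInput is a hypothetical helper, not an SDK export, and it assumes that "characters" means UTF-16 code units as counted by String.length.

```typescript
// Hypothetical helper, not part of veto-sdk: models the documented input guards.
function capInput(text: string, textCap: number = 200_000): string | null {
  // Fewer than 3 characters: signal that the caller should return empty defaults.
  if (text.length < 3) return null;
  // Everything past the cap is ignored.
  return text.slice(0, textCap);
}
```

A null result here stands in for "return empty defaults"; the SDK itself returns an ExtractedEntities object in that case.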

ExtractedEntities

| Field | Type | Description |
| --- | --- | --- |
| prices | number[] | Prices found in the text. Values at or above $1,000,000 are excluded. |
| max_price | number | Highest price in prices, or 0 if none. |
| min_price | number | Lowest price in prices, or 0 if none. |
| emails | string[] | Deduplicated emails, lowercased. |
| phone_numbers | string[] | Deduplicated phone numbers that pass length heuristics. |
| salary_figures | number[] | Salary/compensation amounts. Range: $1,001–$9,999,999. |
| has_salary_figures | boolean | true if at least one salary figure was found. |
| equity_percentages | number[] | Equity percentages in the range 0–100. |
| has_equity_info | boolean | true if at least one equity percentage was found. |
| sensitive_terms | string[] | Labels for the sensitive entity types detected: salary, equity, gov_id, credit_card, api_key, email, phone. |
| has_sensitive_pii | boolean | true if sensitive_terms is non-empty. |
| has_credit_cards | boolean | true if a Luhn-valid 16-digit card number was found. |
| has_gov_ids | boolean | true if a government ID pattern matched. |
| has_api_keys | boolean | true if an API key pattern matched. |
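
For reference, the fields above can be written as a TypeScript interface. This is a sketch reconstructed from the table, not the SDK's own type declaration. The names ExtractedEntitiesSketch and emptyEntitiesSketch are hypothetical, and the empty values (empty arrays, 0, false) are an assumption about what "empty defaults" means.

```typescript
// Sketch only: field names and types come from the table above.
interface ExtractedEntitiesSketch {
  prices: number[];
  max_price: number;
  min_price: number;
  emails: string[];
  phone_numbers: string[];
  salary_figures: number[];
  has_salary_figures: boolean;
  equity_percentages: number[];
  has_equity_info: boolean;
  sensitive_terms: string[];
  has_sensitive_pii: boolean;
  has_credit_cards: boolean;
  has_gov_ids: boolean;
  has_api_keys: boolean;
}

// Assumed "empty defaults": no entities, zeroed prices, all flags false.
function emptyEntitiesSketch(): ExtractedEntitiesSketch {
  return {
    prices: [],
    max_price: 0,
    min_price: 0,
    emails: [],
    phone_numbers: [],
    salary_figures: [],
    has_salary_figures: false,
    equity_percentages: [],
    has_equity_info: false,
    sensitive_terms: [],
    has_sensitive_pii: false,
    has_credit_cards: false,
    has_gov_ids: false,
    has_api_keys: false,
  };
}
```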

ExtractEntitiesOptions

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| maxPrices | number | 100 | Maximum number of prices to collect. |
| maxEmails | number | 50 | Maximum number of emails to collect. |
| maxPhones | number | 50 | Maximum number of phone numbers to collect. |
| maxSalaryFigures | number | 50 | Maximum number of salary figures to collect. |
| maxEquityPercentages | number | 50 | Maximum number of equity percentages to collect. |
| textCap | number | 200000 | Characters to process. Text beyond this limit is ignored. |
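
One plausible way to picture how partial options combine with the defaults listed above is a simple object merge. The names ExtractOptionsSketch, DEFAULT_OPTIONS_SKETCH, and resolveOptionsSketch are hypothetical, and the SDK may resolve options differently.

```typescript
// Sketch only: option names and default values are taken from the table above.
interface ExtractOptionsSketch {
  maxPrices: number;
  maxEmails: number;
  maxPhones: number;
  maxSalaryFigures: number;
  maxEquityPercentages: number;
  textCap: number;
}

const DEFAULT_OPTIONS_SKETCH: ExtractOptionsSketch = {
  maxPrices: 100,
  maxEmails: 50,
  maxPhones: 50,
  maxSalaryFigures: 50,
  maxEquityPercentages: 50,
  textCap: 200_000,
};

// Caller-supplied values override the defaults; omitted keys keep them.
function resolveOptionsSketch(
  options: Partial<ExtractOptionsSketch> = {},
): ExtractOptionsSketch {
  return { ...DEFAULT_OPTIONS_SKETCH, ...options };
}
```

With the real function, the equivalent call would look like extractEntities(text, { maxEmails: 5 }), leaving every other limit at its default.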

Supported entity types

Prices

Multi-currency: USD ($), EUR (€), GBP (£), JPY (¥), INR (₹), KRW (₩), CHF, AUD, CAD, CNY. Currency code prefixes (USD 1,200) are also matched. Values at or above $1,000,000 are excluded.

$49.99   EUR 1,200   ¥3000
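
A minimal sketch of how symbol-prefixed and code-prefixed prices could be matched and summarized into prices, max_price, and min_price. The pattern and the sketchPrices helper are illustrative assumptions, much simpler than whatever the SDK actually uses.

```typescript
// Illustrative only: a simplified pattern, not the one veto-sdk ships.
// A currency symbol or ISO code prefix, then an amount with optional
// thousands separators and up to two decimals.
const PRICE_RE_SKETCH =
  /(?:[$€£¥₹₩]\s?|\b(?:USD|EUR|GBP|JPY|INR|KRW|CHF|AUD|CAD|CNY)\s?)(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?/g;

function sketchPrices(
  text: string,
  maxPrices: number = 100,
): { prices: number[]; max_price: number; min_price: number } {
  const prices: number[] = [];
  for (const raw of text.match(PRICE_RE_SKETCH) ?? []) {
    if (prices.length >= maxPrices) break;
    // Drop the currency prefix and thousands separators before parsing.
    const value = Number(raw.replace(/[^\d.]/g, ""));
    // Mirror the documented cutoff: values at or above 1,000,000 are excluded.
    if (Number.isFinite(value) && value < 1_000_000) prices.push(value);
  }
  return {
    prices,
    max_price: prices.length > 0 ? Math.max(...prices) : 0,
    min_price: prices.length > 0 ? Math.min(...prices) : 0,
  };
}
```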

Emails

RFC-style pattern with length limits: 64-character local part, 255-character domain. Deduplicated case-insensitively.

user@example.com   hr+payroll@company.co.uk
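
The lowercase-then-deduplicate behavior can be sketched as follows. The regex is a simplified stand-in for the SDK's RFC-style pattern, and sketchEmails is a hypothetical name.

```typescript
// Illustrative only: a simplified pattern with a 64-character local part limit.
const EMAIL_RE_SKETCH = /[A-Za-z0-9._%+-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,}/g;

function sketchEmails(text: string, maxEmails: number = 50): string[] {
  const seen = new Set<string>();
  for (const raw of text.match(EMAIL_RE_SKETCH) ?? []) {
    if (seen.size >= maxEmails) break;
    // Lowercasing before insertion makes the deduplication case-insensitive.
    seen.add(raw.toLowerCase());
  }
  return Array.from(seen);
}
```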

Phone numbers

Matches international numbers (leading + and country code) and domestic numbers. Short numeric sequences are filtered out: international numbers need at least 8 digits, domestic numbers at least 10.

+1 415 555 0100   (800) 867-5309
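
The digit-count heuristic can be sketched like this: find phone-shaped candidates, strip everything but digits, then apply the 8-digit or 10-digit minimum. The candidate pattern and the sketchPhones helper are assumptions for illustration only.

```typescript
// Illustrative only: an optional +country code, an optionally parenthesized
// group, then 2 to 4 further digit groups separated by space, dot, or hyphen.
const PHONE_RE_SKETCH = /(?:\+\d{1,3}[\s.-]?)?\(?\d{2,4}\)?(?:[\s.-]?\d{2,4}){2,4}/g;

function sketchPhones(text: string, maxPhones: number = 50): string[] {
  const seen = new Set<string>();
  for (const raw of text.match(PHONE_RE_SKETCH) ?? []) {
    if (seen.size >= maxPhones) break;
    const digitCount = raw.replace(/\D/g, "").length;
    // International numbers (leading +) need 8+ digits, domestic ones 10+.
    const minDigits = raw.charAt(0) === "+" ? 8 : 10;
    if (digitCount >= minDigits) seen.add(raw);
  }
  return Array.from(seen);
}
```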

Salary figures

Keyword-anchored: salary, compensation, comp, pay, wage, income, base, total comp, OTE, CTC. Supports K suffix ($150K). Required range: $1,001–$9,999,999.

Base salary: $120,000   OTE $200K/yr   compensation: EUR 85,000
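
Keyword anchoring, the K suffix, and the range check can be sketched together. The regex below uses a reduced keyword list and a 20-character gap between keyword and amount; both are assumptions, as is the sketchSalaries name.

```typescript
// Illustrative only: not the SDK's pattern.
function sketchSalaries(text: string, maxSalaryFigures: number = 50): number[] {
  // A keyword, a short non-digit gap (currency symbols, colons), an amount,
  // and an optional K suffix.
  const re =
    /\b(?:salary|compensation|comp|pay|wage|income|base|OTE|CTC)\b[^\d\n]{0,20}?(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)(k)?/gi;
  const figures: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null && figures.length < maxSalaryFigures) {
    const base = Number((m[1] ?? "").replace(/,/g, ""));
    const amount = m[2] ? base * 1000 : base; // "$150K" becomes 150000
    // Documented range: 1,001 to 9,999,999.
    if (amount >= 1001 && amount <= 9_999_999) figures.push(amount);
  }
  return figures;
}
```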

Equity percentages

Keyword-anchored: the percentage must be followed by one of equity, vesting, options, ownership, stake, shares, stock, RSUs, ESOP. Range: 0–100%.

0.5% equity   2% vesting   15% ESOP
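
A sketch of the "percentage followed by keyword" rule. The regex and the sketchEquity name are illustrative assumptions rather than the SDK's implementation.

```typescript
// Illustrative only: a number, a percent sign, then one of the equity keywords.
function sketchEquity(text: string, maxEquityPercentages: number = 50): number[] {
  const re =
    /(\d+(?:\.\d+)?)\s?%\s*(?:equity|vesting|options|ownership|stake|shares|stock|RSUs|ESOP)\b/gi;
  const percentages: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null && percentages.length < maxEquityPercentages) {
    const pct = Number(m[1]);
    // Documented range: 0 to 100.
    if (pct >= 0 && pct <= 100) percentages.push(pct);
  }
  return percentages;
}
```

A percentage with no trailing keyword, such as "20% discount", is not picked up by this sketch.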

Government IDs

Three patterns:

  • US SSN: XXX-XX-XXXX
  • UK NIN: XX XX XX XX X
  • US EIN: XX-XXXXXXX

Detection is boolean — matched IDs are not stored in the return value.
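
The three shapes can be approximated with one regex each. These patterns are assumptions: they read X as a digit for the US formats and assume the UK NIN is two letters, six digits, and a final letter. The SDK's real patterns may be stricter.

```typescript
// Illustrative only. No global flag, so test() carries no state between calls.
const GOV_ID_PATTERNS_SKETCH: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/, // US SSN: XXX-XX-XXXX
  /\b[A-Z]{2} \d{2} \d{2} \d{2} [A-Z]\b/, // UK NIN: XX XX XX XX X
  /\b\d{2}-\d{7}\b/, // US EIN: XX-XXXXXXX
];

// Boolean detection only: the matched ID itself is never returned.
function sketchHasGovIds(text: string): boolean {
  return GOV_ID_PATTERNS_SKETCH.some((re) => re.test(text));
}
```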

Credit cards

16-digit patterns (XXXX-XXXX-XXXX-XXXX or space-separated). Luhn checksum validation reduces false positives. Detection is boolean.
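
The Luhn checksum itself is a standard algorithm: walking from the rightmost digit, double every second digit, subtract 9 from any result above 9, and require the total to be a multiple of 10. The sketch below pairs it with an assumed candidate pattern; luhnValid and sketchHasCreditCards are hypothetical names, not SDK exports.

```typescript
// Standard Luhn check over a string of digits.
function luhnValid(digits: string): boolean {
  if (digits.length === 0) return false;
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // '0' is char code 48
    if (d < 0 || d > 9) return false; // reject non-digit characters
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Assumed candidate shape: four groups of four digits sharing one separator
// (all hyphens or all spaces). Detection is boolean.
function sketchHasCreditCards(text: string): boolean {
  const candidates = text.match(/\b\d{4}([ -])\d{4}\1\d{4}\1\d{4}\b/g) ?? [];
  return candidates.some((c) => luhnValid(c.replace(/\D/g, "")));
}
```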

API keys

Tokens starting with sk, pk, api, key, token, secret, or bearer, followed by 20+ alphanumeric characters. Case-insensitive. Detection is boolean.

sk-abc123...   Bearer eyJhbGc...   api_key_ABCD...
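
A rough sketch of the prefix-plus-long-token rule. How the SDK treats separators between the prefix and the token is not stated here, so the separator set below (hyphen, underscore, whitespace, colon, equals) is an assumption, as is the sketchHasApiKeys name.

```typescript
// Illustrative only: one or more known prefixes, each followed by separators,
// then at least 20 alphanumeric characters. Case-insensitive, no global flag.
const API_KEY_RE_SKETCH =
  /\b(?:(?:sk|pk|api|key|token|secret|bearer)[-_\s:=]+)+[A-Za-z0-9]{20,}/i;

// Boolean detection only: the key itself is never returned.
function sketchHasApiKeys(text: string): boolean {
  return API_KEY_RE_SKETCH.test(text);
}
```

A pattern this loose will also flag ordinary long words that follow a prefix word, so treat it as a demonstration of the rule rather than a production filter.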

Usage with rules

Extracted entities can feed directly into evaluateRulesLocally via the arguments.extracted_entities field, which is the same path the browser extension populates.

import { extractEntities } from 'veto-sdk/extractors';
import { evaluateRulesLocally } from 'veto-sdk';

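// pageText (the text to scan) and rules (your loaded rule set) are assumed to be defined by the caller.
const entities = extractEntities(pageText);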
const result = evaluateRulesLocally(rules, 'browser_click', {
  arguments: { extracted_entities: entities }
});

Example rule that blocks actions when salary data is present:

- id: block-salary-exfil
  name: Block actions on pages with salary data
  enabled: true
  severity: high
  action: block
  tools: [browser_click, form_submit]
  conditions:
    - field: arguments.extracted_entities.has_salary_figures
      operator: equals
      value: true
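
To make the condition concrete, here is a sketch of how a dotted field path with the equals operator might be resolved against the evaluation context. This is not the SDK's evaluator; ConditionSketch, getPathSketch, and conditionMatchesSketch are hypothetical names.

```typescript
// Illustrative only: models a single "equals" condition from the rule above.
type ConditionSketch = { field: string; operator: "equals"; value: unknown };

// Walk a dotted path such as "arguments.extracted_entities.has_salary_figures".
function getPathSketch(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// A missing path resolves to undefined, which never equals true.
function conditionMatchesSketch(cond: ConditionSketch, context: unknown): boolean {
  return getPathSketch(context, cond.field) === cond.value;
}
```

Under this reading, the rule fires only when the has_salary_figures flag in the supplied context is exactly true.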