
Using Canary Tokens to Detect Sensitive Data Leakage Across Systems






Traditional Canary Tokens vs. Leak-Detection Canaries

Traditional canary tokens are planted secrets (e.g. fake API keys, document IDs, or honeypot URLs) that security teams use to detect malicious use: when an attacker exfiltrates or reuses a stolen credential, the canary "trips" and alerts the team. The goal is to know that a secret was used by a bad actor.

Leak-detection canaries serve a different purpose. You plant fake but realistic-looking values (fake SSNs, card numbers, passwords, tokens) into real application flows (demo accounts, dev data, test fixtures) and then monitor internal and third-party systems for those values. If a canary value shows up in an API response, a customer data file, Datadog logs, a Sentry error, a CI log, or a support dashboard, you've found a code path or integration that is leaking sensitive data. The goal is to find where your systems are exposing data, so you can fix the root cause before real PII or secrets are exposed the same way.

This approach is simple: plant → monitor → alert → fix. It complements (and often outperforms) broad regexes or AI-based scanners because the signal is unambiguous: if you see your canary, something in your stack is leaking it.


Why This Complements Generic Detection

Generic detection has real limits:

  • Regex and pattern matching: Hard to scale. You either miss variants (e.g. masked SSNs, different formats) or get huge numbers of false positives from non-sensitive data that looks like SSNs or card numbers. Tuning is painful.
  • AI or ML scanners: Can help with classification, but they add cost, latency, and operational complexity. They also struggle with "is this our data or test data?" You want to know that our systems leaked our canary.
  • Manual review: Doesn't scale. You can't eyeball every log line or error.

Canary-based leak detection gives you high-confidence, low-noise signal:

  • Deterministic: You control the exact values. A match in a log or error means "this system received or stored our canary." No guessing.
  • Attributable: You can use different canaries per environment, service, or even per test fixture (e.g. canary-ssn-demo-us, canary-card-payments-dev). When one appears, you know which flow or dataset it came from.
  • Real-time: Set up alerts (e.g. Slack, PagerDuty) on canary matches. As soon as a canary appears in Datadog, Sentry, or a CI log, the team can investigate.
  • Scales with your stack: Add more canaries as you add services or integrations; add more monitors as you add sinks (new log aggregator, new error tracker, new pipeline).

You're not replacing regexes or scanners; you're adding a targeted way to discover real leaks through your code and your services.


Where to Plant Canaries: Demo and Dev Accounts

Customer demo accounts and dev/test accounts are ideal places to plant canary values. They run through real application code, real APIs, and real integrations, but they don't contain real customer PII or production secrets. If a canary leaks, you've found a bug or misconfiguration without exposing real data.

Why demo and dev accounts work well

  • Real code paths: Demo and dev flows use the same services, SDKs, and third-party calls as production. A canary in a demo user's "SSN" or "card number" will follow the same paths as real data.
  • Safe to share: Demo environments are often used by sales, support, or partners. Any leak of demo data is embarrassing but not a breach of real PII. Canaries make leaks visible.
  • Controlled variety: You can create multiple demo users or dev fixtures with different canaries (e.g. one with canary-ssn-001, another with canary-ssn-002) and attribute leaks to specific flows or features.
  • CI and tests: Automated tests that use fixtures with canary SSNs, card numbers, or fake API keys push those values through code paths that run in CI. If CI logs or artifacts ever leak, canaries will show up.

What to plant

Use realistic-format, fake values so that any code that treats them as real (logging, error reporting, analytics) will handle them the same way it would real data. Examples:

  • SSN: 078-05-1120 (IRS test SSN) or a custom canary-ssn-demo-01. Use a known test SSN or a unique string you can match with a regex.
  • Card number: a published test card (e.g. 4242 4242 4242 4242) or canary-card-demo-01. Stripe and others publish test numbers; or use a distinct prefix.
  • API key / secret: sk_canary_demo_xxxxxxxx or canary-secret-payments-dev. Looks like a key; easy to search for.
  • Password: CanaryPassword-Demo-01! or canary-pwd-demo. Never use a real password; use a string that cannot be mistaken for user input.
  • Email: canary-demo-01@yourcompany-internal-test.com. A domain you control; easy to filter.
  • Account ID / token: canary-acct-demo-001, canary-token-sentry-test. Unique enough to avoid collisions.

Important: Document every canary value and where it's used (which demo user, which fixture, which env). When it appears in a log or error, you need to trace it back to the source.
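To make that documentation executable, a small registry module can act as the single source of truth for planting and for later triage. This is a sketch; all values and field names are illustrative:

```python
# canaries.py -- single source of truth for canary values (illustrative).
# Document each value: what it mimics and exactly where it is planted.
CANARIES = {
    "canary-ssn-demo-01": {"type": "ssn", "planted_in": "demo user alice (us-demo env)"},
    "canary-card-demo-01": {"type": "card", "planted_in": "test payment fixture"},
    "sk_canary_demo_xxxxxxxx": {"type": "api_key", "planted_in": "dev env var PAYMENTS_KEY"},
    "canary-demo-01@yourcompany-internal-test.com": {"type": "email", "planted_in": "demo signup flow"},
}

def is_canary(value: str) -> bool:
    """True if this exact value is a known canary."""
    return value in CANARIES

def trace(value: str) -> str:
    """Map a leaked canary back to where it was planted."""
    meta = CANARIES.get(value)
    return meta["planted_in"] if meta else "unknown"
```

Both seeding scripts and monitors can import from this one module, so the value-to-location mapping never drifts.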


Where to Monitor: Sinks and Systems That Might See the Data

Canaries only help if you look for them in every place your data might end up. That includes internal systems and third-party services.

Internal systems

  • API responses: Search API responses for leaked canary values. You can use Burp Suite's search/match features or an API testing tool, or (preferred) add middleware that alerts whenever a canary appears in a response.
  • Files: Monitor data inside S3 objects, exported documents, and similar files.
  • Application logs (e.g. shipped to Datadog, Splunk, CloudWatch Logs): Search for canary values. Any match means a log statement (or serialization path) is including sensitive-looking data.
  • CI/CD logs and artifacts: Jenkins, GitHub Actions, GitLab CI, CircleCI, etc. If tests or deploy scripts log request/response bodies or env vars, canaries in test data can leak here.
  • Debug or support tools: Internal dashboards that show user payloads, support tools that display "last request," or replay tools that dump request bodies. Canaries in demo accounts will show up if those tools are fed the same data.
  • Data warehouse or analytics: If demo or dev data is synced into a warehouse or analytics pipeline, run periodic searches for canary values. Their presence may be intentional (e.g. demo segment) but if they appear in tables or columns that are supposed to be masked or excluded, you've found a leak.
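The preferred middleware approach for API responses can be sketched as a minimal WSGI wrapper. The pattern and alert hook here are illustrative; in practice the alert would go to Slack or PagerDuty:

```python
import re

# Illustrative canary pattern; adjust to your naming convention.
CANARY_RE = re.compile(rb"canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+")

class CanaryResponseMiddleware:
    """WSGI middleware sketch: inspect response bodies for canary values.

    `alert` is a placeholder callable; wire it to your paging system.
    """
    def __init__(self, app, alert=print):
        self.app = app
        self.alert = alert

    def __call__(self, environ, start_response):
        chunks = []
        for chunk in self.app(environ, start_response):
            if CANARY_RE.search(chunk):
                self.alert(f"canary in response for {environ.get('PATH_INFO')}")
            chunks.append(chunk)
        return chunks
```

Because it wraps the whole app, it catches canaries regardless of which handler or serializer produced them, at the cost of scanning every response body.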

Third-party services

  • Error and performance trackers: Sentry, Datadog APM, Rollbar, etc. These often capture request bodies, headers, or local variables. A canary in a demo user's payload that appears in a Sentry event means your error reporting is ingesting sensitive-looking data.
  • Log aggregators: Datadog Logs, Sumo Logic, Elastic, etc. Same idea: search for canary values. Matches indicate logs (or log processors) that are emitting data they shouldn't.
  • Support and CRM: Zendesk, Intercom, or custom tools that attach user context (e.g. "last 5 requests") to tickets. Demo canaries will appear if those tools receive full payloads.
  • Webhooks and outbound integrations: If you send events or payloads to partners or internal consumers, ensure they're not logging or storing full bodies. Sending canary payloads and checking the recipient's logs (if you have access) can reveal leaks on the other side.

Summary: what to monitor

  • API responses: Search for canary values in response bodies.
  • Datadog (logs, APM): Saved views or monitors that search for the canary regex; alert on match.
  • Sentry (and similar): Search events for canary strings; consider data-scrubbing rules so canaries are the only sensitive-looking values you allow, to test that scrubbing works.
  • CI/CD logs: Grep or log search for canary values in job output and artifacts.
  • Warehouse / analytics: Scheduled query or scan for canary values in sensitive or PII-tagged tables/columns.
  • Support / CRM: Periodic check (manual or automated) that demo context doesn't contain raw canaries in user-visible fields.

How to Identify and Define Patterns

To monitor at scale, you need consistent, searchable patterns for your canaries. That means naming conventions and regexes (or equivalent) that you can plug into Datadog, Sentry, or your SIEM.

1. Use a consistent prefix or format

Pick a pattern that is:

  • Unique so it doesn't appear in real data or third-party content (e.g. canary-, fake-, or a company-specific prefix like acme-canary-).
  • Structured so one regex can match many canaries (e.g. canary-(ssn|card|secret|pwd)-[a-z0-9-]+).
  • Documented in a single place (e.g. internal wiki or config) so everyone knows what to search for.

Example patterns:

canary-ssn-demo-01
canary-card-payments-dev
canary-secret-api-demo
canary-pwd-demo-account
fake-ssn-test-001

2. Build regex (or equivalent) for each sink

Once you have a convention, turn it into a search pattern.

Single canary (exact):

canary-ssn-demo-01

All SSN canaries:

canary-ssn-[a-z0-9-]+

All canaries (any type):

canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+

Fake SSN format (e.g. IRS test or custom):

078-05-1120|canary-ssn-

Fake card (Stripe test or custom):

4242\s*4242\s*4242\s*4242|canary-card-

Use the narrowest pattern that still catches the leaks you care about. For example, start with canary-ssn-demo-01 in one monitor; if you add more SSN canaries, switch to canary-ssn-[a-z0-9-]+.
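A sketch of turning the convention into one compiled, reusable pattern (the regex simply combines the examples above and is easy to extend):

```python
import re

# One pattern covering named canaries plus the fixed fake SSN and test card.
CANARY_RE = re.compile(
    r"canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+"
    r"|078-05-1120"
    r"|4242\s*4242\s*4242\s*4242"
)

def find_canaries(text: str) -> list[str]:
    """Return every canary-looking match in a blob of log or response text."""
    return [m.group(0) for m in CANARY_RE.finditer(text)]
```

The same regex string can then be pasted into Datadog monitors, Sentry searches, and CI greps, so every sink agrees on what counts as a canary.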

3. Map patterns to sinks

Keep a small matrix so you don't miss a sink:

  • canary-ssn-*: Datadog monitor + alert; Sentry search/alert; CI grep in logs; warehouse scheduled query.
  • canary-card-*: Datadog monitor + alert; Sentry search/alert; CI grep in logs; warehouse scheduled query.
  • canary-secret-*: Datadog monitor + alert; Sentry search/alert; CI grep in logs; warehouse N/A.
  • 078-05-1120: Datadog monitor + alert; Sentry search/alert; CI grep in logs; warehouse scheduled query.

4. Triage when a canary appears

When an alert fires:

  1. Identify the source: Which demo user, fixture, or test? Use your documentation of where each canary is planted.
  2. Identify the path: Which service, log statement, or integration sent the canary to this sink? Trace back (code, config, or pipeline).
  3. Fix the root cause: Remove or redact the sensitive-looking field from that path (e.g. don't log request body, scrub PII before sending to Sentry, exclude sensitive env vars from CI logs).
  4. Verify: Re-run the same flow and confirm the canary no longer appears in that sink. Optionally add a test or check that asserts "canary X does not appear in log/error output."

Over time, you'll build a list of "known bad" patterns (e.g. "never log req.body in this middleware") and "known good" patterns (e.g. "always use Sentry's scrub list for these fields").
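The verify step can be captured as a regression check. This sketch (flow and logger names are illustrative) captures everything a code path logs and asserts the canary never appears:

```python
import io
import logging

CANARY = "canary-ssn-demo-01"  # illustrative canary value

def run_flow(logger):
    # Simulated fixed code path: logs only the user id, never the payload.
    payload = {"user_id": "u-123", "ssn": CANARY}
    logger.info("processed user %s", payload["user_id"])

def captured_log_output():
    """Run the flow with a capturing handler and return everything it logged."""
    stream = io.StringIO()
    logger = logging.getLogger("leak-check")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        run_flow(logger)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()

# Regression check: the canary must never reach log output.
assert CANARY not in captured_log_output()
```

Running a check like this in CI turns "we fixed the leak" into an assertion that fails if the leaky log statement ever comes back.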


Using Semgrep to Find Leak-Prone Code Paths

Canaries tell you that something leaked at runtime. They don't tell you where in the code the leak happens until you triage. Semgrep can narrow the search: use it to find code patterns that are likely to send sensitive or user-controlled data into logs, error trackers, or CI output. Then fix those patterns proactively, or run canary flows and confirm that Semgrep-flagged code paths are no longer leaking.

What Semgrep can detect

Semgrep is static analysis: it looks at source code and matches patterns. For leak detection, you care about:

  • Logging request/response body or full context: Request bodies often contain PII, cards, or tokens. If they're passed to a logger, they can end up in Datadog, Splunk, or CI logs.
  • Sending unsanitized context to error trackers (Sentry, Rollbar, etc.): captureException(e, { extra: req.body }) or similar sends full payloads to Sentry. Canaries in demo payloads will show up there.
  • Logging or printing env vars in CI: Secrets or canary values in env vars can leak into CI logs or job output.
  • Serializing full objects (e.g. user, paymentMethod) in log statements: Even "debug" logs get shipped. Any place that logs a whole object may expose PII.

You don't need Semgrep to replace canary monitoring. Use Semgrep to find candidate leak points in code; use canaries to confirm that a path actually leaks and to catch regressions after you fix or add new code.

Example Semgrep rules

1. Flag logging of request body (Node/Express-style):

rules:
  - id: log-request-body
    pattern-either:
      - pattern: $LOG(..., $REQ.body, ...)
      - pattern: $LOG(..., $REQ.rawBody, ...)
    message: "Request body logged; may leak PII or canary data to log aggregators. Redact or remove."
    languages: [javascript]
    severity: WARNING

2. Flag passing request or user context into error capture (Sentry-style):

rules:
  - id: sentry-extra-request-body
    pattern: $SENTRY.captureException(..., { ..., extra: { ... $REQ ... } })
    message: "Request object passed to Sentry extra context; may leak PII. Use scrubbing or allowlist only safe fields."
    languages: [javascript]
    severity: WARNING

3. Flag logging full objects that might be user or payment data (generic):

rules:
  - id: log-user-or-payment-object
    patterns:
      - pattern: $LOG(..., $OBJ, ...)
      - metavariable-regex:
          metavariable: $OBJ
          regex: (user|paymentMethod|payment_method|req\.body)
    message: "User or payment-like object logged; risk of PII/secret leakage. Log only IDs or redacted fields."
    languages: [javascript]
    severity: WARNING

Adjust patterns to your stack (e.g. Python logging.info(req.body), Ruby Rails.logger.debug(params)). The goal is to surface every place that might send sensitive-looking data to a sink. Run Semgrep in CI so new leak-prone code is caught before merge; combine with canary runs to verify that fixes work and no new paths leak.

Using Semgrep to detect leaked canary values

Beyond leak-prone paths, you can use Semgrep to find canary values themselves in the codebase. The goal: catch places where a canary string literal appears in code that would send it to a sink (logs, error trackers, stdout) or in files where it doesn't belong (e.g. production code instead of test fixtures). If a canary value is passed directly to a logger or to Sentry, that's a canary leak in source; Semgrep can flag it before it ever hits Datadog or Sentry at runtime.

1. Find any canary value in code (inventory and wrong-place check):

Use a regex that matches your canary naming convention in the raw file. This surfaces every occurrence so you can confirm they're only in allowed locations (fixtures, demo config, test data).

rules:
  - id: canary-value-in-code
    pattern-regex: '(canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+|078-05-1120|canary-demo-01@\S+)'
    message: "Canary value in code. Ensure it's only in test/demo fixtures or config; remove from production code or logs."
    languages: [generic]
    severity: WARNING

To flag only when a canary appears outside allowed paths (e.g. not in fixtures or tests), add a path filter. In Semgrep you can use paths: exclude to ignore **/fixtures/, **/tests/, **/demo/**, and then the rule fires only when a canary shows up in production code or other disallowed areas.
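For example, the rule could carry a paths filter like this (directory globs are illustrative; adjust to your repository layout):

```yaml
rules:
  - id: canary-value-in-code
    pattern-regex: 'canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+'
    message: "Canary value outside allowed test/demo paths."
    languages: [generic]
    severity: WARNING
    paths:
      exclude:
        - "**/fixtures/**"
        - "**/tests/**"
        - "**/demo/**"
```

With the exclusions in place, the rule stays silent on intentional plantings and fires only where a canary has strayed into production code.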

2. Flag canary values passed to a log or print (explicit leak):

When the canary string literal is an argument to a logging or error-capture call, that code is directly leaking the canary. Match that pattern so you fix it in code, not only at runtime.

rules:
  - id: canary-value-logged
    patterns:
      - pattern-either:
          - pattern: $LOG(..., $ARG, ...)
          - pattern: console.log(..., $ARG, ...)
          - pattern: print(..., $ARG, ...)
      - metavariable-regex:
          metavariable: $ARG
          regex: '(canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+|078-05-1120)'
    message: "Canary value is passed to log/print; it will appear in log aggregators. Remove or redact."
    languages: [javascript]
    severity: WARNING

Duplicate for Python (logging.info, print), Ruby, etc., and reuse the same metavariable-regex so any canary literal in that argument position is flagged.

3. Flag canary values in error capture context (e.g. Sentry):

Same idea for error trackers: if a canary string is passed as part of extra or context, it will be sent to Sentry or similar.

rules:
  - id: canary-value-in-sentry-context
    patterns:
      - pattern: $SENTRY.captureException(..., { ... $EXTRA ... })
      - metavariable-regex:
          metavariable: $EXTRA
          regex: '(canary-(ssn|card|secret|pwd|token|acct)-[a-z0-9-]+|078-05-1120)'
    message: "Canary value sent to error tracker. Redact or remove from context/extra."
    languages: [javascript]
    severity: WARNING

You can use a single metavariable-regex list in a shared config and reference it from multiple rules so all canary patterns stay in one place. Run these rules in CI: any match means a canary value is present in code in a way that would leak it; fix before merge so your runtime canary monitoring stays meaningful.

How this fits with canaries

  • Find candidate leak points (Semgrep): Static rules for "logs request body," "sends req to Sentry," etc.
  • Confirm and attribute leaks (canaries): Runtime signal: if a canary appears in a sink, that path leaks.
  • Prevent regressions (both): Semgrep in CI blocks new leak patterns; canary monitors alert if a known path starts leaking again.

You can also use Semgrep to enforce canary usage in tests: e.g. a rule that flags test fixtures using hardcoded SSN-like or card-like strings that don't match your canary prefix, so test data stays clearly fake and monitorable.


Putting It Together: A Simple Workflow

Step 1: Create and document canaries

  • Define fake values for SSN, card, secret, password, etc., with a consistent prefix.
  • Add them to demo accounts and dev/test fixtures (DB seeds, test payloads, env vars for local/CI).
  • Record in a doc or config: canary value → where it's used (app, env, user/fixture).

Step 2: Add canaries to real flows

  • Ensure demo and dev flows that touch "sensitive" features actually use these values (e.g. demo user has canary SSN, test payment uses canary card). The goal is to exercise the same code paths as production.
  • Run automated tests that use these fixtures so CI pipelines also process canary data.

Step 3: Configure monitors and alerts

  • Datadog: Create a log monitor (or APM facet search) that triggers when your canary regex matches. Send notifications to Slack or PagerDuty.
  • Sentry: Use search or saved filters for canary strings; optionally create alerts when such events appear.
  • CI: Add a step (e.g. in GitHub Actions) that fails the job if canary values appear in logs or artifacts, or run a nightly grep over recent job logs and alert on match.
  • Warehouse: If demo/dev data is in the warehouse, run a scheduled query that checks for canary values in columns tagged as PII/sensitive and alert if found (unless you intentionally store them in a "test data" area).

Step 4: Triage and fix

  • When a canary appears, treat it as a real leak: find the code path or integration, fix it (redact, scrub, or stop logging), and verify.
  • Periodically add new canaries (e.g. for new features or new third-party integrations) and extend monitoring to new sinks.

What to Watch Out For

  • Don't use real PII or real secrets as canaries. Only use values that are explicitly fake and documented. Real data would create compliance and security risk if it leaked.
  • Avoid canary values that look like real secrets to scanners. Some tools will auto-revoke or flag "secrets" in repos or logs. Use a clear fake pattern (e.g. canary-secret-...) so automated secret scanners don't treat them as real.
  • Control who knows the full list of canaries. If the list is public, someone could intentionally inject a canary to trigger alerts. Keep the mapping (value → location) internal.
  • Don't rely on canaries alone. They catch leaks where your fake data flows. They don't replace access control, encryption, or proper handling of real PII. Use them as one layer in a broader data protection strategy.

Summary

Canary tokens and fake values (SSNs, card numbers, secrets, passwords) aren't only for detecting malicious use of stolen credentials. By planting them in demo and dev accounts and test fixtures, and then monitoring internal and third-party systems (Datadog, Sentry, CI/CD, warehouse) for those values, you get high-confidence signal about which code paths and services are leaking sensitive-looking data.

The approach is simple: plant → monitor → alert → fix. Use consistent naming and regex patterns, document where each canary lives, and wire alerts in every sink that might see the data. When a canary appears, triage and fix the root cause. This complements generic regexes and AI scanners and helps you discover real-time leaks without the noise.

Demo and dev environments are ideal because they run real application and integration code without exposing real customer data. Canaries there help you find and fix leaks before they affect production PII or secrets.