Data FAQs

SpyCloud's data ingestion and exposure modeling process produces massive volumes of identity-related data. Below are answers to common questions about trust, fidelity, duplication, and interpretation of the data.


🔁 Why do some credentials appear multiple times?

You may encounter the same email, username, or password across several records. This is expected for identities that:

  • Reuse passwords across different platforms
  • Are victims of multiple breaches or malware infections
  • Appear in both structured breaches and combolists

SpyCloud preserves these variations so you can assess consistency, exposure frequency, and risk — not just deduplicated selectors.
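As a rough illustration of why repetition is preserved, exposure frequency per selector can be measured directly from the raw records. This is a minimal sketch; the record shape and field names (`email`, `source`) are assumptions for illustration, not SpyCloud's actual schema.

```python
# Sketch: count how many sources an email appears in, without deduplicating first.
# Record fields here are hypothetical examples.
from collections import Counter

records = [
    {"email": "a@x.com", "source": "breach-1"},
    {"email": "a@x.com", "source": "malware-log-7"},
    {"email": "b@x.com", "source": "combolist-2"},
]

# One count per record: repetition itself is the signal being measured
exposure_counts = Counter(rec["email"] for rec in records)
print(exposure_counts["a@x.com"])  # appears in two sources
```

An identity with a high count here is exposed across multiple sources, which is exactly the consistency and frequency signal the preserved duplicates support.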

📦 What are combolists, and why are they included?

Combolists are credential pair dumps (email:password) typically sourced from:

  • Aggregated breach data
  • Password guessing tools
  • Cracking community uploads

Though often noisy or partly synthetic, combolists are operationally useful for identifying reused or recycled passwords and for detecting credential stuffing activity.
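The `email:password` format described above can be parsed with a few lines of code. This is a hedged sketch, not a SpyCloud API: real combolists vary in delimiter, encoding, and quality, and the sample data is invented for illustration.

```python
# Sketch: parse a combolist dump into (email, password) pairs.
def parse_combolist(lines):
    pairs = []
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip malformed entries with no delimiter
        # Split on the first ':' only, since passwords may contain ':'
        email, password = line.split(":", 1)
        if "@" in email and password:
            pairs.append((email, password))
    return pairs

sample = ["alice@example.com:hunter2", "bob@example.com:p:ss", "garbage-line"]
print(parse_combolist(sample))
```

Splitting on only the first colon matters in practice, because passwords frequently contain the delimiter character themselves.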

🔍 Why is my data in a 'fake' or unverifiable breach?

Some breaches do not have an associated public disclosure or attribution (e.g., "unverified-saas-2023"). This doesn’t make them fake — it means:

  • SpyCloud has acquired the dataset
  • It’s been processed, parsed, and linked to selectors
  • But the source company or domain hasn’t confirmed the breach

This is common for underground-sourced datasets.

🧪 Are duplicates a quality issue?

Not necessarily. Duplication in our dataset is often:

  • Intentional, to reflect multiple sources
  • Useful for validation and exposure analysis
  • Avoided where unnecessary via record-level deduplication (same source_id + same selectors = merged)

SpyCloud uses internal logic to collapse records only when duplication adds no investigative value.
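The merge rule stated above (same `source_id` + same selectors = one record) can be sketched as a simple keyed collapse. The record shape and the `count` field are assumptions for illustration; SpyCloud's internal logic is not public.

```python
# Sketch: collapse records that share a source_id and identical selectors.
# Records from different sources are deliberately kept separate.
def dedupe(records):
    merged = {}
    for rec in records:
        # Key on source_id plus a canonical, hashable form of the selectors
        key = (rec["source_id"], frozenset(rec["selectors"].items()))
        if key in merged:
            # Same source and selectors: merging adds no investigative value
            merged[key]["count"] = merged[key].get("count", 1) + 1
        else:
            merged[key] = dict(rec)
    return list(merged.values())

records = [
    {"source_id": "breach-123", "selectors": {"email": "a@x.com"}},
    {"source_id": "breach-123", "selectors": {"email": "a@x.com"}},
    {"source_id": "combolist-9", "selectors": {"email": "a@x.com"}},
]
result = dedupe(records)
print(len(result))  # two distinct (source, selector) combinations remain
```

Note that the same email surviving under two different `source_id` values is the intended behavior: cross-source repetition carries investigative value, as described above.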

🚫 Why does SpyCloud include low-fidelity or noisy sources?

While some sources (like combolists or scraped forums) may seem low-quality, they serve use cases such as:

  • Password hygiene monitoring
  • Detection of reused credentials across multiple actors
  • Understanding threat actor tooling and common lists used in attacks

Customers can filter by severity, data_type, or breach_category to fine-tune what they ingest or act on.
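A client-side filter over the fields named above might look like the following sketch. The severity scale, threshold, and category values are assumptions for illustration and do not reflect a documented SpyCloud schema.

```python
# Sketch: filter ingested records by severity, data_type, and breach_category.
def filter_records(records, min_severity=0, data_types=None, categories=None):
    out = []
    for rec in records:
        if rec.get("severity", 0) < min_severity:
            continue  # below the severity floor
        if data_types and rec.get("data_type") not in data_types:
            continue  # not a data type of interest
        if categories and rec.get("breach_category") not in categories:
            continue  # not a source category of interest
        out.append(rec)
    return out

records = [
    {"severity": 25, "data_type": "password", "breach_category": "combolist"},
    {"severity": 5, "data_type": "email", "breach_category": "breach"},
]
high = filter_records(records, min_severity=20)
print(len(high))  # only the higher-severity record passes
```

Filtering at ingest like this lets noisy sources stay in the feed for the use cases above while keeping alerting pipelines focused on higher-severity exposures.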

🔐 How does SpyCloud validate breach data?

SpyCloud employs:

  • LLM-based parsing to flag structured data patterns
  • Human analyst review for sensitive, ambiguous, or high-impact breaches
  • Cross-matching with known datasets to eliminate fakes and padded lists
  • Selector pivoting to validate record depth and context

👤 Why are some records tagged 'sensitive' or 'restricted'?

These tags are applied to datasets that include:

  • Government domains
  • Politically exposed organizations
  • HUMINT-derived content
  • Law enforcement subject matter

While still searchable, these records may have extra restrictions on visibility, export, or licensing.



🧠 Analyst Tip

💡 Trust the repetition.

If an identity shows up across multiple sources, it's not a glitch — it's a signal. Identity reuse across malware, breach, and combolist sources is a leading indicator of real-world compromise or abuse.