Data FAQs

SpyCloud's data ingestion and exposure modeling process produces massive volumes of identity-related data. Below are answers to common questions about trust, fidelity, duplication, and interpretation of the data.


🔁 Why do some credentials appear multiple times?

You may encounter the same email, username, or password across several records. This is expected for identities that:

  • Reuse passwords across different platforms
  • Are victims of multiple breaches or malware infections
  • Appear in both structured breaches and combolists

SpyCloud preserves these variations so you can assess consistency, exposure frequency, and risk — not just deduplicated selectors.
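As a rough illustration of why repetition is preserved, exposure frequency per selector can be measured directly from the raw records. This is a minimal sketch; the record shape and field names (`email`, `source`) are assumptions for illustration, not SpyCloud's actual schema.

```python
# Sketch: count how many sources an email appears in, without deduplicating first.
# Record fields here are hypothetical examples.
from collections import Counter

records = [
    {"email": "a@x.com", "source": "breach-1"},
    {"email": "a@x.com", "source": "malware-log-7"},
    {"email": "b@x.com", "source": "combolist-2"},
]

# One count per record: repetition itself is the signal being measured
exposure_counts = Counter(rec["email"] for rec in records)
print(exposure_counts["a@x.com"])  # appears in two sources
```

An identity with a high count here is exposed across multiple sources, which is exactly the consistency and frequency signal the preserved duplicates support.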

📦 What are combolists, and why are they included?

Combolists are credential pair dumps (email:password) typically sourced from:

  • Aggregated breach data
  • Password guessing tools
  • Cracking community uploads

Though often noisy or partly synthetic, combolists are operationally useful for identifying reused or recycled passwords and for detecting credential stuffing activity.
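The `email:password` format described above can be parsed with a few lines of code. This is a hedged sketch, not a SpyCloud API: real combolists vary in delimiter, encoding, and quality, and the sample data is invented for illustration.

```python
# Sketch: parse a combolist dump into (email, password) pairs.
def parse_combolist(lines):
    pairs = []
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip malformed entries with no delimiter
        # Split on the first ':' only, since passwords may contain ':'
        email, password = line.split(":", 1)
        if "@" in email and password:
            pairs.append((email, password))
    return pairs

sample = ["alice@example.com:hunter2", "bob@example.com:p:ss", "garbage-line"]
print(parse_combolist(sample))
```

Splitting on only the first colon matters in practice, because passwords frequently contain the delimiter character themselves.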

🔍 Why is my data in a 'fake' or unverifiable breach?

Some breaches do not have an associated public disclosure or attribution (e.g., "unverified-saas-2023"). This doesn’t make them fake — it means:

  • SpyCloud has acquired the dataset
  • It’s been processed, parsed, and linked to selectors
  • But the source company or domain hasn’t confirmed the breach

This is common for underground-sourced datasets.

🧪 Are duplicates a quality issue?

Not necessarily. Duplication in our dataset is often:

  • Intentional, to reflect multiple sources
  • Useful for validation and exposure analysis
  • Avoided where unnecessary via record-level deduplication (same source_id + same selectors = merged)

SpyCloud uses internal logic to collapse records only when duplication adds no investigative value.
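The merge rule stated above (same `source_id` + same selectors = one record) can be sketched as a simple keyed collapse. The record shape and the `count` field are assumptions for illustration; SpyCloud's internal logic is not public.

```python
# Sketch: collapse records that share a source_id and identical selectors.
# Records from different sources are deliberately kept separate.
def dedupe(records):
    merged = {}
    for rec in records:
        # Key on source_id plus a canonical, hashable form of the selectors
        key = (rec["source_id"], frozenset(rec["selectors"].items()))
        if key in merged:
            # Same source and selectors: merging adds no investigative value
            merged[key]["count"] = merged[key].get("count", 1) + 1
        else:
            merged[key] = dict(rec)
    return list(merged.values())

records = [
    {"source_id": "breach-123", "selectors": {"email": "a@x.com"}},
    {"source_id": "breach-123", "selectors": {"email": "a@x.com"}},
    {"source_id": "combolist-9", "selectors": {"email": "a@x.com"}},
]
result = dedupe(records)
print(len(result))  # two distinct (source, selector) combinations remain
```

Note that the same email surviving under two different `source_id` values is the intended behavior: cross-source repetition carries investigative value, as described above.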

🚫 Why does SpyCloud include low-fidelity or noisy sources?

While some sources (like combolists or scraped forums) may seem low-quality, they serve use cases such as:

  • Password hygiene monitoring
  • Detection of reused credentials across multiple actors
  • Understanding threat actor tooling and common lists used in attacks

Customers can filter by severity, data_type, or breach_category to fine-tune what they ingest or act on.
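A client-side filter over the fields named above might look like the following sketch. The severity scale, threshold, and category values are assumptions for illustration and do not reflect a documented SpyCloud schema.

```python
# Sketch: filter ingested records by severity, data_type, and breach_category.
def filter_records(records, min_severity=0, data_types=None, categories=None):
    out = []
    for rec in records:
        if rec.get("severity", 0) < min_severity:
            continue  # below the severity floor
        if data_types and rec.get("data_type") not in data_types:
            continue  # not a data type of interest
        if categories and rec.get("breach_category") not in categories:
            continue  # not a source category of interest
        out.append(rec)
    return out

records = [
    {"severity": 25, "data_type": "password", "breach_category": "combolist"},
    {"severity": 5, "data_type": "email", "breach_category": "breach"},
]
high = filter_records(records, min_severity=20)
print(len(high))  # only the higher-severity record passes
```

Filtering at ingest like this lets noisy sources stay in the feed for the use cases above while keeping alerting pipelines focused on higher-severity exposures.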

🔐 How does SpyCloud validate breach data?

SpyCloud employs:

  • LLM-based parsing to flag structured data patterns
  • Human analyst review for sensitive, ambiguous, or high-impact breaches
  • Cross-matching with known datasets to eliminate fakes and padded lists
  • Selector pivoting to validate record depth and context

👤 Why are some records tagged 'sensitive' or 'restricted'?

These tags are applied to datasets that include:

  • Government domains
  • Politically exposed organizations
  • HUMINT-derived content
  • Law enforcement subject matter

While still searchable, these records may have extra restrictions on visibility, export, or licensing.



🧠 Analyst Tip

💡 Trust the repetition.

If an identity shows up across multiple sources, it's not a glitch — it's a signal. Identity reuse across malware, breach, and combolist sources is a leading indicator of real-world compromise or abuse.