Data FAQs
SpyCloud's data ingestion and exposure modeling process produces massive volumes of identity-related data. Below are answers to common questions about trust, fidelity, duplication, and interpretation of the data.
🔁 Why do some credentials appear multiple times?
You may encounter the same email, username, or password across several records. This is expected for identities that:
- Reuse passwords across different platforms
- Are victims of multiple breaches or malware infections
- Appear in both structured breaches and combolists
SpyCloud preserves these variations so you can assess consistency, exposure frequency, and risk — not just deduplicated selectors.
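One way to put this duplication to work is to count how often each selector appears across sources. The record layout below is a simplified illustration, not the actual SpyCloud schema:

```python
from collections import Counter

# Hypothetical records; real SpyCloud field names may differ.
records = [
    {"email": "jane@example.com", "source": "breach-a"},
    {"email": "jane@example.com", "source": "combolist-x"},
    {"email": "bob@example.com", "source": "breach-a"},
]

# Exposure frequency per selector: higher counts suggest
# broader exposure, not a data-quality problem.
frequency = Counter(r["email"] for r in records)
print(frequency["jane@example.com"])  # 2
```

An identity with a high count across distinct source types is exactly the "signal, not glitch" case described in the analyst tip below.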
📦 What are combolists, and why are they included?
Combolists are credential pair dumps (email:password) typically sourced from:
- Aggregated breach data
- Password guessing tools
- Cracking community uploads
Though often noisy or partly synthetic, combolists are operationally useful for identifying reused or recycled passwords and for detecting credential stuffing activity.
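Because combolists are plain `email:password` pairs, handling them usually starts with a tolerant line parser. A minimal sketch (the function name and malformed-line handling are illustrative choices, not part of any SpyCloud tooling):

```python
def parse_combolist_line(line: str):
    """Split a combolist entry of the form email:password.

    Passwords may themselves contain ':', so split only on the
    first separator. Returns None for malformed lines.
    """
    email, sep, password = line.strip().partition(":")
    if not sep or "@" not in email:
        return None
    return email, password

print(parse_combolist_line("jane@example.com:hunter2"))  # ('jane@example.com', 'hunter2')
print(parse_combolist_line("not-an-email"))              # None
```

Splitting on only the first `:` matters in practice, since passwords containing colons are common in these dumps.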
🔍 Why is my data in a 'fake' or unverifiable breach?
Some breaches do not have an associated public disclosure or attribution (e.g., "unverified-saas-2023"). This doesn’t make them fake — it means:
- SpyCloud has acquired the dataset
- It’s been processed, parsed, and linked to selectors
- But the source company or domain hasn’t confirmed the breach
This is common for underground-sourced datasets.
🧪 Are duplicates a quality issue?
Not necessarily. Duplication in our dataset is often:
- Intentional, to reflect multiple sources
- Useful for validation and exposure analysis
- Avoided where unnecessary via record-level deduplication (same source_id + same selectors = merged)
SpyCloud uses internal logic to collapse records only when duplication adds no investigative value.
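The merge rule above (same `source_id` + same selectors = one record) can be sketched as a simple keyed collapse. Field names here are illustrative, not the internal implementation:

```python
def dedupe(records):
    """Collapse records sharing the same source_id and selectors.

    Records from different sources (different source_id) are kept,
    because cross-source duplication carries investigative value.
    """
    seen = {}
    for rec in records:
        key = (rec["source_id"], frozenset(rec["selectors"].items()))
        seen.setdefault(key, rec)  # keep the first occurrence per key
    return list(seen.values())

records = [
    {"source_id": "breach-a", "selectors": {"email": "jane@example.com"}},
    {"source_id": "breach-a", "selectors": {"email": "jane@example.com"}},
    {"source_id": "combolist-x", "selectors": {"email": "jane@example.com"}},
]
print(len(dedupe(records)))  # 2 — same-source duplicate merged, cross-source kept
```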
🚫 Why does SpyCloud include low-fidelity or noisy sources?
While some sources (like combolists or scraped forums) may seem low-quality, they serve use cases such as:
- Password hygiene monitoring
- Detection of reused credentials across multiple actors
- Understanding threat actor tooling and common lists used in attacks
Customers can filter by severity, data_type, or breach_category to fine-tune what they ingest or act on.
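Such a filter could look like the sketch below. The severity scale, field names, and category values are assumptions for illustration, not the documented SpyCloud API:

```python
# Illustrative client-side filter; thresholds and field names
# are assumptions, not SpyCloud's actual schema.
def keep(record, min_severity=20, exclude_categories=("combolist",)):
    return (record["severity"] >= min_severity
            and record["breach_category"] not in exclude_categories)

records = [
    {"severity": 25, "data_type": "password", "breach_category": "breach"},
    {"severity": 5, "data_type": "password", "breach_category": "combolist"},
]
filtered = [r for r in records if keep(r)]
print(len(filtered))  # 1 — the low-severity combolist record is dropped
```

Tuning `min_severity` and the excluded categories lets a team keep noisy sources available for hunting while alerting only on high-fidelity records.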
🔐 How does SpyCloud validate breach data?
SpyCloud employs:
- LLM-based parsing to flag structured data patterns
- Human analyst review for sensitive, ambiguous, or high-impact breaches
- Cross-matching with known datasets to eliminate fakes and padded lists
- Selector pivoting to validate record depth and context
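The cross-matching step can be illustrated with a simple overlap ratio: if nearly every credential pair in a "new" dataset already exists in known corpora, it is likely repackaged or padded rather than fresh. This is a conceptual sketch, not SpyCloud's actual validation logic:

```python
def overlap_ratio(new_dataset, known_corpus):
    """Fraction of a candidate dataset's credential pairs already
    present in a known corpus. A very high ratio suggests a
    repackaged or padded list rather than a genuine new breach.
    """
    new_pairs = set(new_dataset)
    if not new_pairs:
        return 0.0
    return len(new_pairs & set(known_corpus)) / len(new_pairs)

known = [("a@x.com", "pw1"), ("b@x.com", "pw2")]
candidate = [("a@x.com", "pw1"), ("b@x.com", "pw2"), ("c@x.com", "pw3")]
print(round(overlap_ratio(candidate, known), 2))  # 0.67
```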
👤 Why are some records tagged 'sensitive' or 'restricted'?
These tags are applied to datasets that include:
- Government domains
- Politically exposed organizations
- HUMINT-derived content
- Law enforcement subject matter
While still searchable, these records may have extra restrictions on visibility, export, or licensing.
🧠 Analyst Tip
If an identity shows up across multiple sources, it's not a glitch — it's a signal. Identity reuse across malware, breach, and combolist sources is a leading indicator of real-world compromise or abuse.