Data FAQs
SpyCloud's data ingestion and exposure modeling process produces massive volumes of identity-related data. Below are answers to common questions about trust, fidelity, duplication, and interpretation of the data.
Why do some credentials appear multiple times?
You may encounter the same email, username, or password across several records. This is expected for identities that:
- Reuse passwords across different platforms
- Are victims of multiple breaches or malware infections
- Appear in both structured breaches and combolists
SpyCloud preserves these variations so you can assess consistency, exposure frequency, and risk, not just deduplicated selectors.
What are combolists, and why are they included?
Combolists are credential pair dumps (email:password) typically sourced from:
- Aggregated breach data
- Password guessing tools
- Cracking community uploads
Though often noisy or synthetic, they are operationally useful in identifying reused or recycled passwords and credential stuffing activity.
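As a minimal sketch of how a combolist entry can be handled, the helper below splits an `email:password` line on the first separator only, since passwords may themselves contain `:`. The function name and validation rules are illustrative assumptions, not SpyCloud's actual parser:

```python
def parse_combolist_line(line: str):
    """Split a combolist entry of the form email:password.

    Splits only on the first ':' so passwords containing the
    separator survive intact. Returns None for malformed lines.
    """
    line = line.strip()
    if ":" not in line:
        return None
    email, _, password = line.partition(":")
    if "@" not in email or not password:
        return None
    return {"email": email.lower(), "password": password}

print(parse_combolist_line("Alice@example.com:hunter2:extra"))
# keeps "hunter2:extra" whole as the password
```

Splitting on the first separator rather than all of them is what makes noisy combolist lines recoverable at all; anything without an email-shaped left side is dropped as unparseable.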
Why is my data in a 'fake' or unverifiable breach?
Some breaches do not have an associated public disclosure or attribution (e.g., "unverified-saas-2023"). This doesn't make them fake; it means:
- SpyCloud has acquired the dataset
- It's been processed, parsed, and linked to selectors
- But the source company or domain hasn't confirmed the breach
This is common for underground-sourced datasets.
Are duplicates a quality issue?
Not necessarily. Duplication in our dataset is often:
- Intentional, to reflect multiple sources
- Useful for validation and exposure analysis
- Avoided where unnecessary via record-level deduplication (same source_id + same selectors = merged)
SpyCloud uses internal logic to collapse records only when duplication adds no investigative value.
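The record-level rule above (same `source_id` plus the same selectors are merged into one record) can be sketched roughly as follows. The `source_id` and selector field names come from the FAQ; everything else is an illustrative assumption, not SpyCloud's internal logic:

```python
from collections import OrderedDict

def dedupe_records(records):
    """Collapse records sharing source_id and identical selectors.

    Records from different sources, or with differing selectors,
    are kept: that duplication carries investigative value.
    """
    merged = OrderedDict()
    for rec in records:
        # Sort selector pairs so key equality ignores dict ordering.
        key = (rec["source_id"], tuple(sorted(rec["selectors"].items())))
        merged.setdefault(key, rec)  # first occurrence wins
    return list(merged.values())

records = [
    {"source_id": "breach-001", "selectors": {"email": "a@b.com"}},
    {"source_id": "breach-001", "selectors": {"email": "a@b.com"}},  # merged away
    {"source_id": "combolist-9", "selectors": {"email": "a@b.com"}},  # kept
]
print(len(dedupe_records(records)))  # 2
```

Note that the third record survives even though its selector matches the first two: only exact `source_id` + selector duplicates collapse, which mirrors the "duplication adds no investigative value" rule.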
Why does SpyCloud include low-fidelity or noisy sources?
While some sources (like combolists or scraped forums) may seem low-quality, they serve use cases such as:
- Password hygiene monitoring
- Detection of reused credentials across multiple actors
- Understanding threat actor tooling and common lists used in attacks
Customers can filter by severity, data_type, or breach_category to fine-tune what they ingest or act on.
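A minimal sketch of that client-side filtering might look like the following. The `severity`, `data_type`, and `breach_category` field names are taken from the FAQ; the numeric severity scale and the example values are assumptions, not an official schema:

```python
def filter_records(records, min_severity=None, data_types=None,
                   breach_categories=None):
    """Keep only records matching the given criteria.

    Any criterion left as None/empty is skipped, so callers can
    fine-tune exactly which dimensions they act on.
    """
    out = []
    for rec in records:
        if min_severity is not None and rec.get("severity", 0) < min_severity:
            continue
        if data_types and rec.get("data_type") not in data_types:
            continue
        if breach_categories and rec.get("breach_category") not in breach_categories:
            continue
        out.append(rec)
    return out

sample = [
    {"severity": 25, "data_type": "password", "breach_category": "combolist"},
    {"severity": 20, "data_type": "password", "breach_category": "breach"},
]
print(len(filter_records(sample, min_severity=22)))  # 1
```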
How does SpyCloud validate breach data?
SpyCloud employs:
- LLM-based parsing to flag structured data patterns
- Human analyst review for sensitive, ambiguous, or high-impact breaches
- Cross-matching with known datasets to eliminate fakes and padded lists
- Selector pivoting to validate record depth and context
Why are some records tagged 'sensitive' or 'restricted'?
These tags are applied to datasets that include:
- Government domains
- Politically exposed organizations
- HUMINT-derived content
- Law enforcement subject matter
While still searchable, these records may have extra restrictions on visibility, export, or licensing.
Analyst Tip
If an identity shows up across multiple sources, it's not a glitch; it's a signal. Identity reuse across malware, breach, and combolist sources is a leading indicator of real-world compromise or abuse.
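As a rough sketch of that signal, one could count the distinct source categories each identity appears in and surface those crossing a threshold. The `email` and `source_category` fields and the category names are illustrative assumptions:

```python
from collections import defaultdict

def cross_source_identities(records, threshold=2):
    """Return identities seen in at least `threshold` distinct
    source categories (e.g. malware, breach, combolist)."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec["email"]].add(rec["source_category"])
    return {email for email, cats in seen.items() if len(cats) >= threshold}

records = [
    {"email": "a@b.com", "source_category": "malware"},
    {"email": "a@b.com", "source_category": "combolist"},
    {"email": "c@d.com", "source_category": "breach"},
]
print(cross_source_identities(records))  # {'a@b.com'}
```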