Data Types & Breach Categories

Data Types at a Glance

SpyCloud classifies each dataset into one of three main data types:

🧯

BREACH

Data taken from a known organization or domain, including exfiltrated employee or customer information – often structured, curated, and cleaned.

🐛

MALWARE

Data harvested from infostealer-infected machines, including credentials, browser fingerprints, session cookies, and web behavior logs.

📦

COMBOLIST

Credential pair lists (email:password or username:password) often leaked or traded on forums, not tied to a single known breach source.


🧠 Breach Categories

SpyCloud further classifies breach datasets into sub-categories using two fields:

  • breach_main_category
  • breach_category

These categories provide added context for the type of exposure, behavior, or source.
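As a rough illustration of how the two fields might be used together, the sketch below tallies records by their (breach_main_category, breach_category) pair. Only the two field names come from the text above. The record shape, the sample values, and the helper name count_by_category are assumptions made for illustration and are not confirmed SpyCloud output.

```python
from collections import Counter

# Hypothetical records: only the two field names come from the documentation.
# The values below are illustrative placeholders, not confirmed SpyCloud values.
records = [
    {"breach_main_category": "breach", "breach_category": "exfiltrated"},
    {"breach_main_category": "breach", "breach_category": "phished"},
    {"breach_main_category": "malware", "breach_category": "malware"},
    {"breach_main_category": "breach", "breach_category": "exfiltrated"},
]


def count_by_category(records):
    """Count records per (breach_main_category, breach_category) pair.

    Records missing either field are counted under "unknown".
    """
    counts = Counter()
    for record in records:
        key = (
            record.get("breach_main_category", "unknown"),
            record.get("breach_category", "unknown"),
        )
        counts[key] += 1
    return counts


category_counts = count_by_category(records)
for (main, sub), n in sorted(category_counts.items()):
    print(f"{main}/{sub}: {n}")
```

Grouping on both fields keeps the broad type and the sub-category visible side by side, which can help when deciding how much weight to give a particular exposure.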


🗃️ Combolist

Credential pairs (username:password or email:password) found on paste sites, on forums, or in shared combolists, often without attribution to a specific breach. Many contain credentials recycled from older exposures or breached datasets.
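The pair format described above (username:password or email:password) can be split with a few lines of code. This is a minimal sketch, assuming one pair per line and a colon as the only delimiter. Real combolists vary in delimiter and quality. The names CredentialPair and parse_combo_line are hypothetical and not part of any SpyCloud tooling.

```python
from typing import NamedTuple, Optional


class CredentialPair(NamedTuple):
    identifier: str  # email address or username
    password: str
    is_email: bool   # crude check: True when the identifier contains "@"


def parse_combo_line(line: str) -> Optional[CredentialPair]:
    """Parse one 'identifier:password' line; return None if it is malformed.

    Assumption: the first colon separates the identifier from the password,
    so any later colons are kept as part of the password.
    """
    # Strip only line endings so spaces inside a password are preserved.
    line = line.rstrip("\r\n")
    identifier, sep, password = line.partition(":")
    if not sep or not identifier or not password:
        return None
    return CredentialPair(identifier, password, "@" in identifier)


sample_lines = [
    "alice@example.com:pa:ss\n",  # password itself contains a colon
    "bob:hunter2\n",
    "no-delimiter-here\n",        # malformed: no colon
]
parsed = [parse_combo_line(raw) for raw in sample_lines]
```

Splitting on the first colon only is a deliberate choice here: passwords may contain colons, while email addresses normally do not.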

🔓 Exfiltrated

Data exfiltrated by threat actors from an identifiable organization, often including customer records, user databases, or internal employee information. These datasets typically have a known breach name and metadata.

🌍 Exposed

Publicly accessible or misconfigured data stores (e.g., open S3 buckets, FTP servers). The data is exposed unintentionally but can contain sensitive credentials or personal data.

🎣 Phished

Credentials harvested through phishing campaigns, kits, or spoofed login portals. Often limited in structure but high-fidelity in intent.

🧹 Scraped

Usernames, emails, and account metadata scraped from public websites (e.g., social media, forums) — not stolen directly, but aggregated at scale.

🪟 Malware

Data captured from infected machines, including:

  • Login credentials
  • Cookies
  • Autofill data
  • Browsing history
  • Wallet credentials
  • Device fingerprints

Collected using infostealer malware such as RedLine, Raccoon, and Vidar. This is the richest and most behaviorally complete dataset type in the SpyCloud corpus.