Data Collection, Parsing & Ingestion

🔄 End-to-End Lifecycle

SpyCloud ingests data from a wide range of breach, malware, and combolist sources. The following outlines how that data enters the platform, is parsed and classified, and becomes available in your investigations.


Example: From collection to queryable intelligence

📥 Step 1: Data Collection

SpyCloud collects breach, malware, and combolist data via multiple mechanisms:

  • Monitoring dark web, deep web, Telegram, and invite-only forums
  • Analyst intelligence, HUMINT, and threat actor engagements
  • Infostealer logs from over 70 malware families
  • Paste sites, open leaks, and public combolist aggregators

🔍 Step 2: Parsing & Normalization

After acquisition, datasets are parsed through automated and human-in-the-loop workflows.

Key steps include:

  • Extracting usernames, emails, passwords, IPs, cookies, device IDs
  • Normalizing field values and standardizing formats
  • Deduplicating across time, source, and record
  • Cleaning up irrelevant or malformed data

SpyCloud’s LLM-powered parsing helps classify and validate attributes across thousands of inconsistent formats.
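
As a rough illustration of this step, the sketch below extracts, normalizes, and deduplicates selectors from raw combolist-style lines. The regexes, field names, and record layout are assumptions for illustration only, not SpyCloud's actual parsers.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns for a combolist-style line such as
# "email:password|ip" — real sources vary far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def parse_raw_line(raw: str) -> dict:
    """Extract selectors from one raw record line."""
    email = EMAIL_RE.search(raw)
    ip = IP_RE.search(raw)
    password = None
    if email:
        tail = raw[email.end():]
        if tail.startswith(":"):                      # assumed email:password convention
            password = tail[1:].split("|")[0].strip() or None
    return {
        "email": email.group(0).lower() if email else None,   # normalize case
        "password": password,
        "ip": ip.group(0) if ip else None,
        "parsed_at": datetime.now(timezone.utc).isoformat(),
    }

def deduplicate(records: list[dict]) -> list[dict]:
    """Drop malformed rows and exact duplicates on (email, password, ip)."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["email"], rec["password"], rec["ip"])
        if rec["email"] and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

raw_lines = [
    "alice@example.com:hunter2|203.0.113.7",
    "ALICE@example.com:hunter2|203.0.113.7",   # duplicate after normalization
    "not a record at all",                     # malformed, dropped
]
print(deduplicate([parse_raw_line(line) for line in raw_lines]))
```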


🧪 Step 3: Classification

After parsing:

  • Datasets are labeled by type (e.g., breach, malware, combolist)
  • Source confidence and breach category are assigned
  • Records are assigned a severity score (2, 5, 20, 25, or 26) based on the contents of the record data
  • Metadata is enriched with timestamps, selectors, and behavioral traits

Both automated tagging and analyst review play a role in this step.
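
To make the tagging step concrete, here is a minimal sketch of how a parsed record might be labeled. Only the type labels and the 2/5/20/25/26 severity values come from this section; the mapping between record contents and severity is an assumption for illustration.

```python
# Illustrative only: the severity-to-contents mapping below is an assumption,
# not SpyCloud's published scoring criteria.
HYPOTHETICAL_SEVERITY = {
    "email_only": 2,
    "hashed_password": 5,
    "plaintext_password": 20,
    "malware_record": 25,
    "malware_with_cookies": 26,
}

def classify(record: dict, source_type: str) -> dict:
    """Attach a dataset type, severity, and review flag to a parsed record."""
    if source_type == "malware":
        key = "malware_with_cookies" if record.get("cookies") else "malware_record"
    elif record.get("password"):
        key = "plaintext_password"
    elif record.get("password_hash"):
        key = "hashed_password"
    else:
        key = "email_only"
    return {
        **record,
        "type": source_type,                     # breach | malware | combolist
        "severity": HYPOTHETICAL_SEVERITY[key],  # assumed mapping (see note above)
        "needs_analyst_review": source_type == "malware",
    }

print(classify({"email": "alice@example.com", "password": "hunter2"},
               source_type="combolist"))
```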


🧬 Step 4: Ingestion into the SpyCloud Data Lake

Once classified, records are:

  • Ingested into the identity data platform
  • Indexed for access across:
    • Investigations Module
    • API & IDLink
    • Compass risk alerts
    • Graph view + AI Insights
  • Mapped to selectors (email, phone, device ID, etc.)
  • Deduplicated globally for a consistent investigation experience
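
As a mental model for this step, the sketch below maps a classified record to its selectors and deduplicates it on a stable key before indexing. The in-memory index and SHA-256 record key are illustrative choices, not SpyCloud's data lake implementation.

```python
import hashlib
import json

# Hypothetical selector set; the real platform supports more selector types.
SELECTOR_FIELDS = ("email", "username", "phone", "ip", "device_id")

def record_key(record: dict) -> str:
    """Stable key over selector values, used for global deduplication."""
    canonical = json.dumps({f: record.get(f) for f in SELECTOR_FIELDS},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def ingest(record: dict, index: dict) -> bool:
    """Index the record under each selector it contains; skip global duplicates."""
    key = record_key(record)
    keys = index.setdefault("_keys", set())
    if key in keys:
        return False                              # already ingested elsewhere
    keys.add(key)
    for field in SELECTOR_FIELDS:
        value = record.get(field)
        if value:
            # One lookup table per selector type, so a search by email,
            # phone, or device ID reaches the same underlying record.
            index.setdefault(field, {}).setdefault(value, []).append(record)
    return True

index: dict = {}
rec = {"email": "alice@example.com", "device_id": "D-42", "severity": 25}
print(ingest(rec, index))    # True  -> newly indexed
print(ingest(rec, index))    # False -> deduplicated on re-ingestion
print(index["email"]["alice@example.com"][0]["device_id"])
```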

🧪 Example Flow: Malware Log

Scenario: A Redline infostealer log is posted on a Telegram leak channel.

SpyCloud's ingestion flow:

  1. A HUMINT source or automated scraper flags the post
  2. Log is parsed into structured selectors:
    • Emails, usernames, passwords
    • Cookies, tokens, autofill, IPs
  3. Severity score assigned (25)
  4. Source labeled as malware, tagged as sensitive
  5. Log becomes searchable across SpyCloud products
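
For intuition, the record that emerges from this flow is roughly shaped like the structure below. All concrete values are invented; only the malware label, the sensitive tag, and the severity of 25 come from the steps above.

```python
# Illustrative shape of the final record, not an actual SpyCloud schema.
redline_record = {
    "source_type": "malware",
    "sensitive": True,
    "severity": 25,
    "selectors": {
        "email": "victim@example.com",       # hypothetical victim data
        "username": "victim01",
        "ip": "198.51.100.23",
        "device_id": "WIN-EXAMPLE-01",
    },
    "artifacts": {
        "passwords": ["hunter2"],
        "cookies": ["session=abc123"],
        "autofill": {"full_name": "V. Example"},
    },
    "collected_from": "telegram_leak_channel",
}
print(redline_record["severity"], redline_record["source_type"])
```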