Data Collection, Parsing & Ingestion

🔄 End-to-End Lifecycle

SpyCloud ingests data from a wide range of breach, malware, and combolist sources. The following outlines how that data enters the platform, is parsed and classified, and becomes available in your investigations.


Example: From collection to queryable intelligence

📥 Step 1: Data Collection

SpyCloud collects breach, malware, and combolist data via multiple mechanisms:

  • Monitoring dark web, deep web, Telegram, and invite-only forums
  • Analyst intelligence, HUMINT, and threat actor engagements
  • Infostealer logs from over 70 malware families
  • Paste sites, open leaks, and public combolist aggregators

🔍 Step 2: Parsing & Normalization

After acquisition, datasets are parsed through automated and human-in-the-loop workflows.

Key steps include:

  • Extracting usernames, emails, passwords, IPs, cookies, device IDs
  • Normalizing field values and standardizing formats
  • Deduplicating across time, source, and record
  • Cleaning up irrelevant or malformed data

SpyCloud’s LLM-powered parsing helps classify and validate attributes across thousands of inconsistent formats.
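
As a rough illustration of this step, the sketch below extracts, normalizes, and deduplicates selectors from raw combolist-style lines. The regexes, field names, and record layout are assumptions for illustration only, not SpyCloud's actual parsers.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns for a combolist-style line such as
# "email:password|ip" — real sources vary far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def parse_raw_line(raw: str) -> dict:
    """Extract selectors from one raw record line."""
    email = EMAIL_RE.search(raw)
    ip = IP_RE.search(raw)
    password = None
    if email:
        tail = raw[email.end():]
        if tail.startswith(":"):                      # assumed email:password convention
            password = tail[1:].split("|")[0].strip() or None
    return {
        "email": email.group(0).lower() if email else None,   # normalize case
        "password": password,
        "ip": ip.group(0) if ip else None,
        "parsed_at": datetime.now(timezone.utc).isoformat(),
    }

def deduplicate(records: list[dict]) -> list[dict]:
    """Drop malformed rows and exact duplicates on (email, password, ip)."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["email"], rec["password"], rec["ip"])
        if rec["email"] and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

raw_lines = [
    "alice@example.com:hunter2|203.0.113.7",
    "ALICE@example.com:hunter2|203.0.113.7",   # duplicate after normalization
    "not a record at all",                     # malformed, dropped
]
print(deduplicate([parse_raw_line(line) for line in raw_lines]))
```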


🧪 Step 3: Classification

After parsing:

  • Datasets are labeled by type (e.g., breach, malware, combolist)
  • Source confidence and breach category are assigned
  • Records are assigned a severity score (2, 5, 20, 25, or 26) based on the contents of the record data
  • Metadata is enriched with timestamps, selectors, and behavioral traits

Both automated tagging and analyst review play a role in this step.
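
To make the tagging step concrete, here is a minimal sketch of how a parsed record might be labeled. Only the type labels and the 2/5/20/25/26 severity values come from this section; the mapping between record contents and severity is an assumption for illustration.

```python
# Illustrative only: the severity-to-contents mapping below is an assumption,
# not SpyCloud's published scoring criteria.
HYPOTHETICAL_SEVERITY = {
    "email_only": 2,
    "hashed_password": 5,
    "plaintext_password": 20,
    "malware_record": 25,
    "malware_with_cookies": 26,
}

def classify(record: dict, source_type: str) -> dict:
    """Attach a dataset type, severity, and review flag to a parsed record."""
    if source_type == "malware":
        key = "malware_with_cookies" if record.get("cookies") else "malware_record"
    elif record.get("password"):
        key = "plaintext_password"
    elif record.get("password_hash"):
        key = "hashed_password"
    else:
        key = "email_only"
    return {
        **record,
        "type": source_type,                     # breach | malware | combolist
        "severity": HYPOTHETICAL_SEVERITY[key],  # assumed mapping (see note above)
        "needs_analyst_review": source_type == "malware",
    }

print(classify({"email": "alice@example.com", "password": "hunter2"},
               source_type="combolist"))
```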


🧬 Step 4: Ingestion into the SpyCloud Data Lake

Once classified, records are:

  • Ingested into the identity data platform
  • Indexed for access across:
    • Investigations Module
    • API & IDLink
    • Compass risk alerts
    • Graph view + AI Insights
  • Mapped to selectors (email, phone, device ID, etc.)
  • Deduplicated globally for a consistent investigation experience
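
As a mental model for this step, the sketch below maps a classified record to its selectors and deduplicates it on a stable key before indexing. The in-memory index and SHA-256 record key are illustrative choices, not SpyCloud's data lake implementation.

```python
import hashlib
import json

# Hypothetical selector set; the real platform supports more selector types.
SELECTOR_FIELDS = ("email", "username", "phone", "ip", "device_id")

def record_key(record: dict) -> str:
    """Stable key over selector values, used for global deduplication."""
    canonical = json.dumps({f: record.get(f) for f in SELECTOR_FIELDS},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def ingest(record: dict, index: dict) -> bool:
    """Index the record under each selector it contains; skip global duplicates."""
    key = record_key(record)
    keys = index.setdefault("_keys", set())
    if key in keys:
        return False                              # already ingested elsewhere
    keys.add(key)
    for field in SELECTOR_FIELDS:
        value = record.get(field)
        if value:
            # One lookup table per selector type, so a search by email,
            # phone, or device ID reaches the same underlying record.
            index.setdefault(field, {}).setdefault(value, []).append(record)
    return True

index: dict = {}
rec = {"email": "alice@example.com", "device_id": "D-42", "severity": 25}
print(ingest(rec, index))    # True  -> newly indexed
print(ingest(rec, index))    # False -> deduplicated on re-ingestion
print(index["email"]["alice@example.com"][0]["device_id"])
```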

🧪 Example Flow: Malware Log

Scenario: A Redline infostealer log is posted on a Telegram leak channel.

SpyCloud's ingestion flow:

  1. A HUMINT source or automated scraper flags the post
  2. Log is parsed into structured selectors:
    • Emails, usernames, passwords
    • Cookies, tokens, autofill, IPs
  3. Severity score assigned (25)
  4. Source labeled as malware, tagged as sensitive
  5. Log becomes searchable across SpyCloud products
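
For intuition, the record that emerges from this flow is roughly shaped like the structure below. All concrete values are invented; only the malware label, the sensitive tag, and the severity of 25 come from the steps above.

```python
# Illustrative shape of the final record, not an actual SpyCloud schema.
redline_record = {
    "source_type": "malware",
    "sensitive": True,
    "severity": 25,
    "selectors": {
        "email": "victim@example.com",       # hypothetical victim data
        "username": "victim01",
        "ip": "198.51.100.23",
        "device_id": "WIN-EXAMPLE-01",
    },
    "artifacts": {
        "passwords": ["hunter2"],
        "cookies": ["session=abc123"],
        "autofill": {"full_name": "V. Example"},
    },
    "collected_from": "telegram_leak_channel",
}
print(redline_record["severity"], redline_record["source_type"])
```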