Data Collection, Parsing & Ingestion
🔄 End-to-End Lifecycle
SpyCloud ingests data from a wide range of breach, malware, and combolist sources. This page outlines how that data is collected, parsed, classified, and made available in your investigations.
Example: From collection to queryable intelligence
📥 Step 1: Data Collection
SpyCloud collects breach, malware, and combolist data via multiple mechanisms:
- Monitoring dark web, deep web, Telegram, and invite-only forums
- Analyst intelligence, HUMINT, and threat actor engagements
- Infostealer logs from over 70 malware families
- Paste sites, open leaks, and public combolist aggregators
🔍 Step 2: Parsing & Normalization
After acquisition, datasets are parsed through automated and human-in-the-loop workflows.
Key steps include:
- Extracting usernames, emails, passwords, IP addresses, cookies, and device IDs
- Normalizing field values and standardizing formats
- Deduplicating across time, source, and record
- Cleaning up irrelevant or malformed data
SpyCloud’s LLM-powered parsing helps classify and validate attributes across thousands of inconsistent formats.
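The extract, normalize, deduplicate, and clean-up steps above can be pictured with a minimal rule-based sketch. It is not SpyCloud's parser and involves no LLM; the `Record` type, the `parse_combolist` function, the `email:password` line format, and the sample data are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Deliberately loose email check; real validation would be stricter (assumption).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class Record:
    """Hypothetical normalized record; not SpyCloud's actual schema."""
    email: str
    password: str
    source: str


def normalize_email(raw: str) -> Optional[str]:
    """Trim whitespace and lowercase; return None for malformed values."""
    value = raw.strip().lower()
    return value if EMAIL_RE.match(value) else None


def parse_combolist(lines: Iterable[str], source: str) -> List[Record]:
    """Parse 'email:password' lines, dropping malformed rows and duplicates."""
    seen = set()
    records = []
    for line in lines:
        # Split on the first ':' only, since passwords may contain ':'.
        email_raw, sep, password = line.strip().partition(":")
        if not sep or not password:
            continue  # malformed: no separator or empty password
        email = normalize_email(email_raw)
        if email is None:
            continue  # malformed: not a plausible email address
        key = (email, password)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        records.append(Record(email, password, source))
    return records


sample = [
    " Alice@Example.com:hunter2",
    "alice@example.com:hunter2",  # duplicate once normalized
    "not-an-email:secret",        # malformed, dropped
    "bob@example.org:pa:ss",      # password containing ':'
]
records = parse_combolist(sample, source="demo-combolist")
```

Normalizing before deduplicating matters here: the first two sample lines only collapse into one record because case and whitespace are standardized first.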
🧪 Step 3: Classification
After parsing:
- Datasets are labeled by type (e.g., breach, malware, combolist)
- Source confidence and breach category are assigned
- Each record is assigned a severity code (2, 5, 20, 25, or 26) based on its contents
- Metadata is enriched with timestamps, selectors, and behavioral traits
Both automated tagging and analyst review play a role in this step.
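As a rough illustration of the automated side of this step, the sketch below assigns a severity code from a record's dataset type and fields. Only the set of codes and the fact that the malware log example on this page scores 25 come from this page. The specific rules, the `ParsedRecord` type, and the field names are hypothetical, and code 26 is left unmodeled because this page does not say what triggers it.

```python
from dataclasses import dataclass, field
from typing import Set

# Severity codes listed on this page.
SEVERITY_CODES = {2, 5, 20, 25, 26}


@dataclass
class ParsedRecord:
    """Hypothetical parsed record; field names are illustrative only."""
    dataset_type: str  # "breach", "malware", or "combolist"
    fields: Set[str] = field(default_factory=set)


def assign_severity(record: ParsedRecord) -> int:
    """Assign a severity code using ASSUMED rules, not SpyCloud's logic.

    Code 26 is intentionally never returned: its trigger is not
    described on this page.
    """
    if record.dataset_type == "malware":
        return 25  # matches the malware log example on this page
    if "password_plaintext" in record.fields:
        return 20  # assumption: exposed plaintext password
    if "password_hash" in record.fields:
        return 5   # assumption: hashed password only
    return 2       # assumption: identifier with no credential


stealer = ParsedRecord("malware", {"email", "password_plaintext", "cookies"})
combo = ParsedRecord("combolist", {"email", "password_plaintext"})
hashed = ParsedRecord("breach", {"email", "password_hash"})
email_only = ParsedRecord("breach", {"email"})

severities = [assign_severity(r) for r in (stealer, combo, hashed, email_only)]
```

Ordering the checks from most to least severe keeps the rules easy to audit, which suits a step where analysts review the automated tags.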
🧬 Step 4: Ingestion into the SpyCloud Data Lake
Once classified, records are:
- Ingested into the identity data platform
- Indexed for access across:
  - Investigations Module
  - API & IDLink
  - Compass risk alerts
  - Graph view + AI Insights
- Mapped to selectors (email, phone, device ID, etc.)
- Deduplicated globally for a consistent investigation experience
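The mapping to selectors and the global deduplication described above can be sketched as a small in-memory index. This is an illustration only, not the SpyCloud data platform: the `SelectorIndex` class, the content-hash record ID, and the selector field names are all assumptions.

```python
import hashlib
import json
from collections import defaultdict

# Selector kinds taken from the examples on this page; field names are assumed.
SELECTOR_KINDS = ("email", "phone", "device_id")


class SelectorIndex:
    """Hypothetical index: stores each record once, keyed by selector."""

    def __init__(self):
        self.records = {}                    # record_id -> record
        self.by_selector = defaultdict(set)  # (kind, value) -> record ids

    def ingest(self, record: dict) -> str:
        # Content hash over a canonical JSON form, so key order is irrelevant.
        canonical = json.dumps(record, sort_keys=True)
        record_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if record_id in self.records:
            return record_id  # global dedup: identical record already stored
        self.records[record_id] = record
        for kind in SELECTOR_KINDS:
            value = record.get(kind)
            if value:
                self.by_selector[(kind, value)].add(record_id)
        return record_id

    def lookup(self, kind: str, value: str) -> list:
        """Return every stored record that shares the given selector."""
        ids = self.by_selector.get((kind, value), ())
        return [self.records[rid] for rid in sorted(ids)]


index = SelectorIndex()
first = index.ingest({"email": "alice@example.com", "device_id": "dev-1", "severity": 25})
# Same content with different key order: treated as a duplicate.
second = index.ingest({"device_id": "dev-1", "severity": 25, "email": "alice@example.com"})
index.ingest({"email": "bob@example.org", "device_id": "dev-1", "severity": 20})

shared_device = index.lookup("device_id", "dev-1")
```

Looking up `dev-1` returns both Alice's and Bob's records, which is the kind of pivot from one selector to related records that selector mapping makes possible.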
🧪 Example Flow: Malware Log
Scenario: A Redline infostealer log is posted on a Telegram leak channel.
SpyCloud's ingestion flow:
- HUMINT or scraper flags the post
- Log is parsed into structured selectors:
  - Emails, usernames, passwords
  - Cookies, tokens, autofill, IPs
- Severity score assigned (25)
- Source labeled as malware, tagged as sensitive
- Log becomes searchable across SpyCloud products