🧭 Introduction

This document describes the schema for SpyCloud's breach data repository.

At a high level, our data is organized into two hierarchical structures: Breach Catalog and Breach Records.

🗂️ SpyCloud Data Hierarchy

Breach Catalog: A collection of breaches ingested into our platform. Each catalog entry contains metadata like breach title, acquisition date, affected domains, etc.

Breach Record: A structured collection of data assets extracted from the breach, including credentials, device metadata, and identity attributes, grouped per user/persona.

Example: If a breach contains email, password, and phone for five users, we produce five records, each with three data assets.

Some data assets are extracted directly; others are generated by SpyCloud for added value (e.g., email_username, target_domain).

🧪 Data Normalization

Asset	Normalization
`email`	Lowercased for indexing
`username`	Lowercased
`social_*`	Lowercased (e.g., `social_twitter`)
`*_time`	ISO 8601 datetime
`*_date`	ISO 8601 format
`dob`	ISO 8601 datetime (time component may be irrelevant)
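
Because these assets are stored lowercased, it can help to apply the same rules to your own values before comparing them against breach records. The following is a minimal sketch under that assumption; the function name `normalize_assets` is hypothetical and not part of any SpyCloud library, and it covers only the lowercase rules from the table above.

```python
def normalize_assets(record: dict) -> dict:
    """Return a copy of `record` with the lowercase rules from the table applied."""
    lowercase_keys = {"email", "username"}
    normalized = {}
    for key, value in record.items():
        # `social_*` covers fields such as `social_twitter`.
        needs_lowercase = key in lowercase_keys or key.startswith("social_")
        if isinstance(value, str) and needs_lowercase:
            normalized[key] = value.lower()
        else:
            normalized[key] = value
    return normalized


# Made-up sample values, for illustration only.
sample = {"email": "Jane.Doe@Example.COM", "social_twitter": "JaneDoe", "password": "S3cret"}
print(normalize_assets(sample))
```

Other assets, such as `password`, are left untouched because the table does not list a normalization rule for them.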

🧬 Character Encoding

All data is normalized to UTF-8. Decode it as UTF-8 in your language of choice:

Java: new String(bytes, "UTF-8")
Python 2: unicode(obj, "utf-8")
Python 3: str(obj, encoding="utf-8")
Go: Native UTF-8 handling
Perl: Encode::decode("utf8", $bytes)

🗓️ Date and Time Format

All timestamps follow ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ

Example	Description
`2000-01-01T00:00:00Z`	Midnight UTC, Jan 1, 2000
`2018-08-01T14:00:00Z`	2 PM UTC, Aug 1, 2018
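
Timestamps in this shape can be parsed with the Python standard library. The sketch below assumes every value matches `YYYY-MM-DDTHH:MM:SSZ` exactly; values carrying fractional seconds or a numeric offset would need a different format string. The helper name `parse_timestamp` is illustrative.

```python
from datetime import datetime, timezone

# Matches the documented shape YYYY-MM-DDTHH:MM:SSZ (assumed to be exact).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(value, TIMESTAMP_FORMAT)
    # The trailing "Z" means UTC, so attach that timezone explicitly.
    return naive.replace(tzinfo=timezone.utc)


parsed = parse_timestamp("2018-08-01T14:00:00Z")
print(parsed.isoformat())  # 2018-08-01T14:00:00+00:00
```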

🔐 Password Cracking

Field Combination	Meaning
`password_type = plaintext`	Original password was plaintext
`password_type ≠ plaintext` + `password_plaintext` present	SpyCloud cracked the original hashed password
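
The two combinations above can be checked in code. This is a sketch only: the function name and the returned labels are hypothetical, and any record that matches neither row of the table is reported as `unknown`.

```python
def password_origin(record: dict) -> str:
    """Classify a record's password using the field combinations in the table."""
    password_type = record.get("password_type")
    if password_type == "plaintext":
        # The password was already plaintext in the source data.
        return "original_plaintext"
    if password_type is not None and "password_plaintext" in record:
        # A non-plaintext type plus a plaintext value means the hash was cracked.
        return "cracked"
    return "unknown"


print(password_origin({"password_type": "plaintext", "password": "hunter2"}))
print(password_origin({"password_type": "MD5", "password": "<hash>", "password_plaintext": "hunter2"}))
```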

🧾 UUIDs

SpyCloud uses UUID v4 strings to uniquely identify:

Breach catalog entries (uuid)
Breach records (document_id)

Examples:

ae375975-894b-489c-876b-a294dddf0c96
b17d51a7-f4a9-479a-8820-801753e86c05
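
If you want to sanity-check identifiers before storing them, Python's standard `uuid` module can confirm that a string is a version 4 UUID. The helper name `is_uuid4` below is illustrative, and the check assumes the canonical hyphenated form shown in the examples.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if `value` is a canonical, hyphenated UUID v4 string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Compare against the canonical form so other accepted spellings are rejected.
    return parsed.version == 4 and str(parsed) == value.lower()


print(is_uuid4("ae375975-894b-489c-876b-a294dddf0c96"))  # True
print(is_uuid4("not-a-uuid"))  # False
```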

📚 Breach Catalog Schema

SpyCloud generates assets such as source_id, domain, email_domain, target_domain, target_subdomain, password_type, sighting, and severity, along with other related data points. Below is the full list of metadata fields available for each breach in SpyCloud’s breach catalog.

Field	Type	Description
`id`	int	Numerical breach ID. Correlates to `source_id` in breach records.
`uuid`	string	UUID v4 encoded version of breach ID (used in Firehose file naming).
`title`	string	Breach title (if disclosable). Otherwise uses a generic label.
`description`	string	Breach description (if disclosable). Otherwise uses a generic summary.
`site`	string	Website of the breached organization (if known).
`site_description`	string	Description of the breached organization (if available).
`type`	string	Indicates if the breach is `public` (found online) or `private` (exclusive to SpyCloud).
`num_records`	int	Number of records parsed, normalized, and deduplicated from this breach.
`spycloud_publish_date`	datetime	Date when the breach was ingested and made available to customers.
`acquisition_date`	datetime	Date SpyCloud first acquired the breach.
`breach_date`	datetime	Estimated date the breach occurred.
`public_date`	datetime	When the breach was disclosed publicly.
`media_urls`	list	List of media URLs referencing the breach.
`assets`	dictionary	Dictionary mapping each asset to its count in the breach.
`confidence`	int	Confidence score in the breach source.
`combo_list_flag`	string	Indicates if the breach is a combo list.
`breach_main_category`	string	Classifies as `combolist`, `breach`, or `malware`.
`breach_category`	string	Further categorization: `combolist`, `exfiltrated`, `exposed`, `infostealer`, `phished`, `scraped`, or `unknown`.
`breached_companies`	list	Companies alleged or confirmed to be involved in the breach.
`targeted_companies`	list	Companies determined to be targeted by the breach.
`targeted_industries`	list	Industries believed to be targeted by the breach.
`sensitive_source`	boolean	Indicates whether the source is sensitive or restricted.
`malware_family`	string	Malware family name (only if `breach_category` is `infostealer`).
`status`	string	`Pending`: Indicates that a new breach has been identified and is in the process of being ingested. Note: You can start consuming the incomplete breach records even while the catalog entry is marked `Pending`. `Verified`: Indicates that ingestion is complete and that all records from the breach have been successfully loaded and validated and are ready for use.
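
As an illustration of how these fields might be consumed, the sketch below summarizes one catalog entry held as a Python dictionary. The function name and the sample values are made up for this example; only the field names and the `Pending`/`Verified` and `infostealer` values come from the table above.

```python
def summarize_catalog_entry(entry: dict) -> dict:
    """Pull a few commonly needed facts out of a breach catalog entry."""
    assets = entry.get("assets", {})
    return {
        "id": entry.get("id"),
        "title": entry.get("title"),
        # `Verified` means ingestion is complete; `Pending` means more records may arrive.
        "ingestion_complete": entry.get("status") == "Verified",
        # `malware_family` is only expected for infostealer entries.
        "is_infostealer": entry.get("breach_category") == "infostealer",
        # The three assets with the highest counts in this breach.
        "top_assets": sorted(assets, key=assets.get, reverse=True)[:3],
    }


# Invented sample entry, for illustration only.
entry = {
    "id": 12345,
    "title": "Example Breach",
    "status": "Pending",
    "breach_category": "exfiltrated",
    "assets": {"email": 1000, "password": 950, "username": 400, "salt": 10},
}
print(summarize_catalog_entry(entry))
```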

🧾 Breach Record Schema

Breach Records are collections of data assets parsed and ingested into our data platform. Each Breach Record contains multiple assets, some of which are extracted directly from the parsed data while others are generated by the SpyCloud ingest process. Below is the list of SpyCloud-generated assets that are always present in a breach record.

Field	Type	Description
`document_id`	UUID	Unique ID per identity
`source_id`	int	Links to breach catalog
`spycloud_publish_date`	datetime	When record was published
`severity`	int	Numerical value based on record contents (2–26)
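
Since these four fields are always present, a consumer can use them as a basic sanity check and as the join key back to the breach catalog. The sketch below assumes records and catalog entries are plain dictionaries; the helper names are hypothetical.

```python
REQUIRED_FIELDS = ("document_id", "source_id", "spycloud_publish_date", "severity")


def check_record(record: dict) -> list:
    """Return a list of problems found in a breach record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    severity = record.get("severity")
    # The table documents severity as an integer between 2 and 26.
    if severity is not None and not (isinstance(severity, int) and 2 <= severity <= 26):
        problems.append(f"severity outside documented range: {severity!r}")
    return problems


def find_catalog_entry(record: dict, catalog_by_id: dict):
    """Look up the catalog entry whose `id` matches the record's `source_id`."""
    return catalog_by_id.get(record.get("source_id"))


# Invented sample data, for illustration only.
catalog_by_id = {12345: {"id": 12345, "title": "Example Breach"}}
record = {
    "document_id": "ae375975-894b-489c-876b-a294dddf0c96",
    "source_id": 12345,
    "spycloud_publish_date": "2018-08-01T14:00:00Z",
    "severity": 20,
}
print(check_record(record))
print(find_catalog_entry(record, catalog_by_id))
```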

🐛 Infected User Assets

Below is a table of all infected user (botnet) assets which might be present in a breach record.

Field	Type	Description
`av_softwares`	string	List of antivirus software installed on the machine.
`display_resolution`	string	The system display resolution.
`form_cookies_data`	string	Cookie data associated with this person.
`form_post_data`	string	Form post data associated with this person.
`infected_machine_id`	string	A unique identifier either extracted from an infostealer log, or an RFC 4122-compliant UUID generated by SpyCloud when none is present. Format and origin may vary by malware family.
`log_id`	string	A deterministic SHA256 hash computed from the contents of a malware log archive.
`infected_path`	string	Local path to the malicious software installed on the infected system.
`infected_time`	string	The time the system was infected with malware.
`keyboard_languages`	string	Keyboard languages configured in the OS.
`logon_server`	string	Logon server captured from the infected environment.
`mac_address`	string	12-character alphanumeric MAC address of the device.
`port`	string	Network port paired with IP address.
`system_install_date`	datetime	Time at which the system OS was installed.
`system_model`	string	Model identifier of the infected system.
`target_domain`	string	Second-level domain (SLD) extracted from `target_url`.
`target_subdomain`	string	Full subdomain + domain extracted from `target_url`.
`target_url`	string	URL captured via keylogger or form capture.
`user_agent`	string	Browser user-agent string.
`user_browser`	string	Browser name used by the infected user.
`user_hostname`	string	Hostname of the infected system.
`user_os`	string	Operating system name.
`user_sys_domain`	string	System domain name.
`user_sys_registered_organization`	string	Name of the system's registered organization.
`user_sys_registered_owner`	string	Name of the system's registered owner.
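
To illustrate how `target_subdomain` relates to `target_url`, the sketch below pulls the host name out of a URL with the standard library. It is an approximation only: SpyCloud's exact derivation is not described here, and the helper name is hypothetical. Deriving `target_domain` reliably would additionally require a public suffix list, which this sketch does not attempt.

```python
from typing import Optional
from urllib.parse import urlsplit


def host_from_url(target_url: str) -> Optional[str]:
    """Return the lowercased host part of a URL, or None if there is none."""
    # `hostname` drops any port and credentials and lowercases the result.
    return urlsplit(target_url).hostname


print(host_from_url("https://Login.Example.com:8443/auth?next=/home"))  # login.example.com
```

URLs without a scheme (for example `login.example.com/auth`) yield `None` here, because `urlsplit` treats them as a path rather than a host.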

🔐 Credentials & Account Assets

Below is a table of all credential and account related assets that may appear in breach records.

Field	Type	Description
`account_caption`	string	Account profile caption.
`account_id`	string	Account number or ID.
`account_image_url`	string	Account image URL.
`account_last_activity_time`	datetime	Timestamp of last account activity (ISO 8601).
`account_login_time`	datetime	Last account login time (ISO 8601).
`account_modification_time`	datetime	Account modification date (ISO 8601).
`account_nickname`	string	Account nickname.
`account_notes`	string	Account notes.
`account_password_date`	datetime	Date when account password was set (ISO 8601).
`account_secret`	string	Account secret answer.
`account_secret_question`	string	Account secret question.
`account_signup_time`	datetime	Account signup date (ISO 8601).
`account_status`	string	Account status.
`account_title`	string	Account title.
`account_type`	string	Account type.
`api_token`	string	API token.
`api_token_secret`	string	API token secret.
`backup_email`	string	Backup email address.
`backup_email_username`	string	Username extracted from `backup_email` (before the `@`).
`domain`	string	Domain name.
`email`	string	Email address.
`email_domain`	string	Extracted domain from `email` (after the `@`).
`email_username`	string	Extracted username from `email` (before the `@`).
`num_posts`	int	Number of posts (typically forum-related).
`password`	string	Account password (original form).
`password_plaintext`	string	Plaintext version of the password (if cracked).
`password_type`	string	Password type (e.g., plaintext, SHA1, MD5).
`private_key`	string	SHA256 hash of private key.
`private_key_password`	string	Password for the private key.
`public_key`	string	SHA256 hash of public key.
`salt`	string	Password salt.
`service`	string	Service associated with credential (e.g., Spotify, Netflix).
`service_expiration`	string	Service expiration date (ISO 8601).
`username`	string	Username.
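
The relationship between `email`, `email_username`, and `email_domain` can be sketched as a split on the `@` sign. This mirrors the descriptions in the table rather than SpyCloud's actual ingest code, and the function name is illustrative; splitting on the last `@` is an assumption made here.

```python
from typing import Tuple


def split_email(email: str) -> Tuple[str, str]:
    """Split an email into (username, domain) around the last "@"."""
    username, separator, domain = email.rpartition("@")
    if not separator or not username or not domain:
        raise ValueError(f"not a usable email address: {email!r}")
    return username, domain


email_username, email_domain = split_email("jane.doe@example.com")
print(email_username, email_domain)  # jane.doe example.com
```

The same split applies to `backup_email` when deriving `backup_email_username`.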

🧩 Other Assets

SpyCloud recaptures more than 200 data types as identity artifacts, with a rich and diverse set of fields spanning PII, financial, geographical, and social media assets.

🧬 Versioning Policy

SpyCloud maintains backward compatibility during a major version lifecycle.

We never remove or rename existing fields in a current product version.
New fields may be added without notice, so clients should handle unknown keys.
Breaking changes happen only during a major upgrade (e.g., v1 → v2).
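
One way to handle unknown keys is to keep only the fields your code knows about and ignore the rest. The sketch below does this for the four always-present record fields; the `CoreRecord` class and `parse_core_record` function are hypothetical names, and the field types follow the Breach Record Schema table.

```python
from dataclasses import dataclass, fields


@dataclass
class CoreRecord:
    document_id: str
    source_id: int
    spycloud_publish_date: str
    severity: int


def parse_core_record(payload: dict) -> CoreRecord:
    """Build a CoreRecord, silently ignoring keys that are not modeled."""
    known = {field.name for field in fields(CoreRecord)}
    return CoreRecord(**{key: value for key, value in payload.items() if key in known})


# Invented payload; "some_future_field" stands in for a field added later.
payload = {
    "document_id": "b17d51a7-f4a9-479a-8820-801753e86c05",
    "source_id": 12345,
    "spycloud_publish_date": "2018-08-01T14:00:00Z",
    "severity": 5,
    "some_future_field": "ignored",
}
print(parse_core_record(payload))
```

With this approach, a newly added field does not break parsing; it is simply dropped until the client chooses to model it.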