Data Schema
🧭 Introduction
This document describes the schema for SpyCloud's breach data repository.
At a high level, our data is organized into two hierarchical structures: Breach Catalog and Breach Records.
🗂️ SpyCloud Data Hierarchy
Breach Catalog: A collection of breaches ingested into our platform. Each catalog entry contains metadata like breach title, acquisition date, affected domains, etc.
Breach Record: A structured collection of data assets extracted from the breach — including credentials, device metadata, and identity attributes grouped per user/persona.
Example: If a breach contains email, password, and phone for five users, we produce 5 records, each with 3 data assets.
Some data assets are extracted directly; others are generated by SpyCloud for added value (e.g., email_username, target_domain).
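To make the record/asset relationship concrete, here is a minimal Python sketch of the five-user example above. The sample values and the `count_assets` helper are hypothetical and not part of any SpyCloud API; only the asset names `email`, `password`, and `phone` come from the example.

```python
from collections import Counter

# Hypothetical sample data: five users, each with three extracted assets.
records = [
    {"email": f"user{i}@example.com", "password": f"secret{i}", "phone": f"555-010{i}"}
    for i in range(5)
]


def count_assets(records):
    """Count how many records carry each asset (field name)."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return dict(counts)


asset_counts = count_assets(records)
```

Each dictionary stands for one breach record, and each key/value pair stands for one data asset.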
🧪 Data Normalization
| Asset | Normalization | 
|---|---|
| email | Lowercased for indexing | 
| username | Lowercased | 
| social_* | Lowercased (e.g., social_twitter) | 
| *_time | ISO 8601 datetime | 
| *_date | ISO 8601 format | 
| dob | ISO 8601 datetime (time component may be irrelevant) | 
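If you compare your own data against these records, applying the same normalization first can avoid false mismatches. The following is a rough sketch under stated assumptions: `normalize_record` is a hypothetical helper, date-like values are assumed to arrive as Python `datetime` objects, and naive datetimes are assumed to already be in UTC.

```python
from datetime import datetime, timezone

LOWERCASED_FIELDS = ("email", "username")


def _is_date_like(key):
    """True for keys covered by the *_time, *_date, and dob rows of the table."""
    return key.endswith("_time") or key.endswith("_date") or key == "dob"


def normalize_record(record):
    """Return a copy of `record` with the table's normalizations applied."""
    normalized = {}
    for key, value in record.items():
        if isinstance(value, str) and (key in LOWERCASED_FIELDS or key.startswith("social_")):
            normalized[key] = value.lower()
        elif isinstance(value, datetime) and _is_date_like(key):
            if value.tzinfo is None:
                # Assumption: naive datetimes are already UTC.
                value = value.replace(tzinfo=timezone.utc)
            normalized[key] = value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        else:
            # Everything else (passwords, for example) is left untouched.
            normalized[key] = value
    return normalized
```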
🧬 Character Encoding
All data is normalized to UTF-8. Decode it explicitly as UTF-8 in your language of choice:
- Java: new String(bytes, "UTF-8")
- Python 2: unicode(obj, "utf-8")
- Python 3: str(obj, encoding="utf-8")
- Go: Native UTF-8 handling
- Perl: Encode::decode("utf8", $bytes)
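As a slightly fuller Python 3 illustration, the sketch below wraps the decoding step in a hypothetical `decode_utf8` helper. The `strict` flag is an assumption added for illustration: it lets a caller choose between failing on invalid bytes and substituting the Unicode replacement character.

```python
def decode_utf8(raw, strict=True):
    """Decode bytes as UTF-8; pass through values that are already str."""
    if isinstance(raw, str):
        return raw
    # "strict" raises UnicodeDecodeError on invalid input;
    # "replace" substitutes U+FFFD for undecodable bytes.
    return raw.decode("utf-8", errors="strict" if strict else "replace")
```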
🗓️ Date and Time Format
All timestamps follow ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ
| Example | Description | 
|---|---|
| 2000-01-01T00:00:00Z | Midnight UTC, Jan 1, 2000 | 
| 2018-08-01T14:00:00Z | 2 PM UTC, Aug 1, 2018 | 
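A minimal Python sketch for parsing and producing timestamps in this exact format follows. The helper names `parse_timestamp` and `format_timestamp` are illustrative only, and the sketch assumes every value carries the trailing `Z` (UTC) shown above.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def format_timestamp(moment):
    """Render a timezone-aware datetime back into the same UTC format."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
```

Attaching `timezone.utc` explicitly keeps later comparisons and arithmetic from silently mixing naive and aware datetimes.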
🔐 Password Cracking
| Field Combination | Meaning | 
|---|---|
| password_type = plaintext | Original password was plaintext | 
| password_type ≠ plaintext + password_plaintext present | SpyCloud cracked the original hashed password | 
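The two rows above can be expressed as a small check. This is a sketch only: `password_origin` is a hypothetical helper, and the `"hashed"` and `"unknown"` outcomes are assumptions for cases the table does not cover (a non-plaintext type with no `password_plaintext`, or no `password_type` at all).

```python
def password_origin(record):
    """Classify a record's password using the field combinations in the table."""
    password_type = record.get("password_type")
    if password_type is None:
        return "unknown"    # assumption: not covered by the table
    if password_type == "plaintext":
        return "plaintext"  # original password was plaintext
    if record.get("password_plaintext"):
        return "cracked"    # hashed original, plaintext recovered
    return "hashed"         # assumption: hashed and not cracked
```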
🧾 UUIDs
SpyCloud uses UUID v4 strings to uniquely identify:
- Breach catalog entries (uuid)
- Breach records (document_id)
Examples:
- ae375975-894b-489c-876b-a294dddf0c96
- b17d51a7-f4a9-479a-8820-801753e86c05
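If you want to sanity-check identifiers before storing or joining on them, Python's standard `uuid` module can do so. The `is_uuid4` helper below is a hypothetical name, and requiring the canonical hyphenated form is an assumption based on the examples above.

```python
import uuid


def is_uuid4(value):
    """Return True if `value` is a canonical, hyphenated UUID v4 string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Require version 4 and the canonical 8-4-4-4-12 representation.
    return parsed.version == 4 and str(parsed) == value.lower()
```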
📚 Breach Catalog Schema
SpyCloud generates assets such as source_id, domain, email_domain, target_domain, target_subdomain, password_type, sighting, and severity, along with other related data points. Below is the full list of metadata fields available for each breach in SpyCloud’s breach catalog.
| Field | Type | Description | 
|---|---|---|
|  | int | Numerical breach ID. Correlates to  | 
|  | string | UUID v4 encoded version of breach ID (used in Firehose file naming). | 
|  | string | Breach title (if disclosable). Otherwise uses a generic label. | 
|  | string | Breach description (if disclosable). Otherwise uses a generic summary. | 
|  | string | Website of the breached organization (if known). | 
|  | string | Description of the breached organization (if available). | 
|  | string | Indicates if the breach is  | 
|  | int | Number of records parsed, normalized, and deduplicated from this breach. | 
|  | datetime | Date when the breach was ingested and made available to customers. | 
|  | datetime | Date SpyCloud first acquired the breach. | 
|  | datetime | Estimated date the breach occurred. | 
|  | datetime | When the breach was disclosed publicly. | 
|  | list | List of media URLs referencing the breach. | 
|  | dictionary | Dictionary mapping each asset to its count in the breach. | 
|  | int | Confidence score in the breach source. | 
|  | string | Indicates if the breach is a combo list. | 
|  | string | Classifies as  | 
|  | string | Further categorization:  | 
|  | list | Companies allegedly or confirmed involved in the breach. | 
|  | list | Companies determined to be targeted by the breach. | 
|  | list | Industries believed to be targeted by the breach. | 
|  | boolean | Indicates whether the source is sensitive or restricted. | 
|  | string | Malware family name (only if  | 
|  | string |  | 
🧾 Breach Record Schema
Breach Records are collections of data assets parsed and ingested into our data platform. Each Breach Record contains multiple assets: some are extracted directly from the parsed data, while others are generated by the SpyCloud ingest process. Below is the list of SpyCloud-generated assets that are always present in a breach record.
| Field | Type | Description | 
|---|---|---|
| document_id | UUID | Unique ID per identity | 
| source_id | int | Links to breach catalog | 
| spycloud_publish_date | datetime | When record was published | 
| severity | int | Numerical value based on record contents (2–26) | 
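Because these four fields are always present, a consumer can check for them up front. The sketch below is illustrative: `missing_required_fields` and `severity_in_range` are hypothetical helpers, and the sample record in the usage lines is made up.

```python
REQUIRED_FIELDS = ("document_id", "source_id", "spycloud_publish_date", "severity")


def missing_required_fields(record):
    """List the always-present fields that are absent from `record`."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def severity_in_range(record):
    """Check that severity is an integer within the documented 2-26 range."""
    severity = record.get("severity")
    return isinstance(severity, int) and 2 <= severity <= 26


# Usage with a made-up record:
sample = {
    "document_id": "ae375975-894b-489c-876b-a294dddf0c96",
    "source_id": 1234,
    "spycloud_publish_date": "2018-08-01T14:00:00Z",
    "severity": 20,
}
problems = missing_required_fields(sample)
```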
🐛 Infected User Assets
Below is a table of all infected user (botnet) assets which might be present in a breach record.
| Field | Type | Description | 
|---|---|---|
| av_softwares | string | List of antivirus software installed on the machine. | 
| display_resolution | string | The system display resolution. | 
| form_cookies_data | string | Cookie data associated with this person. | 
| form_post_data | string | Form post data associated with this person. | 
| infected_machine_id | string | A unique identifier either extracted from an infostealer log, or an RFC 4122-compliant UUID generated by SpyCloud when none is present. Format and origin may vary by malware family. | 
| log_id | string | A deterministic SHA256 hash computed from the contents of a malware log archive. | 
| infected_path | string | Local path to the malicious software installed on the infected system. | 
| infected_time | string | The time the system was infected with malware. | 
| keyboard_languages | string | Keyboard languages configured in the OS. | 
| logon_server | string | Logon server captured from the infected environment. | 
| mac_address | string | 12-character alphanumeric MAC address of the device. | 
| port | string | Network port paired with IP address. | 
| system_install_date | datetime | Time at which the system OS was installed. | 
| system_model | string | Model identifier of the infected system. | 
| target_domain | string | Second-level domain (SLD) extracted from target_url. | 
| target_subdomain | string | Full subdomain + domain extracted from target_url. | 
| target_url | string | URL captured via keylogger or form capture. | 
| user_agent | string | Browser user-agent string. | 
| user_browser | string | Browser name used by the infected user. | 
| user_hostname | string | Hostname of the infected system. | 
| user_os | string | Operating system name. | 
| user_sys_domain | string | System domain name. | 
| user_sys_registered_organization | string | Name of the system's registered organization. | 
| user_sys_registered_owner | string | Name of the system's registered owner. | 
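To illustrate how target_subdomain and target_domain relate to target_url, here is a rough approximation in Python. It is not SpyCloud's extraction logic: `split_target_url` is a hypothetical helper, and taking the last two host labels as the domain is a naive assumption that gives wrong results for multi-label suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def split_target_url(target_url):
    """Approximate target_subdomain and target_domain from a URL."""
    # hostname is None when the URL has no scheme/netloc; fall back to "".
    host = urlparse(target_url).hostname or ""
    labels = host.split(".")
    # Naive assumption: the registrable domain is the last two labels.
    naive_domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {"target_subdomain": host, "target_domain": naive_domain}
```

A production implementation would more likely consult the Public Suffix List rather than counting labels.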
🔐 Credentials & Account Assets
Below is a table of all credential and account related assets that may appear in breach records.
| Field | Type | Description | 
|---|---|---|
| account_caption | string | Account profile caption. | 
| account_id | string | Account number or ID. | 
| account_image_url | string | Account image URL. | 
| account_last_activity_time | datetime | Timestamp of last account activity (ISO 8601). | 
| account_login_time | datetime | Last account login time (ISO 8601). | 
| account_modification_time | datetime | Account modification date (ISO 8601). | 
| account_nickname | string | Account nickname. | 
| account_notes | string | Account notes. | 
| account_password_date | datetime | Date when account password was set (ISO 8601). | 
| account_secret | string | Account secret answer. | 
| account_secret_question | string | Account secret question. | 
| account_signup_time | datetime | Account signup date (ISO 8601). | 
| account_status | string | Account status. | 
| account_title | string | Account title. | 
| account_type | string | Account type. | 
| api_token | string | API token. | 
| api_token_secret | string | API token secret. | 
| backup_email | string | Backup email address. | 
| backup_email_username | string | Username extracted from backup_email (before the @). | 
| domain | string | Domain name. | 
| email | string | Email address. | 
| email_domain | string | Extracted domain from email (after the @). | 
| email_username | string | Extracted username from email (before the @). | 
| num_posts | int | Number of posts (typically forum-related). | 
| password | string | Account password (original form). | 
| password_plaintext | string | Plaintext version of the password (if cracked). | 
| password_type | string | Password type (e.g., plaintext, SHA1, MD5). | 
| private_key | string | SHA256 hash of private key. | 
| private_key_password | string | Password for the private key. | 
| public_key | string | SHA256 hash of public key. | 
| salt | string | Password salt. | 
| service | string | Service associated with credential (e.g., Spotify, Netflix). | 
| service_expiration | string | Service expiration date (ISO 8601). | 
| username | string | Username. | 
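The derived fields email_username and email_domain can be reproduced from email with a simple split. The sketch below is an assumption about how such a split could work, not SpyCloud's implementation; `split_email` is a hypothetical helper, and splitting on the last `@` is a design choice made here for illustration.

```python
def split_email(email):
    """Split an email into email_username and email_domain, or return None."""
    # Lowercase first, matching the normalization applied to email.
    username, separator, domain = email.lower().rpartition("@")
    if not separator or not username or not domain:
        return None  # no "@", or nothing on one side of it
    return {"email_username": username, "email_domain": domain}
```

The same function applies to backup_email to obtain backup_email_username.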
SpyCloud recaptures 200+ data types as identity artifacts, with a rich and diverse set of fields spanning PII, financial, geographical, and social media assets.
SpyCloud maintains backward compatibility during a major version lifecycle.
- We never remove or rename existing fields in a current product version
- New fields may be added without notice — clients should handle unknown keys
- Breaking changes will only happen during a major upgrade (e.g., v1 → v2)
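One way to handle unknown keys is to model the always-present fields explicitly and keep everything else in a catch-all dictionary. This is a minimal sketch, not a prescribed client design: the `BreachRecord` class, the `extra` attribute, and the `some_future_field` key in the usage lines are all hypothetical.

```python
from dataclasses import dataclass, field

KNOWN_FIELDS = ("document_id", "source_id", "spycloud_publish_date", "severity")


@dataclass
class BreachRecord:
    document_id: str
    source_id: int
    spycloud_publish_date: str
    severity: int
    # Any asset this client does not model explicitly lands here.
    extra: dict = field(default_factory=dict)


def parse_record(raw):
    """Build a BreachRecord, preserving unknown keys instead of failing on them."""
    known = {name: raw[name] for name in KNOWN_FIELDS}  # KeyError if one is missing
    extra = {key: value for key, value in raw.items() if key not in KNOWN_FIELDS}
    return BreachRecord(**known, extra=extra)


# Usage: a field added in a later release does not break parsing.
record = parse_record({
    "document_id": "b17d51a7-f4a9-479a-8820-801753e86c05",
    "source_id": 1234,
    "spycloud_publish_date": "2018-08-01T14:00:00Z",
    "severity": 5,
    "some_future_field": "value",
})
```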