Data Schema

🧭 Introduction

This document describes the schema for SpyCloud's breach data repository.

At a high level, our data is organized into two hierarchical structures: Breach Catalog and Breach Records.


🗂️ SpyCloud Data Hierarchy

Breach Catalog: A collection of breaches ingested into our platform. Each catalog entry contains metadata like breach title, acquisition date, affected domains, etc.

Breach Record: A structured collection of data assets extracted from the breach — including credentials, device metadata, and identity attributes grouped per user/persona.

Example: If a breach contains an email, password, and phone number for five users, we produce five records, each with three data assets.

Some data assets are extracted directly; others are generated by SpyCloud for added value (e.g., email_username, target_domain).


🧪 Data Normalization

Asset | Normalization
email | Lowercased for indexing
username | Lowercased
social_* | Lowercased (e.g., social_twitter)
*_time | ISO 8601 datetime
*_date | ISO 8601 format
dob | ISO 8601 datetime (time component may be irrelevant)
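Because these assets are stored lowercased, it can help to lowercase your own lookup values the same way before comparing them. The following is a minimal sketch of the lowercasing rules in the table above, assuming records arrive as flat key-value objects; the function name normalize_assets and the sample values are hypothetical, and this is not SpyCloud's ingest code.

```python
def normalize_assets(record):
    """Return a copy of `record` with email, username, and social_* lowercased.

    Only the lowercasing rules from the table are covered; date and
    time fields are passed through unchanged.
    """
    normalized = {}
    for key, value in record.items():
        should_lowercase = key in ("email", "username") or key.startswith("social_")
        if should_lowercase and isinstance(value, str):
            normalized[key] = value.lower()
        else:
            normalized[key] = value
    return normalized


# Made-up sample values for illustration only.
sample = {
    "email": "Jane.Doe@Example.COM",
    "social_twitter": "@JaneDoe",
    "password": "MixedCase1",
}
print(normalize_assets(sample))
```

Fields outside the table, such as password, keep their original casing.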

🧬 Character Encoding

All data is normalized to UTF-8. Decode raw bytes as UTF-8 in your language of choice:

  • Java: new String(bytes, "UTF-8")
  • Python 2: unicode(obj, "utf-8")
  • Python 3: str(obj, encoding="utf-8")
  • Go: Native UTF-8 handling
  • Perl: Encode::decode("utf8", $bytes)

🗓️ Date and Time Format

All timestamps follow ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ

Example | Description
2000-01-01T00:00:00Z | Midnight UTC, Jan 1, 2000
2018-08-01T14:00:00Z | 2 PM UTC, Aug 1, 2018
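Timestamps in this format can be parsed with the Python standard library. This is a minimal sketch that assumes every value matches YYYY-MM-DDTHH:MM:SSZ exactly (no fractional seconds or numeric offsets); the helper names parse_timestamp and format_timestamp are hypothetical.

```python
from datetime import datetime, timezone

ISO_8601_UTC = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a YYYY-MM-DDTHH:MM:SSZ string into a timezone-aware UTC datetime."""
    return datetime.strptime(value, ISO_8601_UTC).replace(tzinfo=timezone.utc)


def format_timestamp(moment):
    """Format an aware datetime back into the YYYY-MM-DDTHH:MM:SSZ form."""
    return moment.astimezone(timezone.utc).strftime(ISO_8601_UTC)


parsed = parse_timestamp("2018-08-01T14:00:00Z")
print(parsed.isoformat())        # 2018-08-01T14:00:00+00:00
print(format_timestamp(parsed))  # 2018-08-01T14:00:00Z
```

Attaching the UTC timezone explicitly avoids accidentally comparing naive and aware datetimes later on.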

🔐 Password Cracking

Field Combination | Meaning
password_type = plaintext | Original password was plaintext
password_type ≠ plaintext and password_plaintext present | SpyCloud cracked the original hashed password
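The two rules above can be expressed as a small check. This is a sketch under assumptions: records are flat key-value objects, the value "plaintext" is compared exactly as written in the table, and a record with no password_type is treated as unclassified. The function name password_origin and the sample record are hypothetical.

```python
def password_origin(record):
    """Classify a record's password using the two rules in the table.

    Returns "plaintext", "cracked", or None when neither rule applies
    (for example, a hash that has not been cracked).
    """
    password_type = record.get("password_type")
    if password_type == "plaintext":
        return "plaintext"
    # Assumption: password_type must be present for the second rule to apply.
    if password_type is not None and record.get("password_plaintext"):
        return "cracked"
    return None


# Made-up sample: an MD5 hash together with its recovered plaintext.
cracked = {
    "password": "5f4dcc3b5aa765d61d8327deb882cf99",
    "password_type": "MD5",
    "password_plaintext": "password",
}
print(password_origin(cracked))                          # cracked
print(password_origin({"password_type": "plaintext"}))   # plaintext
print(password_origin({"password_type": "SHA1"}))        # None
```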

🧾 UUIDs

SpyCloud uses UUID v4 strings to uniquely identify:

  • Breach catalog entries (uuid)
  • Breach records (document_id)

Examples:

  • ae375975-894b-489c-876b-a294dddf0c96
  • b17d51a7-f4a9-479a-8820-801753e86c05
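If you want to check identifiers before storing or joining on them, Python's standard uuid module is sufficient. This is a minimal sketch; the helper name is_uuid4 is hypothetical, and the requirement that the string be in canonical hyphenated form is an assumption based on the examples above.

```python
import uuid


def is_uuid4(value):
    """Return True if `value` is a canonical, hyphenated UUID v4 string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces and un-hyphenated input, so compare
    # against the canonical form to reject those variants.
    return parsed.version == 4 and str(parsed) == value.lower()


print(is_uuid4("ae375975-894b-489c-876b-a294dddf0c96"))  # True
print(is_uuid4("not-a-uuid"))                            # False
```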

📚 Breach Catalog Schema

SpyCloud generates assets such as source_id, domain, email_domain, target_domain, target_subdomain, password_type, sighting, and severity, along with other related data points. Below is the full list of metadata fields available for each breach in SpyCloud's breach catalog.

Field | Type | Description
id | int | Numerical breach ID. Correlates to source_id in breach records.
uuid | string | UUID v4 encoded version of breach ID (used in Firehose file naming).
title | string | Breach title (if disclosable). Otherwise uses a generic label.
description | string | Breach description (if disclosable). Otherwise uses a generic summary.
site | string | Website of the breached organization (if known).
site_description | string | Description of the breached organization (if available).
type | string | Indicates if the breach is public (found online) or private (exclusive to SpyCloud).
num_records | int | Number of records parsed, normalized, and deduplicated from this breach.
spycloud_publish_date | datetime | Date when the breach was ingested and made available to customers.
acquisition_date | datetime | Date SpyCloud first acquired the breach.
breach_date | datetime | Estimated date the breach occurred.
public_date | datetime | When the breach was disclosed publicly.
media_urls | list | List of media URLs referencing the breach.
assets | dictionary | Dictionary mapping each asset to its count in the breach.
confidence | int | Confidence score in the breach source.
combo_list_flag | string | Indicates if the breach is a combo list.
breach_main_category | string | Classifies as combolist, breach, or malware.
breach_category | string | Further categorization: combolist, exfiltrated, exposed, infostealer, phished, scraped, or unknown.
breached_companies | list | Companies allegedly or confirmed involved in the breach.
targeted_companies | list | Companies determined to be targeted by the breach.
targeted_industries | list | Industries believed to be targeted by the breach.
sensitive_source | boolean | Indicates whether the source is sensitive or restricted.
malware_family | string | Malware family name (only if breach_category is infostealer).
status | string | Ingestion state. Pending: a new breach has been identified and is still being ingested; you can start consuming its incomplete breach records while the catalog entry is marked as pending. Verified: ingestion is complete and all records from the breach have been loaded, validated, and are ready for use.
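As an illustration of how these fields might be consumed, the sketch below summarizes one catalog entry: whether ingestion is complete and which assets are most common. It assumes entries arrive as key-value objects and compares status case-insensitively, since the exact casing of the value is not specified here. The function name summarize_entry and all sample values are hypothetical.

```python
def summarize_entry(entry, top_n=3):
    """Summarize one catalog entry: ingestion state and most common assets."""
    status = str(entry.get("status", "")).strip().lower()
    assets = entry.get("assets") or {}
    # Sort asset names by their count in the breach, highest first.
    top_assets = sorted(assets, key=assets.get, reverse=True)[:top_n]
    return {
        "id": entry.get("id"),
        "ingestion_complete": status == "verified",
        "top_assets": top_assets,
    }


# Made-up catalog entry for illustration only.
entry = {
    "id": 12345,
    "title": "Example Breach",
    "status": "Pending",
    "num_records": 5,
    "assets": {"username": 1, "email": 5, "phone": 3, "password": 4},
}
print(summarize_entry(entry))
```

A pending entry can still be processed; the ingestion_complete flag only signals that more records may arrive later.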


🧾 Breach Record Schema

Breach Records are collections of data assets parsed and ingested into our data platform. Each Breach Record contains multiple assets: some are extracted directly from the parsed data, while others are generated by the SpyCloud ingest process. The SpyCloud-generated assets listed below are always present in a breach record.

Field | Type | Description
document_id | UUID | Unique ID per identity
source_id | int | Links to breach catalog
spycloud_publish_date | datetime | When record was published
severity | int | Numerical value based on record contents (2–26)
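The always-present fields lend themselves to a simple sanity check, and source_id can be used to look up the matching catalog entry by its id. This is a sketch under assumptions: records and catalog entries are key-value objects, and severity is delivered as an integer. The names validate_record and find_breach, and all sample values, are hypothetical.

```python
REQUIRED_FIELDS = ("document_id", "source_id", "spycloud_publish_date", "severity")


def validate_record(record):
    """Return a list of problems found in the always-present fields."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if name not in record]
    severity = record.get("severity")
    if severity is not None and not 2 <= severity <= 26:
        problems.append(f"severity {severity} outside 2-26")
    return problems


def find_breach(record, catalog):
    """Return the catalog entry whose id equals the record's source_id, or None."""
    by_id = {entry["id"]: entry for entry in catalog}
    return by_id.get(record.get("source_id"))


# Made-up sample data for illustration only.
catalog = [{"id": 12345, "title": "Example Breach"}]
record = {
    "document_id": "ae375975-894b-489c-876b-a294dddf0c96",
    "source_id": 12345,
    "spycloud_publish_date": "2018-08-01T14:00:00Z",
    "severity": 25,
}
print(validate_record(record))               # []
print(find_breach(record, catalog)["title"])  # Example Breach
```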


🐛 Infected User Assets

Below is a table of all infected user (botnet) assets that may be present in a breach record.

Field | Type | Description
av_softwares | string | List of antivirus software installed on the machine.
display_resolution | string | The system display resolution.
form_cookies_data | string | Cookie data associated with this person.
form_post_data | string | Form post data associated with this person.
infected_machine_id | string | A unique identifier either extracted from an infostealer log, or an RFC 4122-compliant UUID generated by SpyCloud when none is present. Format and origin may vary by malware family.
log_id | string | A deterministic SHA256 hash computed from the contents of a malware log archive.
infected_path | string | Local path to the malicious software installed on the infected system.
infected_time | string | The time the system was infected with malware.
keyboard_languages | string | Keyboard languages configured in the OS.
logon_server | string | Logon server captured from the infected environment.
mac_address | string | 12-character alphanumeric MAC address of the device.
port | string | Network port paired with IP address.
system_install_date | datetime | Time at which the system OS was installed.
system_model | string | Model identifier of the infected system.
target_domain | string | Second-level domain (SLD) extracted from target_url.
target_subdomain | string | Full subdomain + domain extracted from target_url.
target_url | string | URL captured via keylogger or form capture.
user_agent | string | Browser user-agent string.
user_browser | string | Browser name used by the infected user.
user_hostname | string | Hostname of the infected system.
user_os | string | Operating system name.
user_sys_domain | string | System domain name.
user_sys_registered_organization | string | Name of the system's registered organization.
user_sys_registered_owner | string | Name of the system's registered owner.
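To illustrate how target_subdomain and target_domain relate to target_url, here is a deliberately naive sketch. It assumes the domain is the last two labels of the hostname, which is wrong for multi-part public suffixes such as co.uk and for IP-address hosts; SpyCloud's actual extraction logic is not described here and may differ. The function name split_target_url and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit


def split_target_url(target_url):
    """Derive a hostname and a naive two-label domain from a URL."""
    host = urlsplit(target_url).hostname or ""
    labels = host.split(".")
    # Naive assumption: the registrable domain is the last two labels.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {"target_subdomain": host, "target_domain": domain}


print(split_target_url("https://login.portal.example.com/signin?next=%2F"))
# {'target_subdomain': 'login.portal.example.com', 'target_domain': 'example.com'}
```

For production matching, a public suffix list would be a more reliable way to find the registrable domain.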


🔐 Credentials & Account Assets

Below is a table of all credential- and account-related assets that may appear in breach records.

Field | Type | Description
account_caption | string | Account profile caption.
account_id | string | Account number or ID.
account_image_url | string | Account image URL.
account_last_activity_time | datetime | Timestamp of last account activity (ISO 8601).
account_login_time | datetime | Last account login time (ISO 8601).
account_modification_time | datetime | Account modification date (ISO 8601).
account_nickname | string | Account nickname.
account_notes | string | Account notes.
account_password_date | datetime | Date when account password was set (ISO 8601).
account_secret | string | Account secret answer.
account_secret_question | string | Account secret question.
account_signup_time | datetime | Account signup date (ISO 8601).
account_status | string | Account status.
account_title | string | Account title.
account_type | string | Account type.
api_token | string | API token.
api_token_secret | string | API token secret.
backup_email | string | Backup email address.
backup_email_username | string | Username extracted from backup_email (before the @).
domain | string | Domain name.
email | string | Email address.
email_domain | string | Extracted domain from email (after the @).
email_username | string | Extracted username from email (before the @).
num_posts | int | Number of posts (typically forum-related).
password | string | Account password (original form).
password_plaintext | string | Plaintext version of the password (if cracked).
password_type | string | Password type (e.g., plaintext, SHA1, MD5).
private_key | string | SHA256 hash of private key.
private_key_password | string | Password for the private key.
public_key | string | SHA256 hash of public key.
salt | string | Password salt.
service | string | Service associated with credential (e.g., Spotify, Netflix).
service_expiration | string | Service expiration date (ISO 8601).
username | string | Username.
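The relationship between email, email_username, and email_domain can be sketched as a split on the last @ sign. This is an illustration only: it lowercases the address to mirror the normalization rules, does not validate addresses, and is not SpyCloud's ingest code. The function name split_email and the sample address are hypothetical.

```python
def split_email(email):
    """Split an email address into the parts before and after the last @.

    Returns None when either part is missing.
    """
    username, separator, domain = email.lower().rpartition("@")
    if not separator or not username or not domain:
        return None
    return {"email_username": username, "email_domain": domain}


print(split_email("Jane.Doe@Example.com"))
# {'email_username': 'jane.doe', 'email_domain': 'example.com'}
print(split_email("no-at-sign"))  # None
```

The same split applies to backup_email when deriving backup_email_username.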

🧩 Other Assets

SpyCloud recaptures more than 200 data types as identity artifacts, with a rich and diverse set of fields spanning PII, financial, geographical, and social media assets.

🧬 Versioning Policy

SpyCloud maintains backward compatibility during a major version lifecycle.

  • We never remove or rename existing fields in a current product version
  • New fields may be added without notice — clients should handle unknown keys
  • Breaking changes will only happen during a major upgrade (e.g., v1 → v2)
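One way for a client to handle unknown keys is to copy only the fields it knows about into a typed structure and ignore the rest. The sketch below assumes records arrive as key-value objects; the CredentialView class, the from_record helper, and the brand_new_field key are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CredentialView:
    """The subset of record fields this hypothetical client cares about."""
    document_id: str
    source_id: int
    email: Optional[str] = None
    password_type: Optional[str] = None


def from_record(record):
    """Build a CredentialView, silently dropping keys the client does not know."""
    known = {field.name for field in fields(CredentialView)}
    return CredentialView(**{key: value for key, value in record.items() if key in known})


# Made-up record containing a field added after the client was written.
record = {
    "document_id": "b17d51a7-f4a9-479a-8820-801753e86c05",
    "source_id": 12345,
    "email": "jane.doe@example.com",
    "brand_new_field": "ignored by this client",
}
print(from_record(record))
```

Because unknown keys are filtered out rather than rejected, a newly added field does not break parsing.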