CrowdStrike data scientists have contributed to EMBER2024, a major update to the most widely used open benchmark dataset for training machine learning models to detect malware. Presented at the KDD 2025 conference in Toronto, the updated dataset expands on the original 2018 release, covering over 3.2 million files across six formats, and adds a specially constructed “challenge set” of evasive samples that existing models find hardest to classify.
Why Malware Detection Benchmarks Matter
Modern endpoint security products rely heavily on machine learning models to detect malicious files. Training those models requires large, labelled datasets of both malicious and benign files — but creating such datasets is expensive and time-consuming. The original EMBER dataset, released in 2018, gave researchers and security vendors a common benchmark to train against, measure progress, and compare techniques. It has been cited in academic research over 700 times since publication.
The problem with any benchmark is that it ages. As defenders improve and attackers adapt, the threat landscape evolves, and a model trained on 2018 malware characteristics may be less effective against 2025 threats. EMBER2024 addresses this directly.
What Is in EMBER2024?
The updated dataset contains over 3.2 million files spanning six file formats: Win32 executables, Win64 executables, .NET assemblies, Android APKs, PDFs, and ELF binaries (Linux/Unix). It supports seven classification tasks, including malicious/benign detection, malware family classification, and behaviour identification.
A supplemental release adds raw bytes and disassembly data for 16.3 million functions extracted from malicious files using capa, an open-source tool from Mandiant’s FLARE team. The source code for feature calculation and model training is included, allowing researchers to replicate and extend the work.
The Challenge Set: Training Against Evasion
The most significant addition is a challenge set of 6,315 files that no antivirus product on VirusTotal flagged as malicious at the time of collection, but that were confirmed malicious 30 days later, once enough antivirus products agreed. These are real-world evasive samples that slipped through existing defences.
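The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' actual pipeline: the ScanRecord type, its field names, and the min_later_detections parameter are all hypothetical, and the threshold of 5 is a placeholder because the article does not say how many products had to agree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRecord:
    """Hypothetical record of two scans of the same file."""
    sha256: str
    detections_at_collection: int  # engines flagging the file when first seen
    detections_after_30_days: int  # engines flagging it on the later rescan


def select_challenge_candidates(records, min_later_detections):
    """Keep files missed by every engine at first, then flagged by enough later."""
    return [
        r for r in records
        if r.detections_at_collection == 0
        and r.detections_after_30_days >= min_later_detections
    ]


records = [
    ScanRecord("a" * 64, 0, 12),  # initially missed, later widely flagged
    ScanRecord("b" * 64, 0, 1),   # initially missed, later consensus too weak
    ScanRecord("c" * 64, 7, 20),  # detected from the start, so not evasive
]

# The threshold of 5 is an illustrative placeholder, not a value from the dataset.
challenge = select_challenge_candidates(records, min_later_detections=5)
print([r.sha256[:4] for r in challenge])
```

Only the first record passes both conditions: it had zero detections at collection and cleared the assumed consensus threshold on the rescan.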
The original EMBER benchmark model achieved a near-perfect ROC AUC of 0.99911 on the standard test set, which left that test too easy to serve as a meaningful measure of progress. The challenge set changes that: it isolates exactly the cases where models struggle, leaving real room for improvement and giving novel classification techniques a realistic test.
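For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen malicious file receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison on made-up scores (these are not EMBER results) to show how harder samples pull the figure down from a perfect value. The function name and the data are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """ROC AUC as the chance that a malicious file outscores a benign one.

    A tie between a malicious and a benign score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one malicious and one benign sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# 1 = malicious, 0 = benign; scores are invented for illustration.
labels = [1, 1, 1, 0, 0, 0]
easy_scores = [0.98, 0.95, 0.91, 0.10, 0.05, 0.02]  # classes cleanly separated
hard_scores = [0.98, 0.40, 0.15, 0.30, 0.20, 0.02]  # one evasive sample scores low

print(roc_auc(labels, easy_scores))
print(roc_auc(labels, hard_scores))
```

The pairwise loop is quadratic and is meant only to make the definition concrete; a rank-based implementation or a library routine would be the usual choice at dataset scale.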
Why This Matters Beyond Research Labs
Investments in open research datasets like EMBER2024 have direct downstream effects on commercial endpoint security products. Better training data produces more accurate models, which means fewer missed detections of evasive malware and fewer false positives disrupting legitimate business activity.
The challenge set in particular reflects a philosophy increasingly central to effective cybersecurity: the most important threats to train against are not the average ones but the hardest to detect, because those are exactly the attacks that succeed.
CrowdStrike captures this well in their own framing: “When defenders collaborate and share knowledge, we collectively strengthen our position against the threat actors who benefit from operating in the shadows.”

