
DFRWS APAC 2025 - Paper accepted
1 August 2025
Our paper "Media Source Similarity Hashing (MSSH): A Practical Method for Large-Scale Media Investigations" has been accepted at the Digital Forensic Research Workshop (DFRWS) APAC 2025 and will be presented in November 2025 in Seoul, South Korea.
Authors: Samantha Klier and Harald Baier
Abstract:
Hash functions play a crucial role in digital forensics to mitigate data overload. In addition to traditional cryptographic hash functions, similarity hashes - also known as approximate matching schemes - have emerged as effective tools for identifying media files with similar content. However, despite their relevance in investigative settings, a fast and practical method for identifying files originating from similar sources is still lacking. For example, in Child Sexual Abuse Material (CSAM) investigations, it is critical to distinguish between downloaded and potentially self-produced material. To address this gap, we introduce a Media Source Similarity Hash (MSSH), using JPEG images as a case study. MSSH leverages structural features of media files, converting them efficiently into Similarity Digests using n-gram representations. As such, MSSH constitutes the first syntactic approximate matching scheme. We evaluate the MSSH using our publicly available source code across seven datasets. The method achieves AUC scores exceeding 0.90 for native images (across device-, model-, and brand-level classifications, though the strong device-level performance likely reflects limitations in existing datasets rather than generalizable capability) and over 0.85 for samples obtained from social media platforms. Despite its lightweight design, MSSH delivers a performance comparable to that of resource-intensive, established Source Camera Identification (SCI) approaches, and surpasses them on a modern dataset, achieving an AUC of 0.97 compared to their AUCs, which range from 0.74 to 0.94. These results underscore MSSH’s effectiveness for media source analysis in digital forensics, while preserving the speed and utility advantages typical of hash-based methods.
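
To give a rough feel for what "structural features converted into n-gram representations" can mean for JPEG files, here is a minimal Python sketch. It is not the MSSH implementation from the paper. It rests on two assumptions made purely for illustration: the structural feature is the ordered sequence of JPEG segment markers up to the start of scan, and similarity is the Jaccard index over sets of marker 3-grams rather than the paper's Similarity Digest. All function names (`make_segment`, `jpeg_marker_sequence`, `ngrams`, `jaccard`) are hypothetical.

```python
import struct

# Markers that carry no length field (TEM, RST0-RST7, SOI, EOI).
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def make_segment(marker: int, payload: bytes) -> bytes:
    """Build one JPEG segment (hypothetical helper for synthetic test data)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


def jpeg_marker_sequence(data: bytes) -> list:
    """Return the marker codes from SOI up to and including SOS."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    markers = [0xD8]
    pos = 2
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            break  # malformed header: stop instead of guessing
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        markers.append(marker)
        if marker == 0xDA:  # SOS: entropy-coded data follows
            break
        if marker in STANDALONE:
            pos += 2
            continue
        if pos + 4 > len(data):
            break  # truncated length field
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        pos += 2 + length  # skip the marker and its payload
    return markers


def ngrams(seq, n=3):
    """Set of all contiguous n-grams of the sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two synthetic headers with different segment layouts (assumed data).
    tail = (make_segment(0xC0, bytes(15)) + make_segment(0xC4, bytes(29))
            + make_segment(0xDA, bytes(10)))
    img_a = b"\xff\xd8" + make_segment(0xE0, b"JFIF\x00") + make_segment(0xDB, bytes(65)) + tail
    img_b = (b"\xff\xd8" + make_segment(0xE1, b"Exif\x00\x00")
             + make_segment(0xDB, bytes(65)) + make_segment(0xDB, bytes(65)) + tail)
    sim = jaccard(ngrams(jpeg_marker_sequence(img_a)), ngrams(jpeg_marker_sequence(img_b)))
    print(f"structural similarity: {sim:.3f}")
```

The sketch only reads segment headers and never decodes pixel data, which is why approaches of this kind can stay lightweight. For the actual feature set, digest format, and evaluation, refer to the paper and the authors' published source code.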