The BitCurator.edu project team is building a library of datasets for educational, testing, and research purposes.

General Resources about Digital Forensics Sample Data

A comprehensive review of available datasets for cyber forensics research was presented at the 2017 Digital Forensics Research Workshop.

Forensics Wiki list of forensic corpora:

Digital Corpora

Using Digital Corpora for Educational Purposes

This is arguably the most directly applicable and widely used source for sample data in digital forensics education.  Simson Garfinkel and his collaborators have developed several realistic corpora for digital forensics education and research, available at http://digitalcorpora.org

These include “scenarios,” which represent fictional but realistic events.  For example, UNC SILS frequently uses the M57-Patents Scenario for classes and a variety of continuing education offerings, including conference workshops and the Digital Archives Specialist (DAS) digital forensics courses offered through the Society of American Archivists.  The full hard drive images are of a manageable size for longer assignments and exercises that require a drive with a full operating system; the USB flash drive images are smaller and well-suited for short workshops, class exercises and demonstrations.

Using Digital Corpora to Test Archival Workflows

The Digital Corpora “scenarios,” which represent fictional but realistic events, are also useful for workflow testing.  The full hard drive images are of a manageable size for testing that requires a drive with a full operating system; the USB flash drive images are smaller and well-suited for testing more targeted functionality.

Using Digital Corpora for Information Science Research

CD-ROM and Floppy Disk Library – Indiana University

Online collection of "nearly 5,000 CD-ROMs, DVDs, and floppy disks distributed by the GPO under the Federal Depository Library Program (FDLP). These tangible products have been received through the FDLP since the 1980s and consist of millions of individual files containing fundamental data on economics, the environment, population, and life and physical sciences."

https://webapp1.dlib.indiana.edu/virtual_disk_library/

Using the CD-ROM and Floppy Disk Library for Educational Purposes

Each disk image has an associated catalog record.  For use in classes, one approach is to share a subset of the images directly with students, so they don’t immediately encounter all of the descriptive metadata from the site.
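One way to prepare such a subset is sketched below in Python. It assumes the disk images have already been downloaded into a local folder; the function name, folder layout, and seed parameter are hypothetical and not part of the Indiana University site. Only the image files are copied, so the catalog records stay behind.

```python
import random
import shutil
from pathlib import Path


def copy_image_subset(source: Path, dest: Path, count: int, seed: int = 0) -> list[Path]:
    """Copy a reproducible random sample of files from source into dest.

    Hypothetical helper: assumes every regular file in source is a disk image.
    """
    images = sorted(p for p in source.iterdir() if p.is_file())
    # A seeded generator gives the same selection each time the sketch is run.
    chosen = random.Random(seed).sample(images, min(count, len(images)))
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 preserves file timestamps where the platform allows it.
    return [Path(shutil.copy2(p, dest / p.name)) for p in sorted(chosen)]
```

Fixing the seed means the same set of images can be regenerated for a later offering of the class.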

Using the CD-ROM and Floppy Disk Library to Test Archival Workflows

Using the CD-ROM and Floppy Disk Library for Information Science Research

Preservation and Access Virtual Education Laboratory (PAVEL)

The Preservation and Access Virtual Education Lab (PAVEL) project at the University of Michigan School of Information (2010-2012), funded by the National Endowment for the Humanities, developed a “virtual education laboratory featuring digital access and preservation tools.”  This includes four data sets:

  • a small set of six files for testing preservation tools
  • Elena Kagan's email (approximately 19,000 messages) from her tenure in the White House
  • the Enron email collection (approximately 600,000 messages)
  • a set of documents from the University of Michigan College of Literature, Science, and the Arts IT department (approximately 450 megabytes of Microsoft Office files)

If you would like access to these datasets, you can email the BitCurator.edu project team.

Using PAVEL for Educational Purposes

PAVEL data sets have the advantage of being selected for use in archival education, and they contain important supplementary information (e.g. organizational charts) for understanding their contexts of creation.  However, they do not allow for the replication of many forensic tasks.  For example, all of the email messages are stored as separate ASCII text files rather than embedded in a disk image or in a wrapper format such as PST, which is the form more likely to be encountered when acquiring a new collection.
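Because each message is a separate text file, students can inspect the collection with very little tooling. The Python sketch below uses only the standard library. It assumes each file holds one message with conventional "From", "Date", and "Subject" header lines and that the files use a .txt extension; neither assumption has been checked against the PAVEL data, and the function name is hypothetical.

```python
from email import message_from_string
from pathlib import Path


def summarize_messages(folder: Path) -> list[dict[str, str]]:
    """Return the file name, sender, date, and subject of each .txt message."""
    rows = []
    for path in sorted(folder.glob("*.txt")):
        # Replace any stray non-ASCII bytes instead of failing on them.
        text = path.read_text(encoding="ascii", errors="replace")
        msg = message_from_string(text)
        rows.append({
            "file": path.name,
            "from": str(msg.get("From", "")),
            "date": str(msg.get("Date", "")),
            "subject": str(msg.get("Subject", "")),
        })
    return rows
```

A PST file or a disk image would first require extraction with dedicated forensic tools, which is the step this layout leaves out.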

Using PAVEL to Test Archival Workflows

Using PAVEL for Information Science Research

Corpora Designed for Targeted System Testing

Brian Carrier’s valuable but highly targeted Digital Forensics Tool Testing Images:

Computer Forensic Reference Data Sets (CFReDS) from the National Institute of Standards and Technology (NIST):

Forensic Focus Test Images and Forensics Challenges:

Using Targeted System Testing Reports for Educational Purposes

Using Targeted System Testing Reports in Developing Archival Workflows

Using Targeted System Testing Reports for Information Science Research

Open Planets Foundation

The Open Planets Foundation maintains a site that includes “Datasets, preservation and curation Issues with those Datasets, and Solutions to those Issues.”  The experiences of solving specific Issues are written up on Solution pages, which then link to pages in the OPF Tool Registry.  In many cases, this leads to “actual code that can be downloaded and re-used.”

Using OPF for Educational Purposes

Using OPF for Testing Archival Workflows

Using OPF for Information Science Research

National Institute of Standards and Technology

The National Institute of Standards and Technology creates and maintains data sets including disk images for testing forensic tools. Some of these are deprecated over time; the current list can be found at:

Using NIST for Educational Purposes

Using NIST for Testing Archival Workflows

Using NIST for Information Science Research

