Public Data Resource

Complex Document Information Processing (CDIP) dataset

Contact: Ian Soboroff
Identifier: doi:10.18434/mds2-2531
Version: 1.1
First Released: 2022-02-04
Revised: 2022-04-20

Description

This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing", and "IIT" stands for the Illinois Institute of Technology, which originally built the dataset. The documents come from the states' lawsuit against the tobacco industry in the 1990s. Under the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all of the documents public in an archive, which currently resides at the University of California, San Francisco (UCSF). IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents carry handwriting, stains, and similar artifacts. TREC used an automatic text conversion of this dataset in the TREC Legal Track. The dataset consists of around 7 million documents, preprocessed with 1990s-era OCR, and we also have the original page scans in TIFF format. See the contact information in this record for access to this dataset.
Research Topics: Information Technology: Data and informatics    
Subject Keywords: optical character recognition, information retrieval, document structure, document understanding, image to text    

Data Access

These data are public.

About This Dataset

Cite this dataset
Ian Soboroff (2022), Complex Document Information Processing (CDIP) dataset, National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2531 (Accessed 2024-07-27)
Repository Metadata
Machine-readable descriptions of this dataset are available in the following formats:
NERDm
Access Metrics
Metrics data is not available for all datasets, including this one. This may be because the data is served via servers external to this repository.