Project Development – Reformatting
Only rarely does a project involve material that already exists in digital form. Texts, images, audio,
or video that are to become part of a digital project are generally reformatted from their original
analog format to a digital format. This process may serve various purposes, including preserving rare
or deteriorating materials, providing easy access to materials such as photos and slides that are not
easily handled, providing greater access to heavily used materials, and making texts searchable.
When new materials are brought into the UWDCC, they are first thoroughly inventoried
and evaluated. Based on this evaluation, UWDCC staff decide how best to reformat them.
The parameters selected for digitizing a particular item represent a balancing act between
detail, clarity, faithfulness, and ever-increasing storage requirements. During the evaluation process,
staff determine the level of detail necessary to adequately capture the information carried by the original object.
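To make the trade-off between detail and storage concrete, the following is a rough sketch of how uncompressed image size grows with scanning resolution. The function name, the 8 x 10 inch photograph, and the resolutions shown are illustrative assumptions, not UWDCC's actual settings; real TIFF files also carry headers and may be compressed.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw (uncompressed) pixel data size of a scan, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical example: an 8 x 10 inch photograph scanned in 24-bit color.
for dpi in (300, 600):
    size = uncompressed_size_bytes(8, 10, dpi, 24)
    print(f"{dpi} dpi: {size / 1_000_000:.1f} MB")
# 300 dpi: 21.6 MB
# 600 dpi: 86.4 MB
```

Doubling the resolution quadruples the pixel count, which is why the level of detail is chosen per item rather than set to the maximum for everything.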
Producing good digital images, or "scans," of original physical objects such as photographs or pages from a book involves many handling and technical considerations. The goal is to balance productivity and quality. Original materials that are not fragile can be disbound and reformatted using a high-speed scanner. This type of scanner uses a feeding system, similar to that of a fax machine, to process pages in large batches.
Books that cannot be disbound and other fragile items may be reformatted using flatbed scanners, similar to the types of scanners many people have in their homes.
Books and fragile items can also be reformatted using a more elaborate overhead scanning system called a Pulnix.
An adaptation to the flatbed scanner allows glass negatives to be scanned, and special scanners are also used for reformatting slides. Scanning initially produces a high-quality master image in TIFF (Tagged Image File Format). From this master, a script automatically produces smaller derivatives to be used on the web. Some additional work is occasionally needed, such as cropping (removing part of the image) or rotating the images.
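The text does not describe how the derivative script works. As a minimal sketch of one step such a script would likely need, the code below computes derivative pixel dimensions from a master image's dimensions while preserving the aspect ratio. The derivative names and edge lengths are hypothetical, and the actual resizing (done by an imaging library or tool) is left out.

```python
def fit_within(width, height, max_edge):
    """Scale (width, height) so the longer edge is at most max_edge,
    keeping the aspect ratio and never enlarging the image."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    new_width = max(1, round(width * max_edge / longest))
    new_height = max(1, round(height * max_edge / longest))
    return new_width, new_height


# Hypothetical derivative sizes (longest edge in pixels), not UWDCC's values.
DERIVATIVE_EDGES = {"thumbnail": 150, "reference": 800}


def derivative_sizes(width, height):
    """Return the target dimensions for each named derivative."""
    return {
        name: fit_within(width, height, edge)
        for name, edge in DERIVATIVE_EDGES.items()
    }


# A 4800 x 6000 pixel master (for example, 8 x 10 inches at 600 dpi).
print(derivative_sizes(4800, 6000))
# {'thumbnail': (120, 150), 'reference': (640, 800)}
```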
If the original work is a typeset text, such as a book, the scanned images may also be run through an optical character recognition (OCR) program. OCR is the method by which typeset and typed letters, numbers, and symbols are machine-read using optical sensing (usually a scanner) and a computer. A computer program analyzes the patterns and identifies the characters they represent, with some tolerance for imperfect or non-uniform text. The end product is plain text, similar to what one might see in a word processing file. The clearer the original text page, the better the scanned image, and thus the more accurate the OCR'd text. Even the cleanest scan, however, will not yield perfectly OCR'd text. The uncorrected OCR text is used for full-text keyword searching in the finished digital project.
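To illustrate how uncorrected OCR text can support keyword searching, here is a minimal sketch of an inverted index that maps each word to the pages containing it. The page identifiers and sample text are invented for illustration, and this is not a description of the search software UWDCC actually uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page IDs whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page IDs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output; "Tbe" simulates a misread of "The".
pages = {
    "p001": "The glass negatives were stored in the archive.",
    "p002": "Tbe archive holds rare photographs.",
}
index = build_index(pages)
print(sorted(search(index, "archive")))      # ['p001', 'p002']
print(sorted(search(index, "the archive")))  # ['p001']
```

The second query misses page p002 because of the simulated OCR error, which shows why searches over uncorrected text can be incomplete even when the scan is clean.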
