Tagging Document Analysis

Document Analysis
1. What are the significant informational features of the documents that will need to be represented?

One informational feature is the type of document. A manuscript schema for the metadata will work in most situations, but the trial transcript has features of its own that need to be identified. For example, users will need to know a) that the document is a transcript, b) that two sides present their cases (the university as the defense and the suspended professor as the prosecutor), and c) the title or name of each representative or witness who has lines in the transcript. TEI will be most appropriate for these documents.
The exhibits will require Dublin Core. Many contain graphics or will be digitized as objects (e.g. by scanning the cover and sample pages), so the standard entry fields will suffice to provide information. For example, the program about the Spanish Civil War will carry metadata that describes the source of the material, and the contributor field in a program such as Omeka will be the perfect place to distinguish between the prosecution and the defense in the presentation of exhibits.
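As a rough illustration of how a single line of testimony might carry the speaker and the side of the case, here is a minimal sketch in Python that builds one TEI speech element. It uses the TEI elements `sp`, `speaker`, and `p` and the attributes `who` and `ana`; using `ana` to point at the side is my own assumption, not a settled encoding decision, and the identifiers and text are hypothetical placeholders rather than content from the transcript.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def make_speech(speaker_id, speaker_label, side, text):
    """Build one <sp> element: who spoke, for which side, and what was said."""
    sp = ET.Element(
        f"{{{TEI_NS}}}sp",
        {"who": f"#{speaker_id}", "ana": f"#{side}"},  # 'ana' for side is an assumption
    )
    ET.SubElement(sp, f"{{{TEI_NS}}}speaker").text = speaker_label
    ET.SubElement(sp, f"{{{TEI_NS}}}p").text = text
    return sp


# Hypothetical placeholder values, not taken from the trial transcript.
speech = make_speech("witness1", "Hypothetical Witness", "defense",
                     "Placeholder testimony text.")
xml_text = ET.tostring(speech, encoding="unicode")
print(xml_text)
```

The `who` and `ana` values are pointers, so a full document would also need to define `witness1` and `defense` somewhere in its header.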

2. Who is the primary audience? Do they have special needs that can be supported through the encoding of the document?

The primary audience will hopefully be somewhat diverse. For those interested in law, especially at the university level, the distinction between prosecution and defense described above will be extremely useful. Additionally, because the website will have a searchable glossary, the metadata should clearly connect the glossary terms in the text to their definitions for researchers of all levels. Subject metadata will also be necessary, especially for amateur scholars, who could use it to find related subject matter across the transcripts, trial exhibits, and supplementary material.

3. What functions do you want to provide for your audience: what kinds of searching? What kinds of navigation?

The search functions will require the metadata described above, as well as encoding that makes acronyms searchable under their full names and vice versa (e.g. HUAC and House Un-American Activities Committee), allowing for possible misspellings and for the use or non-use of hyphens, capital letters, etc. I would also like to include the LoC subject headings as they appear in my section on added value.
Using TEI for one set of documents and Dublin Core for the other might present navigation issues because the metadata may not be consistent between the two systems. To counter this, I would hope that user tagging might bring to light some of the inconsistencies.
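The acronym searching described above could work along the following lines. This is only a sketch in Python under my own assumptions: the `ACRONYMS` table and the function names are hypothetical, and a real table would be built from the site glossary. It handles capital letters and hyphens by normalizing the query, but misspellings would still need extra entries in the table or some form of fuzzy matching.

```python
import re

# Hypothetical variant table; a real one would come from the site glossary.
ACRONYMS = {
    "huac": "house un american activities committee",
}


def normalize(query):
    """Lowercase, turn hyphens and punctuation into spaces, trim the ends."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", query.lower())
    return cleaned.strip()


def search_keys(query):
    """Return every normalized form that a query should be matched against."""
    key = normalize(query)
    keys = {key}
    for short, full in ACRONYMS.items():
        if key == short:
            keys.add(full)   # acronym typed: also search the full name
        elif key == full:
            keys.add(short)  # full name typed: also search the acronym
    return keys


print(search_keys("HUAC"))
print(search_keys("House Un-American Activities Committee"))
```

Both queries produce the same set of keys, so a search on either form would find documents indexed under the other.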

4. What are the significant chunks or subdivisions of your documents?

Significant chunks or subdivisions would be (for transcripts): date, prosecution/defense, speaker.
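To show how these three levels could nest for navigation, here is a small Python sketch that groups transcript lines by date, then side, then speaker. The `Utterance` class, the function name, and the sample lines are all hypothetical; only the hearing date is borrowed from the page-numbering example elsewhere in this analysis.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    date: str     # hearing date in ISO format
    side: str     # "prosecution" or "defense"
    speaker: str  # name or title of the person speaking
    text: str


def group_utterances(utterances):
    """Nest utterances as date -> side -> speaker -> list of texts."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for u in utterances:
        tree[u.date][u.side][u.speaker].append(u.text)
    return tree


# Placeholder lines, not quotations from the transcript.
sample = [
    Utterance("1953-10-05", "defense", "Speaker A", "Placeholder line one."),
    Utterance("1953-10-05", "prosecution", "Speaker B", "Placeholder line two."),
    Utterance("1953-10-05", "defense", "Speaker A", "Placeholder line three."),
]
tree = group_utterances(sample)
print(tree["1953-10-05"]["defense"]["Speaker A"])
```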

5. List as many as possible of your documents’ significant features that you would want to encode, and provide a justification for encoding these features. Think about audience, likely uses of the information, and the balance of cost and benefit.
- Document Title
- type of document (transcript, exhibit, etc., to differentiate between the transcript and the supplementary materials; within the exhibits, the specific type, e.g. manual, brochure, article)
- speakers in transcript (prosecution or defense, plus the name and organization of each individual, so that the user knows who is speaking)
- subjects covered (LoC headings, etc., widening the range of terms by which users can search)
- names (proper names as well as positions held in the case and, if applicable, in Cold War history, together with likely misspellings, to reduce problems caused by inconsistent user searches)
- creators (for exhibits, the original author of the document as well as the side of the trial that offered it into evidence)
- length (the number of pages in a transcript, e.g. October 5, 1953, page 2/7, which is easier to navigate than having only the name or the date available)

6. What kinds of regularization of your document—if any—would be useful and appropriate?
The only regularization that will be important is in the search functions, since the OCR and the transcript are almost completely accurate in spelling. There are some inconsistencies in the transcript, but they would need only silent regularization for searching; the edits in the original trial transcript should be preserved, since they show the changes that were made.

Here is a sample exhibit with the metadata I would like to include:
"The New Spanish Inquisition"
Title: "The New Spanish Inquisition"
Subject:
Creator: United American Spanish Aid Committee, New York
Contributor: Dean Pollock's Defense Advocate, New York University
Description: Pamphlet displaying members of United American Spanish Aid Committee with title graphic depicting persecution at the hands of fascists in Spain (Names encoded here, so that they can be searched for as well)
Identifier: PollockExhibit15
Date: December 1941
Source: Edwin Berry Burgum Academic Freedom Case, New York University Archives, RG 19, Box 4, Folder 1
Publisher: Unknown
Rights: Unknown
Format: Pamphlet
Language: English
Type: Pamphlet
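For comparison, the sample record could be serialized as simple Dublin Core XML roughly as follows. This Python sketch uses the standard Dublin Core element namespace; the `record` wrapper element and the function name are my own assumptions, and a program such as Omeka would handle storage in its own way. Fields that are empty or unknown in the sample, and the longer description and source fields, are left out for brevity rather than guessed.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Values copied from the sample exhibit; nothing here is newly invented.
fields = [
    ("title", "The New Spanish Inquisition"),
    ("creator", "United American Spanish Aid Committee, New York"),
    ("contributor", "Dean Pollock's Defense Advocate, New York University"),
    ("identifier", "PollockExhibit15"),
    ("date", "December 1941"),
    ("format", "Pamphlet"),
    ("language", "English"),
    ("type", "Pamphlet"),
]


def to_dublin_core(pairs):
    """Wrap (element name, value) pairs in a hypothetical <record> element."""
    root = ET.Element("record")  # wrapper name is an assumption
    for name, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


xml_text = ET.tostring(to_dublin_core(fields), encoding="unicode")
print(xml_text)
```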
