ROTC Tagging

Tagging Guidelines

The documents for this project are drawn from a variety of collections within the NYU Archives, but most of them share very similar features. They are primarily interdepartmental memos, letters, internal publications, flyers/postings (from student groups and the administration), and handouts. As such, they are predominantly typed documents that contain titles, authors/senders, receivers, and dates. These features will be the primary access points when searching the documents. To capture the nuances of this particular set of documents, XML tagging will be used, following a customized TEI schema built on the P5 guidelines. The schema is based on the manuscript description schema, with elements added from the figures module to describe images and to link the OCR transcriptions to the scanned images of the original documents.

Because the documents are typed and easy to read when scanned well, visitors to the site will primarily view the scans of the originals, while OCR transcriptions will be used for searching. For this reason, full formatting of the transcribed pages will not be a high priority. Major presentational features, including line and page breaks, will be captured. There are no interlineations within the documents, but a few have handwritten notes at the bottom; these will be captured at the end of each page within the transcription. Letterhead will also be captured: the departments and official titles of individuals are important for providing the most complete search possible, so letterhead carrying this information is useful to capture.
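
As a rough illustration of the page-level capture described above, the following Python sketch builds one transcribed page with a page break, line breaks, and a handwritten note placed at the end of the page. The elements pb, lb, and note are standard TEI P5 elements, but the function name encode_page, the attribute values, the choice to hold a whole page in a single p element, and the sample text are assumptions made for this example, not the project's confirmed encoding. The TEI namespace is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def encode_page(body, page_number, lines, handwritten_note=None):
    """Append one transcribed page to a <body> element (hypothetical helper)."""
    # <pb/> marks the start of the page.
    ET.SubElement(body, "pb", {"n": str(page_number)})
    block = ET.SubElement(body, "p")
    for line in lines:
        # <lb/> marks the start of each typed line; the line's text follows it.
        lb = ET.SubElement(block, "lb")
        lb.tail = line
    if handwritten_note:
        # Handwritten notes go at the end of the page, after the typed text.
        # The attribute values here are assumptions, not project conventions.
        note = ET.SubElement(body, "note", {"type": "handwritten", "place": "bottom"})
        note.text = handwritten_note


# Placeholder text, not taken from any document in the collection.
body = ET.Element("body")
encode_page(body, 1, ["First typed line", "Second typed line"],
            handwritten_note="Text of a handwritten note")
page_xml = ET.tostring(body, encoding="unicode")
```

Keeping the note outside the typed paragraph mirrors the guideline that handwritten additions are recorded at the end of each page rather than at their exact position.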

While the viewer will see the original documents, some minimal regularization will be required to facilitate searching in the transcribed versions. The full names of the author/sender, the receiver, and any person mentioned in a document will be added, and the date, collection, and the departments and official titles of people will be added if they are missing. Spelling will also be silently corrected to facilitate searching. Given the types of documents being digitized (memos, flyers, handouts, and other published documents), spelling errors are expected to be minimal, if any. When the OCR transcriptions are reviewed, all spelling errors will be corrected. Very few of the documents are over 3 pages, so subdivisions within individual documents are usually unnecessary. Documents of 4 pages or longer will be broken into 2-page subdivisions to make them easier to navigate.
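
The subdivision rule above can be sketched as a small Python function. The function name subdivide_pages and its parameters are hypothetical, and the handling of an odd page count (the last page forms its own subdivision) is an assumption, since the guidelines do not say what happens in that case.

```python
def subdivide_pages(page_count, threshold=4, size=2):
    """Return (first_page, last_page) ranges, 1-indexed and inclusive.

    Documents shorter than `threshold` pages stay as a single unit;
    longer ones are split into subdivisions of `size` pages.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count < threshold:
        return [(1, page_count)]
    # Assumption: a leftover page at the end becomes its own subdivision.
    return [
        (start, min(start + size - 1, page_count))
        for start in range(1, page_count + 1, size)
    ]


three_pages = subdivide_pages(3)   # stays whole
four_pages = subdivide_pages(4)    # two 2-page subdivisions
five_pages = subdivide_pages(5)    # two 2-page subdivisions plus one page
```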

There are a number of significant informational features, some mentioned above, that will need to be represented within each document. Here is an example of what will be tagged in a document, using the document AdHoc.5-1969.1.jpg:


Document Type: handout
Title: ROTC Field Day NYU: A Call to Conscience
Author/Sender: Ad Hoc Committee for the ROTC Field Day
Receiver: Ad Hoc Committee for the ROTC Field Day
Date: May 1968
Organization: Ad Hoc Committee for the ROTC Field Day
Subjects: Protest, Field Day, Cadets, Students, Johnson Administration
Collection: Student Protests (Archives H)
Item ID: AdHoc.5-1969.1.jpg
Copyright: New York University
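
To show how a record like this might be carried into a TEI header, here is a minimal Python sketch that builds one with the standard library. The element names are standard TEI P5 elements, but the mapping of each field to an element, the type value documentType, and the function name build_header are assumptions for illustration only; the project's customized schema may place these fields differently. Receiver and Organization are left out because this sketch does not try to guess where the schema records them, and the output has not been validated against any schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Values copied from the example record above.
record = {
    "document_type": "handout",
    "title": "ROTC Field Day NYU: A Call to Conscience",
    "author": "Ad Hoc Committee for the ROTC Field Day",
    "date": "May 1968",
    "subjects": ["Protest", "Field Day", "Cadets", "Students",
                 "Johnson Administration"],
    "collection": "Student Protests (Archives H)",
    "item_id": "AdHoc.5-1969.1.jpg",
    "copyright": "New York University",
}


def build_header(rec):
    """Build a teiHeader element from a metadata record (assumed mapping)."""
    ET.register_namespace("", TEI_NS)

    def tei(parent, tag, text=None, **attrs):
        # Create a child element in the TEI namespace.
        el = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
        el.text = text
        return el

    header = ET.Element(f"{{{TEI_NS}}}teiHeader")
    file_desc = tei(header, "fileDesc")

    title_stmt = tei(file_desc, "titleStmt")
    tei(title_stmt, "title", rec["title"])
    tei(title_stmt, "author", rec["author"])

    publication = tei(file_desc, "publicationStmt")
    tei(publication, "p", "Copyright: " + rec["copyright"])

    # Collection and item ID go in the manuscript description.
    ms_desc = tei(tei(file_desc, "sourceDesc"), "msDesc")
    ms_id = tei(ms_desc, "msIdentifier")
    tei(ms_id, "collection", rec["collection"])
    tei(ms_id, "idno", rec["item_id"])

    profile = tei(header, "profileDesc")
    tei(tei(profile, "creation"), "date", rec["date"])

    # Document type and subject keywords as <term> elements (assumption).
    keywords = tei(tei(profile, "textClass"), "keywords")
    tei(keywords, "term", rec["document_type"], type="documentType")
    for subject in rec["subjects"]:
        tei(keywords, "term", subject)
    return header


header_xml = ET.tostring(build_header(record), encoding="unicode")
```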

All documents will be encoded with the following information:

- Document Type (e.g. memo, publication, letter, handout, posted flyer)
- Title
- Author/Sender
- Receiver (the person to whom a letter or memo was addressed, or the intended audience if known)
- Date
- Organizations (This is a catch-all term for the student groups, administrative offices and academic departments)
- Subject Keywords
- People
- Collection
- Box
- Folder
- Item ID (name of document, e.g. AdHoc.5-1969.1.jpg)
- Copyright
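
One way to check a record against the list above before encoding is a short completeness test, sketched here in Python. The names REQUIRED_FIELDS, is_blank, and missing_fields are hypothetical, and the sample record simply re-keys the AdHoc.5-1969.1.jpg example with the labels from the list; treating Receiver as required follows the list, even though it is described as recorded "if known".

```python
REQUIRED_FIELDS = [
    "Document Type", "Title", "Author/Sender", "Receiver", "Date",
    "Organizations", "Subject Keywords", "People", "Collection",
    "Box", "Folder", "Item ID", "Copyright",
]


def is_blank(value):
    """Treat None, empty or whitespace-only strings, and empty lists as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    return len(value) == 0


def missing_fields(record):
    """Return the required fields that are absent or blank, in list order."""
    return [name for name in REQUIRED_FIELDS if is_blank(record.get(name))]


# The example record from earlier on this page, keyed by the labels above.
sample = {
    "Document Type": "handout",
    "Title": "ROTC Field Day NYU: A Call to Conscience",
    "Author/Sender": "Ad Hoc Committee for the ROTC Field Day",
    "Receiver": "Ad Hoc Committee for the ROTC Field Day",
    "Date": "May 1968",
    "Organizations": ["Ad Hoc Committee for the ROTC Field Day"],
    "Subject Keywords": ["Protest", "Field Day", "Cadets", "Students",
                         "Johnson Administration"],
    "Collection": "Student Protests (Archives H)",
    "Item ID": "AdHoc.5-1969.1.jpg",
    "Copyright": "New York University",
}

still_needed = missing_fields(sample)  # the fields the example does not list
```

Run against the example, the check reports People, Box, and Folder, which the example record does not include.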
