Draft Tagging Guidelines for TB Research Perspectives

Project and Document Analysis

Consider the kind of materials you’ve chosen to focus on for this project, and consider the nature
of the digital publication you are (hypothetically) planning. Then briefly answer the following
questions:

1. What are the significant informational features of the documents that will need to be represented?

We will want to capture references to the following significant data points:

  • Article type
  • Clinical institutions
  • Names of all individuals mentioned
  • Names of Journals cited
  • Titles of articles cited
  • Authors of cited articles
  • Individual studies
    • Year of study
    • Type of study
    • Supporting institution
    • Study objective(s)
    • Study methodology
    • Results
    • Conclusions
  • Test types
    • Type ID
    • Frequency (or total number) of observations
    • Relative effectiveness
    • Technologies employed
  • Sample population characteristics
    • Age of sample population
      • Age range
      • Mean age
    • Sample size
    • Geographic origin of population
      • Urban
      • Rural
    • Diagnostic statistics of population
  • Symptoms presented (including asymptomatic)
  • Types of TB observed
    • Rapidly progressive lesions
    • Slowly progressive lesions
    • Rapidly retrogressive lesions
    • Slowly retrogressive lesions
  • TB disease stage
  • Disease resistance
  • Mortality
  • Behavioral differences observed
  • Period of latency before presentation of symptoms
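
As a rough illustration of how a few of the features listed above might be captured in XML, the following Python sketch builds one encoded study record. The element and attribute names (study, population, ageRange, lesion) are hypothetical, not an adopted schema, and the sample values are invented placeholders rather than data from any article in the collection.

```python
import xml.etree.ElementTree as ET


def encode_study(study: dict) -> ET.Element:
    """Build a <study> element from a dict of feature values.

    All tag and attribute names here are illustrative assumptions.
    """
    root = ET.Element(
        "study", attrib={"year": str(study["year"]), "type": study["type"]}
    )
    ET.SubElement(root, "institution").text = study["institution"]

    # Sample population characteristics are grouped under one element.
    population = ET.SubElement(
        root,
        "population",
        attrib={"size": str(study["sample_size"]), "setting": study["setting"]},
    )
    low, high = study["age_range"]
    ET.SubElement(
        population, "ageRange", attrib={"min": str(low), "max": str(high)}
    )

    # One <lesion> element per type of TB lesion observed.
    for progression in study["lesions"]:
        ET.SubElement(root, "lesion", attrib={"progression": progression})
    return root


# Placeholder values for illustration only.
example = {
    "year": 1935,
    "type": "clinical survey",
    "institution": "Example Sanatorium",
    "sample_size": 120,
    "setting": "rural",
    "age_range": (5, 14),
    "lesions": ["rapidly progressive", "slowly retrogressive"],
}

xml_text = ET.tostring(encode_study(example), encoding="unicode")
print(xml_text)
```

Keeping values such as year and setting in attributes, rather than in free text, is one way to make them easy to query later; whether attributes or child elements suit the project better would need to be settled in the final tagging guidelines.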

2. Who is the primary audience? Do they have special needs that can be supported through the encoding of the document?

The primary audience is researchers. The data points identified above are intended to support queries across all of the historical research articles. For example, a researcher may want to know the age, type, and stage of infection of patients from rural settings mentioned in the research articles held on the site. Or she may want to know which articles were published by a particular journal during a particular decade, in order to discern a particular interest or bias of approach. In addition to infectious disease scientists, historians of science may wish to pursue a particular hypothesis, and policy advocates may wish to see whether a particular approach had any effect during a particular time or in a particular place.

3. What functions do you want to provide for your audience: what kinds of searching? What
kinds of navigation?

Our initial list of research articles is relatively small (about 40). This permits us to contemplate re-keying all of the articles by hand in order to have a fully digitized text. With texts in this format we can more easily identify the elements we wish to tag and carry out the tagging. We would then be able to offer our visitors access to the data in two complementary formats. One would be a simple keyword search using a search tool such as the one provided by Google. The other would be a database query function that uses category and keyword drop-down menus to make explicit the database searches that are possible. Using PHP and MySQL, the queries constructed by researchers would be built dynamically and take advantage of the functionality offered by a relational database.
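
The drop-down query function described above could work roughly as in the following sketch. It is written in Python with the standard-library sqlite3 module purely as a stand-in for the planned PHP and MySQL implementation. The table names, column names, and sample rows are all hypothetical placeholders, not the project's actual schema or data.

```python
import sqlite3

# In-memory database standing in for the planned MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, journal TEXT, year INTEGER)"
)
conn.execute(
    "CREATE TABLE observation ("
    " article_id INTEGER REFERENCES article(id),"
    " setting TEXT, tb_type TEXT, stage TEXT, mean_age REAL)"
)

# Invented placeholder rows, for illustration only.
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?)",
    [(1, "Journal A", 1932), (2, "Journal B", 1947)],
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?)",
    [
        (1, "rural", "pulmonary", "early", 24.5),
        (1, "urban", "pulmonary", "advanced", 31.0),
        (2, "rural", "miliary", "advanced", 8.0),
    ],
)


def find_observations(conn, setting, decade_start):
    """Return observations for one setting within one decade.

    The two arguments correspond to values a visitor would pick from
    drop-down menus; they are passed as bound parameters, never pasted
    into the SQL string.
    """
    sql = """
        SELECT a.journal, a.year, o.tb_type, o.stage, o.mean_age
        FROM observation AS o
        JOIN article AS a ON a.id = o.article_id
        WHERE o.setting = ? AND a.year BETWEEN ? AND ?
        ORDER BY a.year
    """
    return conn.execute(sql, (setting, decade_start, decade_start + 9)).fetchall()


rows = find_observations(conn, "rural", 1930)
print(rows)
```

Binding the menu selections as parameters keeps the dynamically built queries safe from malformed input, and the same pattern is available in PHP through prepared statements.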

4. What are the significant chunks or subdivisions of your documents?

I will follow The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials in subdividing my documents into distinct classes of metadata. That guide identifies “descriptive,” “administrative,” “structural,” “text,” and “image” metadata categories. (This project does not include any audiovisual content, one additional category proposed by the NINCH Guide, and will profit by avoiding the added burden of digitizing and supporting streaming media and its infrastructural requirements and costs.)

For my purposes, the descriptive metadata would include all the citation information for all the historic research articles, sources and credits for photos and posters, Library of Congress subject headings, etc. The administrative metadata would include any information a future project manager would need to keep the TB Research Perspectives website running smoothly should I no longer be responsible for the site’s maintenance. This would include image resolutions and formats, contacts, and access details for any digital resources not housed on local servers. The structural metadata would include everything the webmaster or web developer should record in their documentation. For example, every element with a distinct function in the PHP code should have a comment identifying its function. The tables in the MySQL database, and the structure of those tables, would likewise be documented as comments embedded in the PHP code. All of this documentation and commentary would constitute the structural metadata. The text metadata would include all of the items or types of items mentioned above in response to question 1. The image metadata will classify the visual elements in as simple and standardized a format as possible. This would include descriptors of the original graphical material such as:

  • Medium
    • Photograph
    • Painting
    • Sketch
    • Political Cartoon
    • Public Policy Poster
  • Type of chart, graph or map
  • Subjects depicted in image
    • Children
    • Researchers
    • Clinicians
    • Adults
    • Poverty
    • Environmental conditions
  • Geography
    • Region
    • Country
    • City
  • Time frame of creation
    • Year
    • Decade
  • Author: painter, photographer, etc.
  • Copyright or permissions
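
One possible way to keep the image descriptors above simple and standardized is a controlled vocabulary that is checked whenever a record is created. The Python sketch below assumes that approach. The class name, field names, and the sample record are hypothetical, and the vocabularies simply echo the list above.

```python
from dataclasses import dataclass
from typing import List

# Controlled vocabularies taken from the descriptor list above.
MEDIA = {"photograph", "painting", "sketch", "political cartoon", "public policy poster"}
SUBJECTS = {
    "children", "researchers", "clinicians", "adults",
    "poverty", "environmental conditions",
}


@dataclass
class ImageRecord:
    """Metadata for one image; field names are illustrative assumptions."""

    medium: str
    subjects: List[str]
    year: int
    country: str = ""
    creator: str = ""
    permissions: str = ""

    def __post_init__(self):
        # Reject values that fall outside the controlled vocabularies.
        if self.medium not in MEDIA:
            raise ValueError(f"unknown medium: {self.medium!r}")
        unknown = set(self.subjects) - SUBJECTS
        if unknown:
            raise ValueError(f"unknown subjects: {sorted(unknown)}")

    @property
    def decade(self) -> int:
        """Derive the decade from the year so the two cannot disagree."""
        return self.year // 10 * 10


# Placeholder record for illustration only.
record = ImageRecord(medium="photograph", subjects=["children", "poverty"], year=1936)
print(record.decade)
```

Deriving the decade from the year, rather than storing both, avoids the two values drifting apart; an image known only by decade would need a different treatment.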

5. List as many as possible of your documents’ significant features that you would want to encode, and provide a justification for encoding these features. Think about audience, likely uses of the information, and the balance of cost and benefit.

The “significant features” I would encode are listed in the response to the first question. The justification for this level of detail is that some of the categories in that response will apply to some of the documents and not to others. The list of significant features may need to be adjusted as more of the documents are analyzed, but by capturing as many of these details as possible we will be able to offer the maximum of extractable information to the visiting researcher, who is the primary intended audience. Even with a relatively small number of core documents (about 40) this will be a considerable task. The project will want to purchase an XML encoding tool to assist in this effort. As this project is unique and the value of the research may actually increase as time passes, this effort will not be in vain. The number of cases is so great, and the drug regimen so time-consuming and costly (particularly in less developed countries), that extremely drug-resistant strains of Mycobacterium tuberculosis are certain to become more widespread. This effectively sets the research clock back to the end of the last period of great discoveries, which is the time frame of our “archive.”

6. What are the significant presentational features of your document? How much of this information do you consider important to capture?

There are three presentational features of the TB Research Perspectives website that make it virtually unique. First, there is no comparable archive of historic research articles. Second, there is no other comparable website combining the scholarly and technical aspects of TB research with the humanistic aspects of the individual and social experience. Finally, there is no website (at least none that I have been able to find) that places the technical and the humanistic aspects into an international historical perspective. (I am taking the term “document” in this question to refer to the entire set of documents included in the website, so that “document” = “website.”)

7. What kinds of regularization of your document—if any—would be useful and appropriate?
Would you regularize silently or preserve the original reading? Again, think about audience and probable use (including long-term use) of the data.

The only standardization or “regularization” the website intends to impose is the use of a standard vocabulary (where that may be necessary) for technical terms that may have changed during the period of the article collection. No changes will be made to the original text other than to insert hypertext annotations for any terms that have changed, evolved, or been corrected by later research. As a safeguard against misinterpretation or suspicion of inappropriate manipulation, all of the articles that have been digitized (virtually the entire collection) will offer a link to a scanned PDF version displaying a facsimile copy of the original paper document.
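
The annotation approach described above, in which the original reading is preserved and the standard term is attached to it, might look something like the following Python sketch. The glossary entries, the span markup, and the function names are assumptions made for illustration, and the sketch assumes the source text is plain text containing no HTML of its own.

```python
import re
from html import escape

# Hypothetical glossary: historical term -> standard modern term.
MODERN_TERMS = {
    "phthisis": "pulmonary tuberculosis",
    "consumption": "tuberculosis",
}


def annotate(text, glossary):
    """Wrap each historical term in a span carrying the modern term.

    The original wording, including its capitalization, is left in place.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in glossary) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        original = match.group(0)
        modern = glossary[original.lower()]
        return f'<span class="term" title="{escape(modern)}">{original}</span>'

    return pattern.sub(wrap, text)


def strip_annotations(annotated):
    """Remove the span markup to recover the original reading."""
    return re.sub(r"</?span[^>]*>", "", annotated)


source = "Cases of Phthisis were common in the district."
annotated = annotate(source, MODERN_TERMS)
print(annotated)
```

Because the markup can be stripped to give back exactly the text that was keyed in, the regularization is never silent: a reader can always compare the annotated version with the original and with the scanned facsimile.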
