When to Encode Text

Text is encoded to identify it, to record its function and its relationship to other data, to specify how it should be displayed, and to make it easy to share.

When setting an encoding policy, decide what is worth the time to encode and what can be left searchable only as a plain text string.

  • Sorting - If you want to sort your texts by a piece of text within them (e.g. author, title, or date), you need to identify that text so that your program knows what it means.
  • Regularizing - When your data can be written in a number of different ways, you can use encoding to regularize it, which makes it easier to sort. Dates are one example: your computer cannot tell that Sept. 2, '50 and 1950/09/02 mean the same thing unless you tell it. Encoding a string of text lets you record its regularized value in attributes. Names are another example: they might be rendered in very different ways, yet all refer to the same person.
  • Searching - If you want to search on more than one criterion at once, such as all letters written from Paris in 1939 to Margaret Sanger, you need to identify each of those pieces of information. A text-string search might locate the texts that include those pieces of information, but it would also return a letter written by Margaret Sanger in London in 1959 in which she talks about the Paris Opera and provides a phone number containing 1939.
  • Ease of processing - Programs can work faster and more accurately when the data is structured. Structured data lets you adopt text-mining software, create web forms that "encode" the text supplied, and use crowdsourcing to refine or add tags.
  • Sharing - Shared encoding standards allow data created at multiple sites to be viewed, shared, sorted, searched, and processed together.
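
The sorting, regularizing, and searching points above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the three sample letters are invented, the element names (letter, author, recipient, place, date, body) and the key and when attributes are hypothetical rather than a prescribed schema, and the dates are regularized in year-month-day form with hyphens so that they sort as plain strings.

```python
import xml.etree.ElementTree as ET

# Invented sample letters. Element and attribute names are illustrative only.
# "key" holds a regularized identifier; "when" holds a regularized date.
SAMPLE = """
<letters>
  <letter id="a">
    <author key="sanger_margaret">Margaret Sanger</author>
    <recipient key="doe_j">J. Doe</recipient>
    <place key="london">London</place>
    <date when="1959-03-14">March 14, 1959</date>
    <body>We enjoyed the Paris Opera. Telephone: Regent 1939.</body>
  </letter>
  <letter id="b">
    <author key="doe_j">J. Doe</author>
    <recipient key="sanger_margaret">Margaret Sanger</recipient>
    <place key="paris">Paris</place>
    <date when="1939-06-01">June 1, 1939</date>
    <body>The weather has been fine.</body>
  </letter>
  <letter id="c">
    <author key="doe_j">J. Doe</author>
    <recipient key="sanger_margaret">Mrs. Sanger</recipient>
    <place key="new_york">New York</place>
    <date when="1950-09-02">Sept. 2, '50</date>
    <body>Thank you for your note.</body>
  </letter>
</letters>
"""

root = ET.fromstring(SAMPLE)


def text_search(root, terms):
    """Return ids of letters whose full text contains every term."""
    hits = []
    for letter in root.iter("letter"):
        text = " ".join(letter.itertext())
        if all(term in text for term in terms):
            hits.append(letter.get("id"))
    return hits


def structured_search(root, place, year, recipient):
    """Return ids of letters matching the encoded place, year, and recipient."""
    hits = []
    for letter in root.iter("letter"):
        if (letter.find("place").get("key") == place
                and letter.find("date").get("when").startswith(year)
                and letter.find("recipient").get("key") == recipient):
            hits.append(letter.get("id"))
    return hits


def sort_by_date(root):
    """Return letter ids ordered by the regularized date attribute."""
    letters = sorted(root.iter("letter"),
                     key=lambda letter: letter.find("date").get("when"))
    return [letter.get("id") for letter in letters]


# The text-string search also returns the 1959 London letter.
print(text_search(root, ["Paris", "1939", "Margaret Sanger"]))    # ['a', 'b']
# The search on encoded fields returns only the letter actually wanted.
print(structured_search(root, "paris", "1939", "sanger_margaret"))  # ['b']
# The written dates differ in form, but the regularized values sort correctly.
print(sort_by_date(root))                                          # ['b', 'c', 'a']
```

Because the regularized values live in attributes, the text as written ("Sept. 2, '50", "Mrs. Sanger") is preserved for display while the program sorts and searches on the consistent form.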