Thursday, March 30, 2006

Creating metadata in more than one language

A conversation earlier this week got me thinking about creating metadata in multiple languages for every item described. When I raised this with my students, a student in Japan agreed that it can be a problem. Japanese is written with three different character sets -- Chinese characters (kanji), hiragana, and katakana (the latter two are together called "kana") -- and, as the student described it, they are used interchangeably. An author's name, for example, can be written in any of the three. None of the character sets is more correct or more dominant. The student pointed to this piece of metadata to illustrate the problem. Notice that the title, imprint and author are each written three times, each time in a different character set (in this case, the third is a more Romanized form).
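To make the student's example concrete, here is a minimal sketch in Python of a single metadata field that carries the same value in several scripts and can be searched through any of them. The class name `MultiScriptField`, the script labels, and the sample author are my own illustration, not taken from the record the student pointed to, and real catalog formats handle parallel script forms in their own ways.

```python
import unicodedata
from dataclasses import dataclass, field


def normalize(text: str) -> str:
    """Fold width variants and case, and drop spaces, before comparing."""
    # NFKC maps half-width katakana and full-width Latin to standard forms.
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(folded.split())


@dataclass
class MultiScriptField:
    """One metadata value recorded in several scripts (hypothetical structure)."""
    forms: dict = field(default_factory=dict)  # script label -> written form

    def matches(self, query: str) -> bool:
        """True if the query appears in any of the recorded written forms."""
        q = normalize(query)
        return any(q in normalize(value) for value in self.forms.values())


# Illustrative record: one author, four written forms of the same name.
author = MultiScriptField({
    "kanji": "夏目漱石",
    "hiragana": "なつめ そうせき",
    "katakana": "ナツメ ソウセキ",
    "romaji": "Natsume Sōseki",
})

for query in ["漱石", "なつめ", "ﾅﾂﾒ", "NATSUME", "芥川"]:
    print(query, author.matches(query))
```

The point of the sketch is that a searcher can only find the item through a script that someone actually entered. Every extra written form is extra cataloging work, which is exactly the cost being discussed here.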

The same problem would occur in areas of the world where more than one language is spoken and no single language is dominant.

Of course, you will immediately think of machine translation, but machine translation may not be accurate. Humans would do better work.

What is the impact? In some cases, this means that digitization is done in order to preserve the content, but not to provide access, since "access" means searching and searching means metadata. At the moment, there are digitization projects occurring in Japan, but most are not for access, due to the metadata problem (at least that is what I have been told). This is a problem -- and impact -- I had not considered. I'll be interested to hear how projects are overcoming it. Perhaps some solutions will come out of the projects being developed in the European Union.

4/18/2006: See a follow-up post here with updated/corrected info.

