One of the things that we learn about in Library School is how important controlled vocabularies are. Not every student loves the cataloging class, but we all have to take one as part of the core coursework. Learning about the importance of controlled vocabularies and the role that they play in the creation and retrieval of information is key to the world of library and information science.
One of the introductory exercises that we perform as students in cataloging classes inevitably involves comparing searches using library catalogs with those using a search engine such as Google. The results reveal that searches using library catalogs most often return fewer “hits,” and that those hits are more likely to match our search intentions in the first place. Search engines, on the other hand, while they might return results that can guide us to information sources that we wish to discover, most often point us towards a plethora of other resources, many of which can be completely unrelated to the search that we in fact have in mind.
So, why are the results returned in a Google search so different from those in a library catalog? Well, now I will get back to my original thoughts posted in this blog entry and talk a bit more about the idea of control. The difference in the results of the two types of searches comes down to a matter of imposed structure, in other words, control. In the world of library and information science, metadata about objects/items exists in the form of records that involve the use of things like controlled vocabularies, thesauri, and ISO standards. There are rules that should be adhered to when creating metadata for the purposes of information retrieval. Without structure either embedded within them, or linked to them via metadata records, documents existing on the Internet are mere blocks of text, all alone in a big cyber world, with no way to connect to would-be information consumers.
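To make the idea of a controlled vocabulary concrete, here is a minimal sketch in Python. The vocabulary and field names below are invented for illustration (a real catalog would use an established scheme such as LCSH); the point is simply that the record refuses subject terms from outside the agreed-upon list:

```python
# A toy controlled vocabulary of subject headings (invented for this example;
# a real catalog would draw on a published scheme such as LCSH).
CONTROLLED_SUBJECTS = {"Cataloging", "Metadata", "Information retrieval"}

def make_record(title, subjects):
    """Build a metadata record, rejecting subjects outside the vocabulary."""
    invalid = [s for s in subjects if s not in CONTROLLED_SUBJECTS]
    if invalid:
        raise ValueError(f"Uncontrolled subject terms: {invalid}")
    return {"title": title, "subjects": list(subjects)}

record = make_record("Introduction to Cataloging", ["Cataloging", "Metadata"])
```

A record tagged with "Astrology" would be rejected here, because that term is not in the vocabulary; this is the "imposed structure" that makes later retrieval predictable.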
Although there exist lots and lots of unstructured documents on the Internet, there are many efforts underway which do impose order on the chaotic world of the Web. One such effort, among many, is that of Wikimedia and its various projects. One of these projects is called Wikidata.
If you are interested in familiarizing yourself a bit with what Wikimedia is doing with its Wikidata project, click on the link provided for their address, scroll to the section entitled Contribute, and click on List of Properties Used in Wikidata Entries. Scroll down and notice a variety of “entities”. In this case, the entities that Wikidata wishes to describe happen to be very basic things such as person, place, organization, etc. Next click on, for example, the tab labeled Person and note a table listing the characteristics that Wikidata treats as properties of a person, along with recommended data types (i.e. strings, items, etc.) that one should use when describing each characteristic. All of these recommendations represent an attempt to impose structure on those creating and contributing documents to the Internet.
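A tiny sketch of how such a property table constrains contributions might look like the following. The property IDs P31 (instance of), P569 (date of birth), and P735 (given name) are real Wikidata properties, but the table and the checking function are simplified illustrations, not Wikidata's actual validation machinery:

```python
# A small excerpt of person-related Wikidata properties (IDs are real;
# the table itself is a simplified sketch for illustration).
PERSON_PROPERTIES = {
    "P31":  {"label": "instance of",   "datatype": "item"},
    "P569": {"label": "date of birth", "datatype": "time"},
    "P735": {"label": "given name",    "datatype": "item"},
}

def check_statement(prop_id, datatype):
    """Return True if a statement's datatype matches the recommended one."""
    expected = PERSON_PROPERTIES.get(prop_id)
    return expected is not None and expected["datatype"] == datatype

ok = check_statement("P31", "item")    # e.g. a person is an instance of "human"
bad = check_statement("P569", "item")  # a date of birth should be a time, not an item
```

The recommended data types play the same role here that cataloging rules play in a library: they keep every contributor describing a "person" in the same, machine-readable way.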
What is the point of all of this control? Isn’t the Web supposed to be a savage place, wildly free, and without restriction? The point is that without some control, it would be (mostly) impossible to harness the power of the never-ending flow of information that is added to the Internet. In a nutshell, structure allows algorithms to extract information embedded within documents, as well as harvest information that exists in metadata repositories. Without some control/standardization, it would not be possible for organizations to harvest and share information as they do, nor would it be possible for algorithms to “interpret” data in such a way as to increase the dissemination and retrieval of information.
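The difference that structure makes for retrieval can be sketched in a few lines. The records and subject terms below are invented for the example; the contrast is the same one from the catalog-versus-Google exercise above:

```python
# Toy records: titles are free text, subjects use a controlled term.
records = [
    {"title": "The Python Programming Language", "subjects": ["Computer science"]},
    {"title": "Field Guide to Pythons",          "subjects": ["Herpetology"]},
]

def free_text_search(query):
    """Match the query anywhere in the title, like a naive search engine."""
    return [r for r in records if query.lower() in r["title"].lower()]

def subject_search(term):
    """Match only the controlled subject field, like a catalog search."""
    return [r for r in records if term in r["subjects"]]

both = free_text_search("python")            # matches both books
precise = subject_search("Computer science") # matches only the programming book
```

The free-text search cannot tell the programming language from the snake; the controlled subject field can, which is exactly why catalog searches return fewer but more relevant hits.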