Models and Data Structures

Here in Library Land we have been modeling data, information, and knowledge for a long time. These models were almost always representations of the materials owned by a library, and they have always been limited by their underlying data structures. Why not use one of the newer data structures -- vector databases?

Simple Lists

Pretty much the simplest data structure is the list, and using a list librarians created bibliographies and ultimately catalogs. Each item in the list was described and elaborated upon with a number of characteristics, usually things such as authors, titles, and dates -- bibliographics. "Duh!" As library collections grew, so did the need for additional characteristics: extents, subjects, added entries, notes, etc. The cool thing about bibliographies and catalogs is two-fold: 1) they are easy to get one's head around, and 2) they are amenable to printing. Think the venerable English Short Title Catalog or the National Union Catalog.

One-To-Many Relationships

With the advent of computers, richer, more complex, and more nuanced data structures presented themselves. The most important feature of these computer-based data structures was their ability to easily manifest one-to-many relationships. MARC is such a data structure. Relational databases are another. So are XML and JSON. Given one-to-many relationships, models of library collections easily lend themselves to the identification of inexplicit connections between items. "Here is an item of interest. What other items have similar characteristics, and to what degree do all of those similar items have other similarities?" Such was possible to implement and practice upon using those big cabinets of printed catalog cards, but relational databases and Linked Data are better examples.♦
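To make the idea concrete, here is a minimal sketch in Python. It is not how any particular catalog system works; the records, the subject headings, and the function names (build_subject_index, similar_items) are all made up for illustration. Each item has many subjects, and inverting that one-to-many relationship is what lets us ask, "What other items share characteristics with this one, and to what degree?"

```python
from collections import defaultdict

# Hypothetical catalog records; each item has many subjects (one-to-many).
catalog = {
    "iliad": {"title": "The Iliad", "authors": ["Homer"],
              "subjects": ["Trojan War", "Epic poetry", "Achilles"]},
    "odyssey": {"title": "The Odyssey", "authors": ["Homer"],
                "subjects": ["Epic poetry", "Odysseus"]},
    "aeneid": {"title": "The Aeneid", "authors": ["Virgil"],
               "subjects": ["Epic poetry", "Trojan War", "Aeneas"]},
    "walden": {"title": "Walden", "authors": ["Henry David Thoreau"],
               "subjects": ["Natural history", "Solitude"]},
}


def build_subject_index(records):
    """Invert the one-to-many relationship: subject -> set of item ids."""
    index = defaultdict(set)
    for item_id, record in records.items():
        for subject in record["subjects"]:
            index[subject].add(item_id)
    return index


def similar_items(item_id, records, index):
    """Return (other_id, shared_subject_count) pairs, most similar first."""
    counts = defaultdict(int)
    for subject in records[item_id]["subjects"]:
        for other in index[subject]:
            if other != item_id:
                counts[other] += 1
    # Sort by number of shared subjects (descending), then by id for stability.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


index = build_subject_index(catalog)
print(similar_items("iliad", catalog, index))
```

Notice that the similarities found this way are limited to the subjects somebody took the time to assign. That is the card-catalog game of following subject headings, just automated.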

Another advantage of these computer-based data structures is the ability to associate much larger volumes of data/information with each item in the list. The most notable of these associations is the full text of the item. Now, not only is it possible to identify items of interest through their metadata (authors, titles, dates, etc.), but it is also possible to identify items via full text searching. Cool? Interesting? We have reached the ultimate in bibliography. Right? Wrong.

Large-Language Models; Vector Databases

The latest data structure to affect the way librarians (can) describe their collections is the large-language model, sometimes called "word embeddings" or "vector databases". These are implemented by first parsing each of the characteristics of a bibliographic item into tokens -- think "words". Next, the frequencies of all the tokens are counted and tabulated. Third, the frequencies are saved to disk. The result is a very wide and very deep matrix of real numbers. We're talking matrices with millions of rows and billions of columns. Taken as a whole, these matrices are sets of vectors, and each vector "points" in a different direction in an N-dimensional space, where N is the number of columns in the matrix.
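The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three records are invented, the tokenize function is a deliberately naive one of my own, and the result is a plain token-count ("bag of words") matrix. Production embeddings are typically learned by a model rather than counted, and saving the matrix to disk is left out here.

```python
import re
from collections import Counter

# Hypothetical, tiny "collection"; a real one would have millions of items.
records = {
    "iliad": "Homer. The Iliad. Achilles, Hector, and the Trojan War.",
    "odyssey": "Homer. The Odyssey. Odysseus sails home from the Trojan War.",
    "walden": "Thoreau. Walden. Life in the woods near Walden Pond.",
}


def tokenize(text):
    """Step 1: parse text into lowercase tokens -- think "words"."""
    return re.findall(r"[a-z]+", text.lower())


# Step 2: count and tabulate the frequency of the tokens in each item.
counts = {item_id: Counter(tokenize(text)) for item_id, text in records.items()}

# The vocabulary defines the columns; N is the number of distinct tokens.
vocabulary = sorted(set().union(*counts.values()))

# One row (vector) per item, one column per token in the vocabulary.
matrix = [[counts[item_id][token] for token in vocabulary] for item_id in records]

print(len(matrix), "rows x", len(vocabulary), "columns")
for item_id, row in zip(records, matrix):
    print(item_id, row)
```

Even with three short records the matrix is mostly zeros, and each row is a vector pointing somewhere in an N-dimensional space, where N is the size of the vocabulary.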

Identifying items of interest in this space is done by garnering a query, vectorizing it, comparing it to all the other vectors in the set, and returning the items whose distances are shortest from the query. Just as importantly, identifying relationships between items is not limited to the librarian-created enumeration of bibliographic characteristics. Instead, a query and a set of one or more items of interest can be vectorized, compared to the whole, and additional items returned. Even more importantly, the query can be in the form of a question like "Who killed Hector?", and the results can be returned as narrative text, "In Homer's epic poem The Iliad, Hector was killed by Achilles." In other words, these large-language models (vector databases) can do more than return lists of items possibly/probably containing the answers to questions. Large-language models can return answers.
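The retrieval half of this process can be sketched as follows. Again, this is a toy under assumptions: the documents are invented, the vectors are simple token counts, and "closeness" is measured with cosine similarity (higher means closer), which is one common choice but not the only one. The function names vectorize, cosine_similarity, and search are mine. The sketch stops at returning items; turning those items into a narrative answer requires a generative model and is outside its scope.

```python
import math
import re
from collections import Counter

# Hypothetical documents standing in for the full text of a collection.
documents = {
    "iliad": "Achilles killed Hector outside the walls of Troy.",
    "odyssey": "Odysseus sailed home to Ithaca after the war.",
    "walden": "Thoreau lived alone in a cabin near Walden Pond.",
}


def vectorize(text):
    """Turn text into a sparse vector of token counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, top_n=2):
    """Vectorize the query, compare it to every document, return the closest ids."""
    query_vector = vectorize(query)
    scored = [(cosine_similarity(query_vector, vectorize(text)), item_id)
              for item_id, text in docs.items()]
    # Keep only items that share something with the query, best match first.
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_n]]


print(search("Who killed Hector?", documents))
```

With raw counts the match depends on shared words ("killed", "Hector"). Learned embeddings relax that requirement, which is why they can connect items no cataloger ever explicitly linked.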

There seems to be a great deal of trepidation in the library community over the use of large-language models. I ask myself, "Why?" I believe this to be the case for at least a few reasons. First, existing large-language models are not rooted in library collections but instead in content scraped from the Internet. Thus, these models are biased because on the Internet anybody can say anything anytime, and such things are not always true. In other words, the models were created from content that is not curated or put into context. Similarly, large-language models created from content scraped from the Internet may be in violation of copyright laws, but the jury is still out on that one.

Second, as a whole, the library community does not understand how large-language models operate, and consequently the community is skeptical. This is to be expected.

Third, I believe the community feels threatened because large-language models decrease the need for libraries and librarians. Allow me to elaborate. People do not use libraries because they want a book. Instead, people use libraries to get a book and then use the book to solve some sort of problem or address some sort of question. Getting the book is merely a means to an end, not the end itself. Large-language models circumvent this process. Not only do large-language models get the book, they also "read" the book and address questions. No libraries or librarians needed, just a very large collection of digitized materials, a computer cluster the size of Walmart, and a cadre of software engineers.♣

Solutions

[INSERT HERE WHAT TO DO CONSIDERING THE WHAT TO DO WITH THE ADVENT OF VECTOR DATABASES.]

Summary

[IN JUST A FEW SENTENCES, DISTILL WHAT WAS OUTLINED ABOVE.]

Notes

♦︎ Quite honestly, I had a lot of fun using those catalogs. Find an item of interest. Look at the subject headings. Look up the subject heading. Find additional items. Repeat. Fun!

♣ Two things. First, I have to admit, for the first time in my career, I feel threatened by the new technology. The use of large-language models is taking much of the fun out of being a librarian. No more collecting. No more cataloging. No more curating. No more fiddling with indexing terms and techniques. No more cool search strategies. No more interfacing between the collection and the patron. Which leads me to the second thing. Actually, it does take more than a cadre of software engineers to create large-language models. Yes, software engineers are required, but those engineers need to know and understand the audience for whom the model is being created. This is where librarians and librarianship can play a role; librarians are expected to know their audience, and thus they can guide the design and development of large-language models. Such is the librarian's super power.


Creator: Eric Lease Morgan <eric_morgan@infomotions.com>
Source: These ideas were written down sometime in January of this year, but they are being published here for the first time.
Date created: 2026-05-14
Date updated: 2026-05-14
Subject(s): Large-Language Models;
URL: http://infomotions.com/musings/models-and-data-structures/