Nate Robinson: Algorithmically discovering structural form in data has been a topic of interest for several years. In 2008 Charles Kemp and Joshua Tenenbaum developed an algorithm to discover structural forms (structures such as partitions, trees, hierarchies, orders, chains, rings, grids, and cylinders) represented in data. Their published paper, The Discovery of Structural Form, has 511 citations on Google Scholar. Kemp and Tenenbaum used human-labeled data to form feature vectors for graph entities. Their algorithm accepts a set of such vectors as a feature matrix and probabilistically organizes the data entities into each of the structural forms listed above, assigning each form a likelihood. (The idea is that, for example, data better suited to a tree than to a ring will have a higher tree likelihood than ring likelihood.)
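The interface is easy to picture even without the original code: you hand the algorithm an entities-by-features matrix and get back a score for each candidate form. Here is a minimal sketch of that contract only; the released form-discovery code is MATLAB, and the scorer, matrix shape, and function names below are purely illustrative stand-ins, not their model.

```python
# Illustrative sketch of the input/output contract of form discovery.
# The scoring function is a dummy stand-in, NOT Kemp and Tenenbaum's search.
import numpy as np

FORMS = ["partition", "chain", "order", "ring", "tree",
         "hierarchy", "grid", "cylinder"]

def form_score(features: np.ndarray, form: str) -> float:
    """Stand-in for a form's likelihood score; the real algorithm searches
    over graphs of the given form and scores how well they explain the data."""
    rng = np.random.default_rng(abs(hash(form)) % (2**32))
    return float(rng.normal())  # dummy number, not the real computation

features = np.random.rand(30, 100)   # arbitrary shape: 30 entities x 100 features
scores = {form: form_score(features, form) for form in FORMS}
print("most likely form:", max(scores, key=scores.get))
```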
A more recent development in the field of data analysis has been the advent of high-speed GPUs and the popularization of neural networks. In 2013 Mikolov et al. introduced word2vec, an unsupervised neural framework that learns n-dimensional numerical vector representations of words (called word embedding vectors). These embedding vectors let us quantify relationships between words. Embedding networks learn to maximize the cosine similarity between the vectors of words that tend to appear in similar contexts. For instance, the highly related words “dog” and “cat” are represented by embedding vectors with high cosine similarity. Word2vec later inspired more advanced algorithms that generate similar word embeddings (e.g. GloVe, Universal Sentence Encoder, and FastText).
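This relationship is easy to check with off-the-shelf pretrained vectors. A minimal sketch using gensim's downloader follows; the particular model name and word pairs are just one convenient choice, and any word2vec, GloVe, or FastText KeyedVectors file works the same way.

```python
# Query pretrained embeddings for cosine similarity between word pairs.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads pretrained vectors on first use
print(vectors.similarity("dog", "cat"))          # relatively high cosine similarity
print(vectors.similarity("dog", "banana"))       # noticeably lower
print(vectors.most_similar("dog", topn=5))       # nearest neighbors in the embedding space
```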
In 2019 Zac Brown and I were studying structural forms. Because of the structured nature of word embedding vectors (where similar words have similar embedding vectors), we used these embedding representations as feature vectors for Kemp and Tenenbaum’s 2008 algorithm. We were able to find interesting structural forms for word categories like animals, foods, and cities in the vector spaces of the GloVe and Universal Sentence Encoder embedding algorithms. This was an intriguing way to combine an algorithm from 2008 with a more recent neural capability. Here are some examples of tree structures and one order structure generated by Kemp and Tenenbaum’s algorithm.
These discovered structural forms give us interesting insight into the depth of knowledge captured by word embedding algorithms.
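To make the pipeline concrete: the feature matrix we fed to the form-discovery code was simply the stacked embedding vectors for every word in a category. Below is a minimal sketch of that preparation step, assuming a local GloVe text file; the file name, word list, and output format are illustrative, and the form-discovery code itself (MATLAB) is not shown.

```python
# Build an entities-by-features matrix from GloVe vectors for one word category.
import numpy as np

ANIMALS = ["dog", "cat", "horse", "cow", "lion", "tiger",
           "salmon", "trout", "eagle", "robin"]

def load_glove_subset(path, vocab):
    """Read only the needed rows from a GloVe .txt file (word + floats per line)."""
    wanted = set(vocab)
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in wanted:
                vecs[word] = np.asarray(values, dtype=np.float32)
    return vecs

vecs = load_glove_subset("glove.6B.300d.txt", ANIMALS)  # path is an assumption
feature_matrix = np.stack([vecs[w] for w in ANIMALS])   # shape: (10, 300)
np.savetxt("animal_features.csv", feature_matrix, delimiter=",")
```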
Zac and I were also interested in the concept of Graph Neural Networks (GNNs). These networks interface with a graph structure and perform tasks such as node classification. To perform such tasks, GNNs need to learn an internal representation of graph structure. (See https://towardsdatascience.com/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3 to learn more.) Zac and I hypothesized that neural networks trained on other tasks (such as large language models) might also be learning graph structure in an unsupervised way. We experimented with OpenAI’s GPT-2 (a Transformer model for text generation) and attempted to use Kemp and Tenenbaum’s algorithm to find graphical structure in the hidden layers of the GPT-2 network. Unfortunately, incompatibilities between our data and the brittle 2008 algorithm, difficulty interpreting GPT-2’s inner representations, and pressing time constraints proved prohibitive to our research at the time.
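For anyone curious what such a probe can look like in practice, here is a hedged sketch of the general approach rather than a record of our exact setup: pull hidden states out of GPT-2 with the Hugging Face transformers library, pool them into one vector per word, and treat the result as a feature matrix for structure analysis. The layer index and pooling choice below are arbitrary.

```python
# Extract per-word feature vectors from one of GPT-2's hidden layers.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

words = ["dog", "cat", "horse", "eagle"]
features = []
with torch.no_grad():
    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        outputs = model(**inputs, output_hidden_states=True)
        layer = outputs.hidden_states[6]               # one intermediate layer (arbitrary choice)
        features.append(layer.mean(dim=1).squeeze(0))  # mean-pool over sub-word tokens

feature_matrix = torch.stack(features)   # shape: (num_words, hidden_size)
print(feature_matrix.shape)              # torch.Size([4, 768]) for the small gpt2 checkpoint
```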