
Nate Robinson: Algorithmically discovering structural form in data has been a topic of interest for several years. In 2008, Charles Kemp and Joshua Tenenbaum developed an algorithm to discover structural forms (partitions, trees, hierarchies, orders, chains, rings, grids, and cylinders) represented in data. Their paper, The Discovery of Structural Form, has 511 citations on Google Scholar. Kemp and Tenenbaum used human-labeled data to build feature vectors for graph entities. Their algorithm takes a set of such vectors as a feature matrix, probabilistically fits the data entities into each of the structural forms listed above, and assigns each form a likelihood. (The idea is that, for example, data better suited to a tree than a ring will have a higher tree likelihood than ring likelihood.)
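To make the idea concrete, here is a heavily simplified sketch of the interface. This is not Kemp and Tenenbaum's actual Bayesian model; it just compares a tree hypothesis against a chain hypothesis by checking how well each reproduces the pairwise distances implied by a feature matrix. The function name and the scoring heuristic are my own illustrative assumptions.

```python
# Toy illustration of "which structural form fits this feature matrix better?"
# NOTE: this is NOT Kemp and Tenenbaum's algorithm -- just a simplified stand-in.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.stats import pearsonr

def score_forms(features):
    """features: (n_entities, n_dims) array of feature vectors."""
    observed = pdist(features, metric="cosine")  # observed pairwise distances

    # Tree hypothesis: cophenetic correlation of a hierarchical clustering,
    # i.e. how well tree distances reproduce the observed distances.
    tree_score, _ = cophenet(linkage(observed, method="average"), observed)

    # Chain hypothesis: place entities along their first principal component
    # and correlate 1-D distances along that line with the observed distances.
    centered = features - features.mean(axis=0)
    first_pc = np.linalg.svd(centered, full_matrices=False)[2][0]
    positions = centered @ first_pc
    chain_score, _ = pearsonr(pdist(positions[:, None]), observed)

    return {"tree": tree_score, "chain": chain_score}

# Random features stand in for real entity feature vectors.
print(score_forms(np.random.rand(20, 50)))
```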

A more recent development in data analysis has been the advent of high-speed GPUs and the popularization of neural networks. In 2013, Mikolov et al. introduced word2vec, an unsupervised neural framework that learns n-dimensional vector representations of words (called word embedding vectors). These embedding vectors let us quantify word relationships: words that tend to appear in similar contexts end up with embedding vectors that have high cosine similarity. For instance, the highly related words “dog” and “cat” are represented by cosine-similar embedding vectors. Word2vec later inspired more advanced algorithms that generate similar word embeddings (e.g. GloVe, Universal Sentence Encoder, and FastText).
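As a quick illustration of what cosine similarity means here, the snippet below compares a few made-up embedding vectors (real vectors would come from a trained model like word2vec or GloVe and have hundreds of dimensions).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 4-dimensional "embeddings" purely for illustration;
# real word2vec/GloVe vectors typically have 100-300 dimensions.
dog = np.array([0.8, 0.1, 0.6, 0.2])
cat = np.array([0.7, 0.2, 0.5, 0.3])
car = np.array([0.1, 0.9, 0.0, 0.8])

print(cosine_similarity(dog, cat))  # high: related words
print(cosine_similarity(dog, car))  # lower: unrelated words
```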

In 2019, Zac Brown and I were studying structural forms. Because word embedding vectors are themselves structured (similar words have similar vectors), we used these embeddings as feature vectors for Kemp and Tenenbaum’s 2008 algorithm. We were able to find interesting structural forms for word categories like animals, foods, and cities in the vector spaces of the GloVe and Universal Sentence Encoder embeddings. This was an intriguing way to combine an algorithm from 2008 with a more recent neural capability. Here are some examples of tree structures and one order structure generated from Kemp and Tenenbaum’s algorithm (a sketch of how such a feature matrix can be assembled follows the figures).

Animal tree from GloVe 300-dimensional embedding vectors. Notice the fine relational structure of the tree: in general, animals with shared characteristics are found in the same clusters. The clusters at nodes 52 and 57 seem overfull, and the cluster at node 54 stands out for its lack of accurate relational structure. Also, since word embedding algorithms learn vector similarity from word context, there appear to be clusters for different animal settings (e.g. “the farm” at node 48 and “the woods” at node 49).
Animal tree from Universal Sentence Encoder embedding vectors. Notice the differences in relational structure compared to the GloVe tree above; the structure in this tree is more consistent. There are two small but curious irregularities: it’s unclear why “eagle” and “wolf” are so closely related, and although “rhino” is close to “elephant”, it seems inaccurate that it sits even closer to “tiger”.
Foods tree from GloVe 300-dimensional embedding vectors. Notice the impressive organization into what appear to be the major food groups.
Cities tree from GloVe 300-dimensional embedding vectors. Note the geographically consistent organization, with individual clusters for different continents or regions.
Animal order from GloVe 300-dimensional embedding vectors. This is particularly impressive, as the algorithm produced what appears to be an inverse-evolutionary order (from insects to fish to reptiles and birds to mammals).
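For reference, here is roughly how a feature matrix like the ones behind these figures can be assembled from pretrained GloVe vectors. The file name and word list are illustrative; the loader assumes the standard GloVe text format of a word followed by its vector components on each line.

```python
# Sketch: build a feature matrix of GloVe vectors for a word list, to feed into
# a structural-form discovery algorithm. Path and word list are illustrative.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

animals = ["dog", "cat", "wolf", "eagle", "rhino", "elephant", "tiger", "salmon"]
glove = load_glove("glove.6B.300d.txt")

# Rows are entities (words), columns are embedding dimensions -- the shape of
# feature matrix the structural-form algorithm expects as input.
feature_matrix = np.stack([glove[w] for w in animals])
np.savetxt("animal_features.csv", feature_matrix, delimiter=",")
```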

Viewing embeddings through the lens of structural forms gives us interesting insight into the depth of knowledge that word embedding algorithms learn.

Zac and I were also interested in Graph Neural Networks (GNNs). These networks operate directly on graph-structured data and perform tasks such as node classification. To perform such tasks, GNNs need to learn an internal representation of graph structure. (See https://towardsdatascience.com/a-gentle-introduction-to-graph-neural-network-basics-deepwalk-and-graphsage-db5d540d50b3 to learn more.) Zac and I hypothesized that neural networks trained on other tasks (such as large language models) might also be learning graph structure in an unsupervised way. We experimented with OpenAI’s GPT-2 (a Transformer model for text generation) and attempted to use Kemp and Tenenbaum’s algorithm to find graphical structure in GPT-2’s hidden layers. Unfortunately, incompatibilities between our data and the brittle 2008 algorithm, difficulty interpreting GPT-2’s internal representations, and pressing time constraints cut the investigation short at the time.
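For anyone wanting to retrace that last experiment today, here is a minimal sketch of pulling hidden-layer activations out of GPT-2 with the Hugging Face transformers library (a convenience assumption on my part, not necessarily the tooling we used). The resulting activations could then be treated as feature vectors, much as we treated the word embeddings above.

```python
# Sketch: extract per-token hidden states from GPT-2 using Hugging Face
# transformers (library choice is an assumption). Each layer's activations
# could serve as feature vectors for structural-form analysis.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

words = ["dog", "cat", "wolf", "eagle", "elephant"]
features = []
with torch.no_grad():
    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        outputs = model(**inputs)
        # outputs.hidden_states is a tuple (embedding output + 12 layers),
        # each of shape (batch, tokens, 768). Take one layer, average over tokens.
        layer = outputs.hidden_states[6]  # an arbitrary middle layer
        features.append(layer.mean(dim=1).squeeze(0))

feature_matrix = torch.stack(features)  # shape: (num_words, 768)
print(feature_matrix.shape)
```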