Thomas Haschka, Joseph Bakarji
The paper introduces a nested density clustering method to create hierarchical semantic trees from text corpora using large language model embeddings, enabling data-driven discovery of research areas and their subfields without predefined categories.
Recent advances in large language models (LLMs) have improved how text is classified and interpreted by meaning. However, understanding the overall structure of semantic relationships across a large text collection remains challenging. This study proposes a method that builds hierarchical trees showing how texts relate to one another in meaning. With these trees, researchers can uncover the natural organization of research areas and their subfields without defining categories in advance. The approach is tested on several datasets and is reported to be effective across different types of text collections.
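To make the idea of nested density clustering concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic DBSCAN-style rule (a point is "core" if at least `min_pts` points, itself included, lie within cosine distance `eps`) and applies it recursively with a shrinking radius, so each cluster found at one level is split into sub-clusters at the next. The function names, the `eps_levels` schedule, the `min_pts` value, and the toy 2-D "embeddings" are all hypothetical choices made for illustration; real LLM embeddings would have hundreds or thousands of dimensions.

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def density_clusters(vectors, members, eps, min_pts):
    """DBSCAN-style clustering restricted to the indices in `members`."""
    neighbours = {
        i: [j for j in members if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in members
    }
    # A core point has at least min_pts points (itself included) within eps.
    core = {i for i in members if len(neighbours[i]) >= min_pts}
    seen, clusters = set(), []
    for start in members:
        if start not in core or start in seen:
            continue
        cluster, stack = [], [start]
        seen.add(start)
        while stack:
            i = stack.pop()
            cluster.append(i)
            if i not in core:
                continue  # border points join a cluster but do not expand it
            for j in neighbours[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(cluster))
    return clusters  # points that are neither core nor border are dropped as noise

def build_tree(vectors, members, eps_levels, min_pts=2):
    """Recursively split `members` using one radius per tree level."""
    node = {"members": sorted(members), "children": []}
    if eps_levels:
        for cluster in density_clusters(vectors, node["members"], eps_levels[0], min_pts):
            node["children"].append(
                build_tree(vectors, cluster, eps_levels[1:], min_pts)
            )
    return node

# Toy stand-ins for embeddings: unit vectors at the given angles (degrees).
angles = [0, 2, 10, 12, 90, 92]
vectors = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

# Hypothetical radius schedule: a coarse level, then a finer one.
tree = build_tree(vectors, list(range(len(vectors))), eps_levels=[0.05, 0.002])

def show(node, depth=0):
    print("  " * depth + str(node["members"]))
    for child in node["children"]:
        show(child, depth + 1)

show(tree)
```

On this toy input the coarse radius separates the texts into two broad groups, and the finer radius splits the first group into two subgroups, which mirrors the area and subfield structure described above. In practice the radius schedule would have to be chosen or learned from the data, and a library implementation of density clustering would replace the quadratic neighbour search used here.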