Filippo Ricca, Paolo Tonella, Christian Girardi and Emanuele Pianta,
An Empirical Study on Keyword-based Web Site Clustering
Abstract
Developers receive only limited support for understanding activities
during Web site evolution. In fact, design diagrams are often missing or
outdated. A potentially interesting option is to reverse engineer
high-level views of Web sites from the content of the
Web pages. Clustering is a valuable technique that can be used in this
respect. Web pages can be clustered together based on the similarity of
summary information about their content, represented as a
list of automatically extracted keywords.
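As a minimal sketch of this idea, the following Python fragment groups pages whose keyword lists overlap. It is an illustration under stated assumptions, not the procedure used in the study: the Jaccard similarity measure, the single-link grouping, the threshold value, and the names `jaccard` and `cluster_pages` are all hypothetical choices made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two keyword lists: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_pages(keywords_by_page, threshold=0.3):
    """Group pages linked by keyword similarity >= threshold.

    Single-link grouping with a union-find structure. The threshold is an
    illustrative assumption, not a value taken from the study.
    """
    pages = list(keywords_by_page)
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the clusters of every sufficiently similar pair of pages.
    for p, q in combinations(pages, 2):
        if jaccard(keywords_by_page[p], keywords_by_page[q]) >= threshold:
            parent[find(p)] = find(q)

    # Collect pages by representative and return a sorted list of clusters.
    clusters = {}
    for p in pages:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())
```

Given a mapping from page names to keyword lists, `cluster_pages` returns a list of page groups. Pages that share no keywords with any other page end up in singleton clusters.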
This paper presents an empirical study that was conducted to determine the
meaningfulness for Web developers of clusters automatically produced from
the analysis of the Web page content. Natural Language Processing (NLP)
plays a central role in content analysis and keyword extraction. Thus, a
second objective of the study was to assess the contribution of
some shallow NLP techniques to the clustering task.