Paolo Tonella, Filippo Ricca, Emanuele Pianta and Christian Girardi,
Using Keyword Extraction for Web Site Clustering
Reverse engineering techniques have the potential to support Web site
understanding, by providing views that show the organization of a site and
its navigational structure. However, representing each Web page as a node
in the diagrams recovered from the source code of a Web site often leads
to huge and unreadable graphs. Moreover, since the level of
connectivity is typically high, the edges in such graphs make the overall
result even less usable.
Clustering can be used to produce cohesive groups of pages that are
displayed as a single node in reverse-engineered diagrams. In this paper, we
propose a clustering method based on the automatic extraction of the
keywords of a Web page. The presence of common keywords is exploited to
decide when it is appropriate to group pages together. A second usage of
the keywords is in the automatic labeling of the recovered clusters of pages.
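
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that keywords are the most frequent non-stopword terms of a page, that two pages share enough keywords when the Jaccard similarity of their keyword sets reaches a threshold, and that a cluster is labeled with the keywords most common among its pages. All function names, the stopword list, and the threshold value are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms of a page's text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return {w for w, _ in Counter(words).most_common(top_n)}


def jaccard(a, b):
    """Jaccard similarity of two keyword sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.3, top_n=5):
    """Greedily group pages whose keyword sets overlap enough.

    A page joins the first existing cluster that contains a sufficiently
    similar page; otherwise it starts a new cluster. The result depends on
    the iteration order of `pages`.
    """
    keywords = {name: extract_keywords(text, top_n) for name, text in pages.items()}
    clusters = []
    for name in pages:
        for cluster in clusters:
            if any(jaccard(keywords[name], keywords[other]) >= threshold
                   for other in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters, keywords


def label_cluster(cluster, keywords, max_terms=3):
    """Label a cluster with the keywords shared by most of its pages."""
    counts = Counter(k for name in cluster for k in keywords[name])
    # Sort by descending count, then alphabetically, for a stable label.
    return sorted(counts, key=lambda k: (-counts[k], k))[:max_terms]
```

Calling `cluster_pages` on a mapping from page names to page text returns the clusters together with the per-page keyword sets, which `label_cluster` then turns into a short label for each group. The greedy single-pass grouping is chosen only for brevity; a hierarchical clustering over the same similarity measure would fit the same description.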