A Framework to Derive Web-Page Context from Hyperlink Structure: Deriving Web-Page Context
Because an anchor in an HTML document points to a related document, picture, or media application, anchor text is a potential source of information about the associated web page. Sometimes, however, the anchor text is missing altogether, or the anchor tag contains only a single word or an image. In these situations, the text surrounding the link, or link-context, becomes important because it can be used to derive the context of the target web page. In this paper, a dataset of about a hundred web pages from different categories of the Open Directory Project (ODP) is surveyed and analyzed. The results show that cohesive text surrounding the anchor, in the form of full sentences, and non-cohesive text present elsewhere in the in-link web pages provide rich semantic information about a target web page, which can in turn be treated as the context of that page. Since a target web page generally has several in-links, a filtering mechanism based on linguistic analysis of all context sentences, which selects the sentence that best describes the target page, has been developed and is described and evaluated in this paper.
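To make the notion of link-context concrete, the following is a minimal sketch of how the anchor text and its enclosing sentence might be pulled from an in-link page. It is an illustration under stated assumptions, not the procedure used in this paper: the names `LinkContextParser` and `link_contexts` are hypothetical, and sentence boundaries are approximated with a simple punctuation-based regular expression rather than any linguistic analysis.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects page text and the character span of each anchor within it."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # text pieces in document order
        self.length = 0       # total characters collected so far
        self.anchors = []     # (href, start_offset, end_offset)
        self._open = None     # (href, start_offset) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), self.length)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.anchors.append((href, start, self.length))
            self._open = None

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def link_contexts(html):
    """Return (href, anchor_text, enclosing_sentence) for every anchor.

    Assumption: a sentence is a run of characters ending in '.', '!' or '?'.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*", text)]
    results = []
    for href, start, end in parser.anchors:
        sentence = ""
        for s, e in spans:
            if s <= start < e:
                # Collapse runs of whitespace left behind by the markup.
                sentence = " ".join(text[s:e].split())
                break
        anchor_text = " ".join(text[start:end].split())
        results.append((href, anchor_text, sentence))
    return results


page = ('<p>Intro here. The <a href="http://example.org/jazz">archive</a> '
        'lists early jazz recordings. Bye.</p>')
contexts = link_contexts(page)
```

For an image anchor, the anchor text returned by this sketch is empty while the enclosing sentence is still recovered, which is the situation in which the link-context matters most.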
Keywords: Hyperlinks, Anchor-text, In-links, Link-context, Cohesive Text, Linguistic Analysis
Lecturer, Department of Computer Engineering, YMCA Institute of Engineering
Professor & Head, Department of Computer Engineering, YMCA Institute of Engineering