A Review on Identifying the Main Content From Web Pages
Journal: International Journal of Science and Research (IJSR) (Vol. 4, No. 4)
Publication Date: 2015-04-05
Authors : Madhura R. Kaddu; R. B. Kulkarni;
Page : 2630-2634
Keywords : DOM Tree; Content extraction; Web mining; Machine learning method; Web page Segmentation;
Abstract
A web page is a web document that makes a large amount of information available, and with the rapid growth of the World Wide Web, users can easily access web pages from anywhere through the internet. A web page contains noisy information, such as menus, footers, unnecessary links, and logos, alongside its main content. Most users are interested only in the main content. Because a web page mixes noisy and main content, this noise degrades the performance of applications such as web summarization, question answering systems, and information retrieval. We therefore propose an extraction process for identifying the main content of web pages. The extraction process consists of automatic extraction techniques and hand-crafted rules. In the automatic extraction techniques, the first step segments the web page into blocks, and the second step differentiates the main content from irrelevant or noisy content. The hand-crafted rule process extracts the main content from web pages by using previously generated rules.
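The two automatic steps described in the abstract (segment the page into blocks, then separate main content from noise) can be illustrated with a minimal sketch. This is not the authors' method: it assumes a simple link-density and text-length heuristic, and the names `BlockSegmenter`, `extract_main_content`, `min_chars`, and `max_link_density`, along with the threshold values, are hypothetical choices made for illustration only.

```python
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries; a real system may use the DOM tree or visual cues.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "section", "article",
              "nav", "header", "footer", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockSegmenter(HTMLParser):
    """Step 1: split a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_chars)
        self._text = []         # text fragments of the block being built
        self._link_chars = 0    # characters of the current block inside <a> tags
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and store it if it holds any text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style bodies
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_content(html, min_chars=40, max_link_density=0.3):
    """Step 2: keep blocks that are long enough and not dominated by link text."""
    segmenter = BlockSegmenter()
    segmenter.feed(html)
    segmenter.close()
    segmenter.flush()  # store any trailing text after the last block tag
    main = []
    for text, link_chars in segmenter.blocks:
        link_density = link_chars / len(text)
        if len(text) >= min_chars and link_density <= max_link_density:
            main.append(text)
    return main
```

Calling `extract_main_content(page_html)` returns the list of blocks judged to be main content. Short, link-heavy blocks such as navigation menus and footers fall below the thresholds and are discarded; the thresholds would need tuning on real pages.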
Last modified: 2021-06-30 21:44:39