EXTRACTING THE MAIN CONTENT FROM WEB PAGES BY ANALYSING THE VISUAL CHARACTERISTICS OF THE ELEMENTS AND CONVERTING TO THE JSON FORMAT

UDC 004.023

  • Kargin Nikolay Sergeevich – Master’s degree student, the Department of Information Systems and Technologies. Belarusian State Technological University (13a, Sverdlova str., 220006, Minsk, Republic of Belarus). E-mail: hello@karh.in

  • Gurin Nikolay Ivanovich – PhD (Physics and Mathematics), Assistant Professor, the Department of Information Systems and Technologies. Belarusian State Technological University (13a, Sverdlova str., 220006, Minsk, Republic of Belarus). E-mail: ngourine@mail.ru

Key words: web, browsers, HTML, CSS, JSON.

For citation: Gurin N. I., Sahon E. S. Extracting the main content from web pages by analysing the visual characteristics of the elements and converting to the JSON format. Proceedings of BSTU, issue 3, Physics and Mathematics. Informatics, 2021, no. 1 (242), pp. 54–60 (In Russian). DOI: https://doi.org/10.52065/2520-2669-2021-242-2-54-60.

Abstract

The article discusses algorithms for extracting the main content from web pages and proposes a method that addresses the difficulties of such extraction. The method is based on the visual characteristics and internal content of web page elements. In the developed method the main content is defined by a single root element; this root element is converted to a JSON format with unambiguous data types describing paragraphs, titles, images, videos, galleries, and other web page elements. A web browser is not required to display the JSON format, which significantly expands its applicability in mobile and embedded technologies owing to greater efficiency. Using the root element in the search improves the quality of main content extraction. It also speeds up extraction when a large number of pages from a single site are processed and permanent storage is used for the processed pages.
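The idea of locating a single root element and converting it to typed JSON can be illustrated with a minimal sketch. It is not the authors' implementation. The Node structure, the 70% text-share threshold, the "smallest rendered area" rule, and the block field names ("type", "level", "text", "src") are all assumptions made for illustration; in practice the element sizes would come from a rendered page rather than being written by hand.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a rendered page element."""
    tag: str
    text: str = ""
    width: float = 0.0   # rendered width in pixels (assumed to be known)
    height: float = 0.0  # rendered height in pixels (assumed to be known)
    src: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def text_length(node: Node) -> int:
    """Total number of text characters in the node and its descendants."""
    return len(node.text) + sum(text_length(c) for c in node.children)


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first traversal of the element tree."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_root(page: Node, min_text_share: float = 0.7) -> Node:
    """Assumed heuristic: among visible elements that hold at least
    min_text_share of the page text, take the one with the smallest area."""
    total = text_length(page)
    candidates = [
        n for n in iter_nodes(page)
        if total and n.width * n.height > 0
        and text_length(n) >= min_text_share * total
    ]
    if not candidates:
        return page
    return min(candidates, key=lambda n: n.width * n.height)


def to_blocks(node: Node) -> List[Dict[str, object]]:
    """Flatten the root element into a list of typed blocks."""
    blocks: List[Dict[str, object]] = []
    for child in node.children:
        tag = child.tag
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            blocks.append({"type": "title", "level": int(tag[1]), "text": child.text})
        elif tag == "p":
            blocks.append({"type": "paragraph", "text": child.text})
        elif tag == "img" and child.src:
            blocks.append({"type": "image", "src": child.src})
        else:
            # Unknown containers are searched recursively for known blocks.
            blocks.extend(to_blocks(child))
    return blocks


# Hand-written example page: navigation, article, and a sidebar.
page = Node("body", width=1200, height=3000, children=[
    Node("nav", text="Home News Sport", width=1200, height=80),
    Node("article", width=800, height=2400, children=[
        Node("h1", text="Extracting main content", width=800, height=60),
        Node("p", text="The main content is located under a single root element.",
             width=800, height=100),
        Node("img", src="figure.png", width=800, height=400),
        Node("p", text="The root element is then converted to typed JSON blocks.",
             width=800, height=100),
    ]),
    Node("aside", width=300, height=2400, children=[
        Node("p", text="Advertisement", width=300, height=40),
    ]),
])

root = find_root(page)
blocks = to_blocks(root)
document_json = json.dumps({"content": blocks}, ensure_ascii=False, indent=2)
print(document_json)
```

In this sketch the navigation bar and the sidebar fall outside the selected root, so only the title, the two paragraphs, and the image reach the JSON output. A consumer such as a mobile application could render these blocks with native widgets, without a browser engine.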

References

  1. State of the Web. Available at: https://httparchive.org/reports/state-of-the-web (accessed 05.11.2020).
  2. AMP on Google. Available at: https://developers.google.com/amp (accessed 05.11.2020).
  3. Turbo-stranitsy dlya vladel'tsev saytov [Turbo pages for website owners]. Available at: https://yandex.ru/adv/turbo (accessed 05.11.2020).
  4. SPA (Single-page application). Available at: https://developer.mozilla.org/en-US/docs/Glossary/SPA (accessed 05.11.2020).
  5. Top 15 Most Popular News Websites. August. 2020. Available at: http://www.ebizmba.com/articles/news-websites (accessed 05.11.2020).
  6. Custom Elements. 3 May. 2018. Available at: https://www.w3.org/TR/custom-elements/ (accessed 05.11.2020).
  7. Puppeteer v 5.4.1. Available at: https://pptr.dev (accessed 05.11.2020).
  8. The Open Graph protocol. 2010. Available at: https://ogp.me (accessed 05.11.2020).
12.01.2021