Extraction of Data from Mass Media Web Sites
- Authors: Varlamov M.1, Turdakov D.1,2,3, Yatskov A.1,2
- Affiliations:
- Ivannikov Institute for System Programming, Russian Academy of Sciences
- Moscow State University
- National Research University—Higher School of Economics
- Issue: Vol 44, No 5 (2018)
- Pages: 344-352
- Section: Article
- URL: https://journals.rcsi.science/0361-7688/article/view/176663
- DOI: https://doi.org/10.1134/S0361768818050092
- ID: 176663
Abstract
Understanding the current state and dynamics of the Internet information space requires fast, broad-coverage tools for extracting data from mass media sites. However, by no means all sites syndicate their content in the RSS format, and developing a specialized extraction tool for each Web site is costly. This paper proposes methods for automatically extracting news texts from arbitrary mass media sites. Classifying Web page types and then grouping their URLs improves the quality of news text extraction. A strategy is also proposed for traversing a site and detecting the pages that contain hyperlinks to news pages; it decreases the number of requests and reduces the load on the site.
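The abstract does not say how URLs are grouped. As a rough illustration only, the sketch below groups URLs by a coarse path template. The helper names `url_template` and `group_urls`, the `{num}` and `{slug}` placeholders, and the segment heuristics are all assumptions made for this example, not the authors' method.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_template(url: str) -> str:
    """Reduce a URL to a coarse path template (hypothetical heuristic)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    parts = []
    for seg in segments:
        if seg.isdigit():
            # Purely numeric segments (dates, article ids) collapse to one token.
            parts.append("{num}")
        elif re.search(r"[\d-]", seg):
            # Segments mixing letters with digits or hyphens look like slugs.
            parts.append("{slug}")
        else:
            # Plain words are kept literally, e.g. section names.
            parts.append(seg)
    return "/" + "/".join(parts)


def group_urls(urls):
    """Group URLs that share the same path template."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_template(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "https://example.com/news/2018/05/12345",
        "https://example.com/news/2018/06/67890",
        "https://example.com/about",
    ]
    for template, members in group_urls(sample).items():
        print(template, len(members))
```

Under this heuristic, a template shared by many URLs is a plausible candidate for a group of news pages, while templates with a single member are more likely to be service or section pages. A real system would need a stronger signal than the path shape alone.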
About the authors
M. Varlamov
Ivannikov Institute for System Programming, Russian Academy of Sciences
Author for correspondence.
Email: varlamov@ispras.ru
Russian Federation, Moscow, 109004
D. Turdakov
Ivannikov Institute for System Programming, Russian Academy of Sciences; Moscow State University; National Research University—Higher School of Economics
Author for correspondence.
Email: turdakov@ispras.ru
Russian Federation, Moscow, 109004; Moscow, 119991; Moscow, 109028
A. Yatskov
Ivannikov Institute for System Programming, Russian Academy of Sciences; Moscow State University
Author for correspondence.
Email: yatskov@ispras.ru
Russian Federation, Moscow, 109004; Moscow, 119991