Extraction of Data from Mass Media Web Sites

M. I. Varlamov; D. Yu. Turdakov; A. K. Yatskov

doi:10.1134/S0361768818050092

Authors: Varlamov M.I.¹, Turdakov D.Y.¹,²,³, Yatskov A.K.¹,²
Affiliations:
1. Ivannikov Institute for System Programming, Russian Academy of Sciences
2. Moscow State University
3. National Research University—Higher School of Economics
Issue: Vol. 44, No. 5 (2018)
Pages: 344-352
Section: Article
URL: https://journals.rcsi.science/0361-7688/article/view/176663
DOI: https://doi.org/10.1134/S0361768818050092
ID: 176663

Abstract

To understand the current state and dynamics of the Internet information space, fast tools with broad coverage are needed for extracting data from mass media sites. However, far from all sites provide data syndication in the RSS format, and developing a specialized extraction tool for each Web site is costly. In this paper, methods for automatic extraction of news texts from arbitrary mass media sites are proposed. Classifying Web page types and then grouping their URLs improves the quality of news text extraction. A strategy is also proposed for traversing a site and detecting the pages that contain hyperlinks to news pages; this strategy decreases the number of requests and reduces the load on the site.

Keywords

Page News, News Texts, Content Extraction Method, Sitemap, Breadth-first Traversal
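
The abstract mentions grouping page URLs after classifying page types but does not say how the grouping is done. The following is a minimal illustrative sketch of one possible approach, not the authors' method: it assumes that URLs with the same path shape (for example, a section name followed by date components and an article slug) belong to the same group. The function names `url_pattern` and `group_urls`, the placeholders `<num>` and `<slug>`, and the `example.com` URLs are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url: str) -> str:
    """Generalize a URL path into a coarse pattern (hypothetical heuristic)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    generalized = []
    for seg in segments:
        if seg.isdigit():
            # Purely numeric segments (years, months, ids) collapse to one token.
            generalized.append("<num>")
        elif re.search(r"\d", seg):
            # Mixed segments such as "12345-title" are treated as article slugs.
            generalized.append("<slug>")
        else:
            # Plain words (section names) are kept verbatim.
            generalized.append(seg)
    return "/" + "/".join(generalized)


def group_urls(urls):
    """Group URLs that share the same generalized path pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "https://example.com/news/2018/05/12345-title",
        "https://example.com/news/2018/06/67890-other",
        "https://example.com/about",
    ]
    for pattern, members in group_urls(sample).items():
        print(pattern, len(members))
```

Under this assumption, a group with many members and a date-like pattern is a plausible candidate for news pages, while singleton patterns are more likely to be service pages. A real system would combine such patterns with the page-type classifier the abstract describes.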
