Methods of extracting biomedical information from patents and scientific publications (on the example of chemical compounds)

Nikolay A. Kolpakov; Колпаков Н. А.; Alexey I. Molodchenkov; Молодченков А. И.; Anton V. Lukin; Лукин А. В.

doi:10.22363/2658-4670-2023-31-1-64-74

Authors: Kolpakov N.A.¹, Molodchenkov A.I.²,³, Lukin A.V.²,³
Affiliations:
1. Moscow Institute of Physics and Technology (MIPT)
2. Federal Research Center “Computer Science and Control” of RAS
3. Peoples’ Friendship University of Russia (RUDN University)
Issue: Vol. 31, No. 1 (2023)
Pages: 64-74
Section: Articles
URL: https://journals.rcsi.science/2658-4670/article/view/315356
DOI: https://doi.org/10.22363/2658-4670-2023-31-1-64-74
EDN: https://elibrary.ru/VNWSXI
ID: 315356

Abstract

This article proposes a machine-learning algorithm for extracting information from biomedical patents and scientific publications, using chemical compounds as the example. Experiments on patents from the USPTO database showed that a BioBERT-based model achieved the best extraction quality.

Keywords

machine learning, natural language processing, named entity recognition, biomedical text processing

About the authors

Nikolay Kolpakov

Moscow Institute of Physics and Technology (MIPT)

Email: kolpakov.na@phystech.edu
ORCID iD: 0000-0002-1640-1357

Master’s student at the Phystech School of Applied Mathematics and Informatics

9, Institutskiy Pereulok, Dolgoprudny, Moscow Region, 141700, Russian Federation

Alexey Molodchenkov

Federal Research Center “Computer Science and Control” of RAS; Peoples’ Friendship University of Russia (RUDN University)

Email: aim@tesyan.ru
ORCID iD: 0000-0003-0039-943X

Candidate of Technical Sciences; employee of the Federal Research Center “Computer Science and Control” of RAS and of the Peoples’ Friendship University of Russia

44-2, Vavilova St., Moscow, 119333, Russian Federation; 6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation

Anton Lukin

Federal Research Center “Computer Science and Control” of RAS; Peoples’ Friendship University of Russia (RUDN University)

Corresponding author.
Email: antonvlukin@gmail.com
ORCID iD: 0000-0003-4391-1958

Employee of the Federal Research Center “Computer Science and Control” of RAS and of the Peoples’ Friendship University of Russia

44-2, Vavilova St., Moscow, 119333, Russian Federation; 6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation
