<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article>
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">Computational nanotechnology</journal-id><journal-title-group><journal-title xml:lang="en">Computational nanotechnology</journal-title><trans-title-group xml:lang="ru"><trans-title>Computational nanotechnology</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2313-223X</issn><issn publication-format="electronic">2587-9693</issn><publisher><publisher-name xml:lang="en">YUR-VAK</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">380182</article-id><article-id pub-id-type="doi">10.33693/2313-223X-2025-12-4-13-19</article-id><article-id pub-id-type="edn">FLJQXL</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>MATHEMATICAL MODELING, NUMERICAL METHODS AND COMPLEX PROGRAMS</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>МАТЕМАТИЧЕСКОЕ МОДЕЛИРОВАНИЕ, ЧИСЛЕННЫЕ МЕТОДЫ И КОМПЛЕКСЫ ПРОГРАММ</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Triplet-based knowledge mining using pretrained large language models</article-title><trans-title-group xml:lang="ru"><trans-title>Извлечение знаний в формате триплетов с использованием дообученных больших языковых моделей</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0009-0000-7633-7302</contrib-id><contrib-id contrib-id-type="spin">8905-0826</contrib-id><name-alternatives><name 
xml:lang="en"><surname>Zinnurov</surname><given-names>Bulat R.</given-names></name><name xml:lang="ru"><surname>Зиннуров</surname><given-names>Булат Ринатович</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>postgraduate student, Department of Automated Systems for Information Processing and Control</p></bio><bio xml:lang="ru"><p>аспирант, кафедра автоматизированных систем обработки информации и управления (АСОИУ)</p></bio><email>b.zinnurov@yandex.ru</email><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-0571-5593</contrib-id><contrib-id contrib-id-type="scopus">56165279600</contrib-id><contrib-id contrib-id-type="researcherid">E-8566-2017</contrib-id><contrib-id contrib-id-type="spin">6882-0089</contrib-id><name-alternatives><name xml:lang="en"><surname>Gizatullin</surname><given-names>Zinnur M.</given-names></name><name xml:lang="ru"><surname>Гизатуллин</surname><given-names>Зиннур Марселевич</given-names></name></name-alternatives><address><country country="RU">Russian Federation</country></address><bio xml:lang="en"><p>Dr. Sci. (Eng.), Professor, Department of Automated Systems for Information Processing and Control</p></bio><bio xml:lang="ru"><p>доктор технических наук, профессор, кафедра автоматизированных систем обработки информации и управления (АСОИУ)</p></bio><email>zmgizatullin@kai.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Kazan National Research Technical University named after A.N. Tupolev – KAI</institution></aff><aff><institution xml:lang="ru">Казанский национальный исследовательский технический университет им. А.Н. 
Туполева – КАИ</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2025-12-12" publication-format="electronic"><day>12</day><month>12</month><year>2025</year></pub-date><volume>12</volume><issue>4</issue><issue-title xml:lang="en">Computational nanotechnology</issue-title><issue-title xml:lang="ru">Computational nanotechnology</issue-title><fpage>13</fpage><lpage>19</lpage><history><date date-type="received" iso-8601-date="2026-02-02"><day>02</day><month>02</month><year>2026</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2025, Yur-VAK</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2025, Юр-ВАК</copyright-statement><copyright-year>2025</copyright-year><copyright-holder xml:lang="en">Yur-VAK</copyright-holder><copyright-holder xml:lang="ru">Юр-ВАК</copyright-holder><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://www.urvak.ru/contacts/</ali:license_ref></license></permissions><self-uri xlink:href="https://journals.rcsi.science/2313-223X/article/view/380182">https://journals.rcsi.science/2313-223X/article/view/380182</self-uri><abstract xml:lang="en"><p>Extracting structured information from text is a key task in natural language processing. Large language models achieve high accuracy on information extraction tasks thanks to pre-training on huge volumes of data. However, such models require significant computational resources and are unavailable for local use because they depend on cloud infrastructure. Compact, open-source large language models that can be fine-tuned locally are therefore increasingly used to address this problem. This paper evaluates the effectiveness of fine-tuning compact large language models for automated extraction of information as triplets from unstructured text. The study used the Mistral model with seven billion parameters. 
The model was fine-tuned on a custom dataset of 650 examples, each containing an instruction, an input text, and an expected output. The results confirm the effectiveness of fine-tuning: the F1-score increased several-fold compared to the baseline model. The fine-tuned model is competitive with the large-scale DeepSeek language model, which has 685 billion parameters. These results highlight the potential of compact open large language models for knowledge extraction tasks, such as knowledge graph construction, under resource constraints.</p></abstract><trans-abstract xml:lang="ru"><p>Извлечение структурированной информации из текста является одной из ключевых задач в области обработки естественного языка. Большие языковые модели в задачах извлечения информации достигают высокой точности благодаря предобучению на огромных объемах данных. Однако такие модели требуют значительных вычислительных ресурсов и недоступны для локального использования из-за зависимости от облачной инфраструктуры. Поэтому в настоящее время для решения этой проблемы, все чаще используют компактные открытые большие языковые модели, которые можно дообучить локально. Цель работы – сравнительная оценка эффективности дообучения компактных больших языковых моделей для автоматизированного извлечения информации в формате триплетов из неструктурированного текста. В работе использовалась модель Mistral с семи млрд параметров. Дообучение модели было проведено на собственном наборе данных, состоящем из 650 примеров, где каждая запись содержала инструкцию, входной текст и ожидаемый ответ. Полученные результаты подтверждают эффективность дообучения: критерий F1-мера вырос в разы в сравнении с базовой моделью. Дообученная версия модели демонстрирует конкурентоспособность с крупной большой языковой моделью DeepSeek с 685 млрд параметров. 
Полученные результаты подчеркивают потенциал компактных открытых больших языковых моделей для задач извлечения знаний в условиях ограниченных ресурсов, например, для задачи построения графов знаний.</p></trans-abstract><kwd-group xml:lang="en"><kwd>large language model</kwd><kwd>retraining</kwd><kwd>instruction tuning</kwd><kwd>triplet extraction</kwd><kwd>knowledge graph</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>большая языковая модель</kwd><kwd>дообучение</kwd><kwd>настройка инструкций</kwd><kwd>извлечение триплетов</kwd><kwd>граф знаний</kwd></kwd-group><funding-group><funding-statement xml:lang="en">The study was conducted within the framework of the Advanced Engineering School “Integrated Aviation Engineering” (Agreement 075-15-2025-129).</funding-statement><funding-statement xml:lang="ru">Исследование проведено в рамках Передовой инженерной школы «Комплексная авиационная инженерия» (Соглашение 075-15-2025-129).</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><citation-alternatives><mixed-citation xml:lang="en">Ivanova G.S., Martynyuk P.A. Analysis of systems for extracting information from unstructured text documents. Neurocomputers: Development, Application. 2025. Vol. 27. No. 1. Pp. 5–27. (In Rus.). DOI: 10.18127/j19998554-202501-01.</mixed-citation><mixed-citation xml:lang="ru">Иванова Г.С., Мартынюк П.А. Анализ систем извлечения информации из неструктурированных текстовых документов // Нейрокомпьютеры: разработка, применение. 2025. Т. 27. № 1. С. 5–27. DOI: 10.18127/j19998554-202501-01.</mixed-citation></citation-alternatives></ref><ref id="B2"><label>2.</label><citation-alternatives><mixed-citation xml:lang="en">Ma R., Wang P., Liu C. et al. S2R: Teaching LLMs to self-verify and self-correct via reinforcement learning. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vol. 1: Long papers. 
Vienna, Austria: Association for Computational Linguistics. Pp. 22632–22654. DOI: 10.18653/v1/2025.acl-long.1104.</mixed-citation><mixed-citation xml:lang="ru">Ma R., Wang P., Liu C. et al. S2R: Teaching LLMs to self-verify and self-correct via reinforcement learning // Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vol. 1: Long papers. Vienna, Austria: Association for Computational Linguistics. Pp. 22632–22654. DOI: 10.18653/v1/2025.acl-long.1104.</mixed-citation></citation-alternatives></ref><ref id="B3"><label>3.</label><citation-alternatives><mixed-citation xml:lang="en">Ji S., Pan S., Cambria E. et al. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems. 2021. Vol. 33. No. 2. Pp. 494–514. DOI: 10.1109/TNNLS.2021.3070843.</mixed-citation><mixed-citation xml:lang="ru">Ji S., Pan S., Cambria E. et al. A survey on knowledge graphs: Representation, acquisition, and applications // IEEE Transactions on Neural Networks and Learning Systems. 2021. Vol. 33. No. 2. Pp. 494–514. DOI: 10.1109/TNNLS.2021.3070843.</mixed-citation></citation-alternatives></ref><ref id="B4"><label>4.</label><citation-alternatives><mixed-citation xml:lang="en">Procko T.T., Ochoa O. Graph retrieval-augmented generation for large language models: A survey. In: Proceedings of the Conference on AI, Science, Engineering, and Technology (AIxSET). Laguna Hills, CA, USA, 2024. Pp. 166–169. DOI: 10.1109/AIxSET62544.2024.00030.</mixed-citation><mixed-citation xml:lang="ru">Procko T.T., Ochoa O. Graph retrieval-augmented generation for large language models: A survey // Proceedings of the Conference on AI, Science, Engineering, and Technology (AIxSET). Laguna Hills, CA, USA, 2024. Pp. 166–169. DOI: 10.1109/AIxSET62544.2024.00030.</mixed-citation></citation-alternatives></ref><ref id="B5"><label>5.</label><citation-alternatives><mixed-citation xml:lang="en">Dagdelen J. et al. 
Structured information extraction from scientific text with large language models. Nature Communications. 2024. Vol. 15. Art. 1418. DOI: 10.1038/s41467-024-45563-x.</mixed-citation><mixed-citation xml:lang="ru">Dagdelen J. et al. Structured information extraction from scientific text with large language models // Nature Communications. 2024. Vol. 15. Art. 1418. DOI: 10.1038/s41467-024-45563-x.</mixed-citation></citation-alternatives></ref><ref id="B6"><label>6.</label><citation-alternatives><mixed-citation xml:lang="en">Chung H.W., Hou L., Longpre S. et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research. 2024. Vol. 25. No. 70. Pp. 1–53.</mixed-citation><mixed-citation xml:lang="ru">Chung H.W., Hou L., Longpre S. et al. Scaling instruction-finetuned language models // Journal of Machine Learning Research. 2024. Vol. 25. No. 70. Pp. 1–53.</mixed-citation></citation-alternatives></ref><ref id="B7"><label>7.</label><citation-alternatives><mixed-citation xml:lang="en">Hu E.J., Shen Y., Wallis P. et al. LoRA: Low-rank adaptation of large language models. ICLR. 2022. Vol. 1. No. 2. URL: https://arxiv.org/abs/2106.09685</mixed-citation><mixed-citation xml:lang="ru">Hu E.J., Shen Y., Wallis P. et al. LoRA: Low-rank adaptation of large language models // ICLR. 2022. Vol. 1. No. 2. URL: https://arxiv.org/abs/2106.09685</mixed-citation></citation-alternatives></ref><ref id="B8"><label>8.</label><citation-alternatives><mixed-citation xml:lang="en">Wang J. LLM-based fine-tuning data generation for relation triplet extraction with expert ensemble and demonstration selection. In: Proceedings of the IEEE 12th International Conference on Intelligent Systems (IS). IEEE, 2024. Pp. 1–7. DOI: 10.1109/IS61756.2024.10705209.</mixed-citation><mixed-citation xml:lang="ru">Wang J. 
LLM-based fine-tuning data generation for relation triplet extraction with expert ensemble and demonstration selection // IEEE 12th International Conference on Intelligent Systems (IS). IEEE, 2024. Pp. 1–7. DOI: 10.1109/IS61756.2024.10705209.</mixed-citation></citation-alternatives></ref><ref id="B9"><label>9.</label><citation-alternatives><mixed-citation xml:lang="en">Zhang Y., Sadler T., Taesiri M.R. et al. Fine-tuning language models for triple extraction with data augmentation. In: Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024). 2024. Pp. 116–124. DOI: 10.18653/v1/2024.kallm-1.12.</mixed-citation><mixed-citation xml:lang="ru">Zhang Y., Sadler T., Taesiri M.R. et al. Fine-tuning language models for triple extraction with data augmentation // Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024). 2024. Pp. 116–124. DOI: 10.18653/v1/2024.kallm-1.12.</mixed-citation></citation-alternatives></ref></ref-list></back></article>
