Terminology Spectrum Analysis of Natural-Language Chemical Documents: Term-Like Phrases Retrieval Routine | Sciact - CRIS system of the Boreskov Institute of Catalysis

Terminology Spectrum Analysis of Natural-Language Chemical Documents: Term-Like Phrases Retrieval Routine (Scientific publication)

Journal

Journal of Cheminformatics
ISSN: 1758-2946

Publication details

Year: 2016, Volume: 8, Issue: 1, Article number: 22, Pages: 17, DOI: 10.1186/s13321-016-0136-4

Keywords

n-Gram analysis, Natural language text analysis, Term-like phrases retrieval, Terminology spectrum, Text information retrieval

Authors

Alperin Boris L. ¹, Kuzmin Andrey O. ¹,², Ilina Ludmila Yu. ¹, Gusev Vladimir D. ³, Salomatina Natalia V. ³, Parmon Valentin N. ¹,²

Affiliations

1	Boreskov Institute of Catalysis SB RAS, Pr. Lavrentieva 5, Novosibirsk, Russia 630090
2	Novosibirsk State University, Pirogova 2, Novosibirsk, Russia 630090
3	Sobolev Institute of Mathematics SB RAS, Acad. Koptyug Avenue 4, Novosibirsk, Russia 630090

Funding information (1)

Federal Agency for Scientific Organizations of Russia

V.46.4.4.

Background: This study develops, tests and assesses a methodology for automatically extracting a complete set of 'term-like phrases' and building a terminology spectrum from a collection of natural-language PDF documents in the field of chemistry. A 'term-like phrase' is defined as one or more consecutive words and/or alphanumeric strings, with unchanged spelling, that convey a specific scientific meaning. A terminology spectrum for a natural-language document is an indexed list of tagged entities, including recognized general scientific concepts, terms linked to existing thesauri, names of chemical substances and reactions, and term-like phrases. The retrieval routine is based on n-gram textual analysis with sequential execution of various 'accept and reject' rules that take morphological and structural information into account.

Results: The retrieval process was assessed quantitatively by precision (P), recall (R) and F1-measure, calculated manually by professional chemical scientists from a limited set of documents (the full set of text abstracts belonging to 5 EuropaCat events was processed). The assessment supports the effectiveness of the developed approach: term-like phrase parsing reached P = 0.53, R = 0.71 and F1 = 0.61.

Conclusion: The paper suggests using such terminology spectra to perform various types of textual analysis across document collections. A terminology spectrum of this kind may be employed for text information retrieval, for reference database development, for analyzing research trends in subject fields, and for finding similarities between documents.
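The abstract describes the routine only in outline, so the following Python fragment is a minimal sketch of the general idea rather than the authors' implementation. It generates word n-grams per sentence, applies one toy reject rule (no stop word at either end of a phrase), and checks that the reported F1 value is the harmonic mean of the reported P and R. The stop-word list, the rule, the function names and the sample text are all assumptions made for illustration; the published rules rely on morphological and structural information that is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word list, used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "or",
              "to", "with", "is", "are", "was", "by", "over"}


def tokenize(sentence):
    """Lower-case a sentence and return alphanumeric tokens (inner hyphens kept)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def accept(gram):
    """Toy reject rule (assumed): drop n-grams that start or end with a stop word."""
    return gram[0] not in STOP_WORDS and gram[-1] not in STOP_WORDS


def candidate_phrases(text, max_n=3):
    """Count accepted n-grams (1..max_n words) without crossing sentence ends."""
    counts = Counter()
    # Naive sentence split; real chemical text (decimals, formulas) needs more care.
    for sentence in re.split(r"[.!?]+", text):
        tokens = tokenize(sentence)
        for n in range(1, max_n + 1):
            counts.update(g for g in ngrams(tokens, n) if accept(g))
    return counts


def f1_score(precision, recall):
    """F1-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up sample text, not taken from the paper's corpus.
text = ("Oxidation of methane over the supported catalyst. "
        "The supported catalyst was stable.")
spectrum = candidate_phrases(text)
print(spectrum[("supported", "catalyst")])   # occurrences of one candidate phrase
print(round(f1_score(0.53, 0.71), 2))        # consistency check on the reported values
```

In this sketch a phrase such as "oxidation of methane" survives because the stop word sits in the middle, while "of methane" is rejected; a realistic pipeline would chain many such rules and tag the surviving phrases against thesauri.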

Alperin B.L. , Kuzmin A.O. , Ilina L.Y. , Gusev V.D. , Salomatina N.V. , Parmon V.N.
Terminology Spectrum Analysis of Natural-Language Chemical Documents: Term-Like Phrases Retrieval Routine

Journal of Cheminformatics. 2016. V.8. N1. 22:1-17. DOI: 10.1186/s13321-016-0136-4 WOS Scopus RSCI OpenAlex CAPlusCA PMID

Full text from the publisher

Received:	26 Nov 2015
Accepted for publication:	20 Apr 2016
Published online:	29 Apr 2016
Published in print:	1 Dec 2016

Web of Science:	WOS:000377061500001
Scopus:	2-s2.0-84978128117
RSCI:	27016401
OpenAlex:	W2344511361
Chemical Abstracts:	2017:744448
Chemical Abstracts (print):	MEDLINE:27134681
PMID (PubMed):	27134681

Web of Science	1	Data collected on 20.02.2026
Scopus	3	Data collected on 22.02.2026
RSCI	3	Data collected on 22.02.2026
OpenAlex	4	Data collected on 22.02.2026