From 140debd829b35494d6c6452514b81192846c1474 Mon Sep 17 00:00:00 2001 From: olga Date: Tue, 17 Dec 2019 12:07:15 -0800 Subject: [PATCH 1/2] - Updated README - Updated examples --- README.md | 134 ++++++++++------------------- material_parser/example.py | 6 +- material_parser/material_parser.py | 4 +- requirements.txt | 3 +- 4 files changed, 50 insertions(+), 97 deletions(-) diff --git a/README.md b/README.md index 924d437..47a81b8 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,16 @@ -# Material Parser +# MaterialParser -The class to extract composition of a given material +The class providing functionality to extract chemical data from a string of chemical terms/formulas/material names -**Material Parser** processes one material entity per run, and allows for: +This parser was created in order to address the problem of unification of materials entities found in scientific publications to facilitate text mining. - * parsing material string into composition, - * constructing dictionary of materials abbreviations, - * finding values of stoichiometric and elements variables, +**Material Parser** functionality includes: + + * converting chemical terms into chemical formula + * parsing chemical formula into composition, + * constructing dictionary of materials abbreviations from text snippets, + * finding values of stoichiometric and elements variables from text snippets, * splitting mixtures/composites/alloys/solid solutions into compounds - * reconstructing chemical formula from chemical name ### Installation: ``` @@ -33,116 +35,68 @@ mp = MaterialParser(verbose=False, pubchem_lookup=False, fails_log=False) fails_log: outputs log of materials for which mp.parse_material failed (useful when long list of materials in processes) ``` -#### Methods to extract material composition +#### Primary functionality - * mp.parse_material(material_string) + * mp.parse_material_string(material_string) ``` - main method to parse material string into chemical structure and -
convert chemical name into chemical formula - :param material_string: material name/formula - :return: dict(material_string: initial material string, - material_name: chemical name of material found in the string - material_formula: chemical formula of material - dopants: list of dopped materials/elements appeared in material string - phase: material phase appeared in material string - hydrate: if material is hydrate fraction of H2O - is_mixture: material is mixture/composite/alloy/solid solution - is_abbreviation: material is similar to abbreviation - fraction_vars: elements fraction variables and their values - elements_vars: elements variables and their values - composition: compound constitute of the material: composition (element: fraction) and - fraction of compound) + main method to compile string of chemical terms/formulas into data structure ``` - * mp.get_structure_by_formula(chemical_formula) + * mp.string2formula(material_string) + ``` + method to convert chemical name into formula + ``` + + * mp.formula2composition(chemical_formula) ``` - parsing chemical formula in composition - :param chemical_formula: chemical formula - :return: dict(formula: formula string corresponding to obtained composition - composition: element: fraction - fraction_vars: elements fraction variables: values - elements_vars: elements variables: values - hydrate: if material is hydrate fraction of H2O - phase: material phase appeared in formula - ) + method to compile chemical formula into data structure containing composition ``` -#### Methods to reconstruct chemical formula from material name +#### Auxiliary functions - * mp.split_material_name(material_string) + * mp.separate_additives(material_string) ``` - splitting material string into chemical name + chemical formula - :param material_string: in form of - "chemical name chemical formula"/"chemical name [chemical formula]" - :return: name: chemical name found in material string - formula: chemical formula found in material 
string - structure: output of get_structure_by_formula() + extracts snippets of the string recognized as doped elements, stabilizers, coatings, activators, etc. ``` - * mp.reconstruct_formula(material_name, valency='') + * mp.split_formula_into_compounds(material_string) ``` - reconstructing chemical formula for simple chemical names anion + cation - :param material_name: chemical name - :param valency: anion valency - :return: chemical formula + splits mixtures, alloys, composites, etc into list of constituting compounds with their fractions ``` - -#### Methods to simplify material string - - * mp.split_material(material_name) + * mp.get_species(material_string) ``` - splitting mixture/composite/solid solution/alloy into compound+fraction - :param material_name: material formula - :return: of : (compound, fraction) + extract species from material string ``` - * mp.get_dopants(material_name) +#### Additional functionality + + * mp.build_acronyms_dict(list_of_materials, text) ``` - resolving doped part in material string - :param material_name: material string - :return: of dopants, - new material name + constructs dictionary of acronyms based on provided list of materials strings and text ``` - * mp.reconstruct_list_of_materials(material_string) + + * mp.get_elements_values(variable, text) ``` - split material string into list of compounds - when it's given in form cation + several anions - for example: "oxides of manganese and lithium" - :param material_string: - :return: of chemical names + looks for the values of elements variables in the text ``` - * mp.cleanup_name(material_name) + * mp.get_stoichiometric_values(variable, text) ``` - cleaning up material name - fix due to tokenization imperfectness - :param material_name: material string - :return: updated material string + looks for the values of the variables for stoichiometric amounts in the text ``` -#### Methods to resolve abbreviations and variables - - * mp.build_abbreviations_dict(materials_list, 
sentences) + * mp.split_materials_list(material_string) ``` - constructing dictionary of abbreviations appeared in material list - :param paragraph: of sentences to look for abbreviations names - :param materials_list: of list of materials entities - :return: abbreviation: corresponding string + for material string in the format list of cations+anion, splits in into list of chemical names ``` - * mp.get_stoichiometric_values(var, sentence) + * mp.substitute_additives(list_of_additives, data_structure) ``` - find numeric values of var in sentence - :param var: variable name - :param sentence: sentence to look for - :return: : max_value: upper limit - min_value: lower limit - values: of numeric values + substitutes doped elements into original chemical formula to complete total stoiciometry to integer value ``` - * mp.get_elements_values(var, sentence): - ``` - find elements values for var in the sentence - :param var: variable name - :param sentence: sentence to look for - :return: of found values - ``` \ No newline at end of file +#### Citing + +If you use Material Parser in your work, please cite the following paper: + + * Kononova et. 
al "Text-mined dataset of inorganic materials synthesis recipes", Scientific Data 6 (1), 1-11 (2019) [10.1038/s41597-019-0224-1](https://www.nature.com/articles/s41597-019-0224-1) \ No newline at end of file diff --git a/material_parser/example.py b/material_parser/example.py index b3198c4..138cffe 100644 --- a/material_parser/example.py +++ b/material_parser/example.py @@ -7,20 +7,18 @@ mp = MaterialParser(pubchem_lookup=False, verbose=False) test_set = json.loads(open('test_data.json').read()) +print(len(test_set)) for item in test_set: material = item['material'] correct = item['parser_output'] - #print (material) - list_of_materials = mp.reconstruct_list_of_materials(material) + list_of_materials = mp.split_materials_list(material) list_of_materials = list_of_materials if list_of_materials != [] else [ material ] structure = [] for m in list_of_materials: structure.append(mp.parse_material_string(m)) - #pprint(structure) - #pprint(correct) if structure != correct: print("Mismatch for ", material) print('-'*40) diff --git a/material_parser/material_parser.py b/material_parser/material_parser.py index 0cb35fc..f19bec9 100644 --- a/material_parser/material_parser.py +++ b/material_parser/material_parser.py @@ -3,7 +3,7 @@ __author__ = "Olga Kononova" __maintainer__ = "Olga Kononova" __email__ = "0lgaGkononova@yandex.ru" -__version__ = "6.0.2" +__version__ = "6.0.3" import os import json @@ -19,7 +19,7 @@ class MaterialParser: def __init__(self, verbose=False, pubchem_lookup=False, fails_log=False, dictionary_update=False): - print("Initializing MaterialParser (forked) version 6.0.2") + print("Initializing MaterialParser version 6.0.3") self.__filename = os.path.dirname(os.path.realpath(__file__)) self.__pubchem_dictionary = json.loads(open(os.path.join(self.__filename, "rsc/pubchem_dict.json")).read()) diff --git a/requirements.txt b/requirements.txt index d1a88c7..8d2ad54 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,5 @@ regex pubchempy sympy 
-spacy \ No newline at end of file +spacy +json \ No newline at end of file From 1bdfd5267ed39e599662c7618be3105464cafe98 Mon Sep 17 00:00:00 2001 From: olga Date: Tue, 17 Dec 2019 12:13:55 -0800 Subject: [PATCH 2/2] - Bug fix in README --- README.md | 46 +++++++++++++++++++++++----------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index 47a81b8..eaf5c91 100644 --- a/README.md +++ b/README.md @@ -37,62 +37,62 @@ mp = MaterialParser(verbose=False, pubchem_lookup=False, fails_log=False) #### Primary functionality - * mp.parse_material_string(material_string) - ``` - main method to compile string of chemical terms/formulas into data structure + * Main method to compile string of chemical terms/formulas into data structure + ``` + mp.parse_material_string(material_string) ``` - * mp.string2formula(material_string) + * Method to convert chemical name into formula ``` - method to convert chemical name into formula + mp.string2formula(material_string) ``` - * mp.formula2composition(chemical_formula) + * Method to compile chemical formula into data structure containing composition ``` - method to compile chemical formula into data structure containing composition + mp.formula2composition(chemical_formula) ``` #### Auxiliary functions - * mp.separate_additives(material_string) + * Extracting snippets of the string recognized as doped elements, stabilizers, coatings, activators, etc. ``` - extracts snippets of the string recognized as doped elements, stabilizers, coatings, activators, etc. 
+ mp.separate_additives(material_string) ``` - * mp.split_formula_into_compounds(material_string) + * Splitting mixtures, alloys, composites, etc. into list of constituent compounds with their fractions ``` - splits mixtures, alloys, composites, etc into list of constituting compounds with their fractions + mp.split_formula_into_compounds(material_string) ``` - * mp.get_species(material_string) + * Extracting species from material string ``` - extract species from material string + mp.get_species(material_string) ``` #### Additional functionality - * mp.build_acronyms_dict(list_of_materials, text) + * Constructing dictionary of acronyms based on provided list of materials strings and text ``` - constructs dictionary of acronyms based on provided list of materials strings and text + mp.build_acronyms_dict(list_of_materials, text) ``` - * mp.get_elements_values(variable, text) + * Looking for the values of elements variables in the text ``` - looks for the values of elements variables in the text + mp.get_elements_values(variable, text) ``` - * mp.get_stoichiometric_values(variable, text) + * Looking for the values of the variables for stoichiometric amounts in the text ``` - looks for the values of the variables for stoichiometric amounts in the text + mp.get_stoichiometric_values(variable, text) ``` - * mp.split_materials_list(material_string) + * Splitting a material string in the format list of cations+anion into list of chemical names ``` - for material string in the format list of cations+anion, splits in into list of chemical names + mp.split_materials_list(material_string) ``` - * mp.substitute_additives(list_of_additives, data_structure) + * Substituting doped elements into original chemical formula to complete total stoichiometry to integer value ``` - substitutes doped elements into original chemical formula to complete total stoichiometry to integer value + mp.substitute_additives(list_of_additives, data_structure) ``` #### Citing
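
Note on the README changes: the updated README lists `mp.formula2composition(chemical_formula)` as the method that compiles a formula into a composition, but the patch no longer documents what that composition looks like. As a rough illustration of the idea only, here is a toy stand-in. It is not the MaterialParser implementation, and the function name `toy_formula_to_composition` and its return format (a plain dict of element to amount) are assumptions made for this sketch. It handles flat formulas only, with no brackets, hydrates, or variables.

```python
import re

# One token = element symbol (capital letter plus optional lowercase letter)
# followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def toy_formula_to_composition(formula):
    """Map each element in a flat formula to its amount (hypothetical helper)."""
    composition = {}
    position = 0
    for match in _TOKEN.finditer(formula):
        # Tokens must be contiguous; a gap means an unsupported character.
        if match.start() != position:
            raise ValueError(f"cannot parse {formula!r} at offset {position}")
        element, amount = match.groups()
        # A missing amount means one atom; repeated elements are summed.
        composition[element] = composition.get(element, 0.0) + float(amount or 1)
        position = match.end()
    if position != len(formula) or not composition:
        raise ValueError(f"cannot parse {formula!r} at offset {position}")
    return composition


if __name__ == "__main__":
    print(toy_formula_to_composition("LiFePO4"))
    print(toy_formula_to_composition("Li0.5CoO2"))
```

The sketch rejects anything it does not understand (for example `Ba(OH)2`) instead of guessing, which keeps its limits visible. The real parser covers far more cases, as the feature list in the README describes.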