MetaMap API breaks when special characters (e.g. 'ß') occurs in a word #8

Open
KimBenjaminTang opened this issue Dec 6, 2022 · 2 comments

Comments

@KimBenjaminTang

Hello, I am trying to have MetaMap process some translated German texts that include words containing the letter 'ß'.

After analyzing why the JSON output breaks, I found that the character 'ß' seems to cause an error when it occurs inside a word (but not as a standalone character).

Example request:

from skr_web_api import Submission, METAMAP_INTERACTIVE_URL

# email and apikey hold the account credentials (defined elsewhere)
args = "-AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase -Z 2022AA"
inst = Submission(email, apikey)
inst.init_mm_interactive('This is a test with Straße', args=args)
response = inst.submit()

When I decode the content of the response via response.content.decode(), it returns a broken JSON string (broken in that it is cut off and never closes at the end):

/dmzfiler/II_Group/MetaMap2020/public_mm/bin/SKRrun.20 /dmzfiler/II_Group/MetaMap2020/public_mm/bin/metamap20.BINARY.Linux --lexicon db -Z 2022AA --silent -AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase
{"AllDocuments":[
{
   "Document": {
     "CmdLine": {
       "Command": "metamap --lexicon db -Z 2022AA --silent -AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase",
       "Options": [
         {
           "OptName": "lexicon",
           "OptValue": "db"
         },
         {
           "OptName": "mm_data_year",
           "OptValue": "2022AA"
         },
         {
           "OptName": "silent"
         },
         {
           "OptName": "strict_model"
         },
         {
           "OptName": "show_cuis"
         },
         {
           "OptName": "restrict_to_sources",
           "OptValue": ["SNOMEDCT_US_2022_03_01"]
         },
         {
           "OptName": "JSONf",
           "OptValue": "2"
         },
         {
           "OptName": "mm_data_version",
           "OptValue": "USAbase"
         },
         {
           "OptName": "infile",
           "OptValue": "user_input"
         },
         {
           "OptName": "outfile",
           "OptValue": "user_output"
         }]
     },
     "AAs": [],
     "Negations": [],
     "Utterances": [
       {
         "PMID": "USER",
         "UttSection": "tx",
         "UttNum": "1",
         "UttText": [

A partial fix would be to replace the character 'ß' with 'ss' to avoid this issue, but I am not sure whether the results would be the same as with the online version of MetaMap, where words containing 'ß' are not a problem:

Request:

User Information: [email protected]
Run Time: 12/06/2022 06:12:29

MetaMap Version Used: metamap20
MetaMap Options: -A+ -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase
Knowledge Source Used: 2022AA

Input Text:

This is a test with Straße


Output:

{
   "Document": {
     "CmdLine": {
       "Command": "metamap --lexicon db -Z 2022AA -A+ -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase /usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:[email protected]_124752701.tmp /usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:[email protected]_124752701.out",
       "Options": [
         {
           "OptName": "lexicon",
           "OptValue": "db"
         },
         {
           "OptName": "mm_data_year",
           "OptValue": "2022AA"
         },
         {
           "OptName": "strict_model"
         },
         {
           "OptName": "bracketed_output"
         },
         {
           "OptName": "restrict_to_sources",
           "OptValue": ["SNOMEDCT_US_2022_03_01"]
         },
         {
           "OptName": "JSONf",
           "OptValue": "2"
         },
         {
           "OptName": "mm_data_version",
           "OptValue": "USAbase"
         },
         {
           "OptName": "infile",
           "OptValue": "/usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:[email protected]_124752701.tmp"
         },
         {
           "OptName": "outfile",
           "OptValue": "/usr/local/apache/htdocs/II/Scheduler/foo/inter_12062022_06:12:[email protected]_124752701.out"
         }]
     },
     "AAs": [],
     "Negations": [],
     "Utterances": [
       {
         "PMID": "inter_12062022_06:12:[email protected]_124752701.tmp",
         "UttSection": "tx",
         "UttNum": "1",
         "UttText": "This is a test with Straße",
         "UttStartPos": "0",
         "UttLength": "26",
         "Phrases": [
           {
             "PhraseText": "This",
             "SyntaxUnits": [
               {
                 "SyntaxType": "pron",
                 "LexMatch": "this",
                 "InputMatch": "This",
                 "LexCat": "pron",
                 "Tokens": ["this"]
               }],
             "PhraseStartPos": "0",
             "PhraseLength": "4",
             "Candidates": [],
             "Mappings": []
           },
           {
             "PhraseText": "is",
             "SyntaxUnits": [
               {
                 "SyntaxType": "aux",
                 "LexMatch": "is",
                 "InputMatch": "is",
                 "LexCat": "aux",
                 "Tokens": ["is"]
               }],
             "PhraseStartPos": "5",
             "PhraseLength": "2",
             "Candidates": [],
             "Mappings": []
           },
           {
             "PhraseText": "a test with Straße",
             "SyntaxUnits": [
               {
                 "SyntaxType": "det",
                 "LexMatch": "a",
                 "InputMatch": "a",
                 "LexCat": "det",
                 "Tokens": ["a"]
               },
               {
                 "SyntaxType": "head",
                 "LexMatch": "test",
                 "InputMatch": "test",
                 "LexCat": "noun",
                 "Tokens": ["test"]
               },
               {
                 "SyntaxType": "prep",
                 "LexMatch": "with",
                 "InputMatch": "with",
                 "LexCat": "prep",
                 "Tokens": ["with"]
               },
               {
                 "SyntaxType": "mod",
                 "InputMatch": "Straße",
                 "LexCat": "noun",
                 "Tokens": ["straße"]
               }],
             "PhraseStartPos": "8",
             "PhraseLength": "18",
             "Candidates": [],
             "Mappings": [
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0022885",
                     "CandidateMatched": "Laboratory procedures",
                     "CandidatePreferred": "Laboratory Procedures",
                     "MatchedWords": ["test"],
                     "SemTypes": ["lbpr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               },
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0392366",
                     "CandidateMatched": "Tests (qualifier value)",
                     "CandidatePreferred": "Tests (qualifier value)",
                     "MatchedWords": ["test"],
                     "SemTypes": ["inpr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               },
               {
                 "MappingScore": "-770",
                 "MappingCandidates": [
                   {
                     "CandidateScore": "-770",
                     "CandidateCUI": "C0456984",
                     "CandidateMatched": "Test finding",
                     "CandidatePreferred": "Test Result",
                     "MatchedWords": ["test"],
                     "SemTypes": ["lbtr"],
                     "MatchMaps": [
                       {
                         "TextMatchStart": "2",
                         "TextMatchEnd": "2",
                         "ConcMatchStart": "1",
                         "ConcMatchEnd": "1",
                         "LexVariation": "0"
                       }],
                     "IsHead": "yes",
                     "IsOverMatch": "no",
                     "Sources": ["SNOMEDCT_US"],
                     "ConceptPIs": [
                       {
                         "StartPos": "10",
                         "Length": "4"
                       }],
                     "Status": "0",
                     "Negated": "0"
                   }]
               }]
           }]
       }]
   }
 }
]}

Can this be fixed by adjusting the MetaMap API to match the procedure of the MetaMap Online version?

@KimBenjaminTang

The same applies to other special characters, such as ü, ö, ä.

I don't know exactly how the strings are processed, but "Croé T" breaks it too, while "Croé" and "Croe T" pass.


example_text = """Croé T"""
args = "-AI -R SNOMEDCT_US_2022_03_01 --JSONf 2 -V USAbase -Z 2022AA"
inst = Submission(email, apikey)
inst.init_mm_interactive(example_text, args=args)
response = inst.submit()

Breaking here refers to the incomplete JSON, which ends at "UttText": [
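
One way to detect this failure programmatically is to try parsing the decoded body. The sketch below assumes, based on the output shown earlier in this issue, that the body starts with an echoed command line followed by the JSON document. `parse_metamap_json` is a hypothetical helper, not part of skr_web_api, and the two sample bodies are shortened, made-up stand-ins for real responses.

```python
import json


def parse_metamap_json(body: str):
    """Return the parsed JSON document, or None if the body is truncated.

    Assumption: the body may begin with a non-JSON line (the echoed
    command), so parsing starts at the first '{'.
    """
    start = body.find("{")
    if start == -1:
        return None
    try:
        return json.loads(body[start:])
    except json.JSONDecodeError:
        # Cut-off output such as '... "UttText": [' ends up here.
        return None


# Shortened stand-ins for a truncated and a complete response body.
truncated = 'SKRrun metamap ...\n{"AllDocuments":[\n{"Document": {"Utterances": [{"UttText": ['
complete = 'SKRrun metamap ...\n{"AllDocuments":[{"Document": {"Utterances": []}}]}'

print(parse_metamap_json(truncated) is None)
print(parse_metamap_json(complete) is not None)
```

With a real response, the argument would be `response.content.decode()`.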

So this can also be worked around by removing the "é", but in some cases that may lose valuable information.
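
A minimal sketch of such a workaround, assuming it is acceptable to fold the input to ASCII before submitting it. `fold_to_ascii` and the `SPECIAL` table are hypothetical names introduced here for illustration. Whether MetaMap returns the same mappings for folded text as the online version does for the original text is not confirmed in this thread.

```python
import unicodedata

# Explicit replacements for characters that NFKD does not decompose to ASCII.
# This table is an illustrative assumption and is not exhaustive.
SPECIAL = {"ß": "ss", "ẞ": "SS"}


def fold_to_ascii(text: str) -> str:
    """Replace non-ASCII characters with close ASCII equivalents."""
    replaced = "".join(SPECIAL.get(ch, ch) for ch in text)
    # NFKD splits 'é' into 'e' plus a combining accent and maps '²' to '2'.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop the combining marks and any other character that is still non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("This is a test with Straße"))  # This is a test with Strasse
print(fold_to_ascii("Croé T"))                      # Croe T
print(fold_to_ascii("m² T"))                        # m2 T
```

Two caveats: 'ß' becomes two characters, so position fields such as UttStartPos and PhraseLength no longer line up with the original text, and umlauts are folded to the bare vowel ('ü' to 'u') rather than the German convention 'ue'.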

@KimBenjaminTang KimBenjaminTang changed the title MetaMap API breaks when special character 'ß' occurs in a word MetaMap API breaks when special characters (e.g. 'ß') occurs in a word Dec 6, 2022
@KimBenjaminTang

KimBenjaminTang commented Dec 6, 2022

It also breaks with the string "m² T", due to the character ² being followed by another character/word. If ² is at the end of the string, followed by nothing other than whitespace, the input is processed:

  • "m² T" fails
  • "m² " succeeds
  • "m²" succeeds
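
To find which inputs in a larger batch might be affected, the text can be scanned for non-ASCII characters before submission. This is only a sketch: `non_ascii_chars` is a hypothetical helper, and it flags every non-ASCII character, including trailing ones that the cases above show are processed without error.

```python
def non_ascii_chars(text: str):
    """Return (index, character) pairs for every non-ASCII character."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


for sample in ["m² T", "Croé T", "Croe T", "This is a test with Straße"]:
    print(repr(sample), non_ascii_chars(sample))
```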
