annotating choice elements #1

Open
sinairusinek opened this issue Jul 1, 2024 · 2 comments

sinairusinek commented Jul 1, 2024

Using the SOC Python notebook, spaCy did a good job annotating the following phrase with entities:
a letter from the King of <placeName type="gpe">Jerusalem</placeName>, i.e. <persName>John de Brienne</persName>

However, when we have a choice element that looks like this:
a letter from the King of Jerusalem, i.e. John de <choice><sic>Brinn</sic><corr>Brienne</corr></choice>

(see tei-c choice)

When stripping the elements and printing the plain text, SOC printed "John de BrinBrienne"; when it exported the XML at the end, it restored the proper choice structure. However, the entity John de Brienne was not annotated.

Any idea why?

I would expect the following result:

a letter from the King of <placeName type="gpe">Jerusalem</placeName>, i.e. <persName>John de <choice><sic>Brinn</sic><corr>Brienne</corr></choice></persName>

@millawell

There are two parts to the answer:

  1. The standoff converter can exclude parts of the text based on the surrounding tags. View.exclude_inside (https://standoffconverter.readthedocs.io/en/latest/api.html#standoffconverter.View.exclude_inside) excludes all text inside any of the specified tags. In your case, you probably do not want the <sic> part to appear in the plain-text output:
view = View(so).exclude_inside("{http://www.tei-c.org/ns/1.0}sic").shrink_whitespace()

Afterwards, the plain text looks better and spaCy also recognizes the entity as PERSON.
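
To see why excluding <sic> changes what spaCy receives, here is a minimal standalone sketch using only Python's standard library. It does not use the standoff converter; `plain_text` is a hypothetical helper written for this illustration, and it only mimics the idea of skipping text inside selected tags.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
FRAGMENT = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">a letter from the King of Jerusalem, '
    'i.e. John de <choice><sic>Brinn</sic><corr>Brienne</corr></choice></p>'
)


def plain_text(elem, excluded=()):
    """Concatenate text in document order, skipping text inside excluded tags.

    Hypothetical helper for illustration only, not part of standoffconverter.
    """
    parts = []
    if elem.tag not in excluded:
        parts.append(elem.text or "")
        for child in elem:
            parts.append(plain_text(child, excluded))
            # The tail belongs to the parent's text flow, so it is always kept.
            parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)

# Without exclusion, the <sic> and <corr> readings run together.
print(plain_text(root))

# Excluding <sic> leaves only the corrected reading in the plain text.
print(plain_text(root, excluded={TEI + "sic"}))
```

With both readings concatenated, the name is no longer a recognizable token sequence, which is a plausible reason the NER model missed the entity.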

  2. However, with depth=None, the call
so.add_inline(
                begin=start_ind,
                end=end_ind,
                tag=tags_dict[label]['tag'],
                depth=None,
                attrib=tags_dict[label]['attr']
            )

does not find a unique context (that is what the error message reports).
The reason is visible in the original TEI:

John de <choice><sic>Brinn</sic><corr>Brienne</corr></choice>

The first part, "John de ", is at depth 4:

[<Element {http://www.tei-c.org/ns/1.0}text at 0x39fc80940>,
 <Element {http://www.tei-c.org/ns/1.0}front at 0x16ca1b940>,
 <Element {http://www.tei-c.org/ns/1.0}div at 0x16bf22100>,
 <Element {http://www.tei-c.org/ns/1.0}p at 0x39fc1eec0>]

while "Brienne" is two levels deeper:

[<Element {http://www.tei-c.org/ns/1.0}text at 0x39fc80940>,
 <Element {http://www.tei-c.org/ns/1.0}front at 0x16ca1b940>,
 <Element {http://www.tei-c.org/ns/1.0}div at 0x16bf22100>,
 <Element {http://www.tei-c.org/ns/1.0}p at 0x39fc1eec0>,
 <Element {http://www.tei-c.org/ns/1.0}choice at 0x39faffbc0>,
 <Element {http://www.tei-c.org/ns/1.0}corr at 0x39faffd40>]

With depth=None, the standoff converter tries to add the tag at the deepest position, which fails because the result would no longer be a well-formed XML tree:

<persName>John de <choice><sic>Brinn</sic><corr>Brienne</persName></corr></choice>

So in this particular case, it would be possible to add it explicitly at depth 4:

so.add_inline( ..., depth=4, ... )
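
The value 4 can be derived rather than guessed: it is the number of ancestors shared by the start and the end of the entity. The following standalone sketch (standard library only) rebuilds the nesting from the lists above and computes that shared depth. The helpers `chain_for_text` and `common_depth` are hypothetical names for this illustration, and it is an assumption that add_inline counts depth the same way as these lists do.

```python
import xml.etree.ElementTree as ET

NS = "http://www.tei-c.org/ns/1.0"
DOC = (
    f'<text xmlns="{NS}"><front><div><p>John de '
    '<choice><sic>Brinn</sic><corr>Brienne</corr></choice></p></div></front></text>'
)


def local(tag):
    """Strip the namespace from a tag such as '{ns}p'."""
    return tag.split("}", 1)[-1]


def chain_for_text(root, needle):
    """Return the ancestor chain (outermost first) of the element whose
    own .text contains `needle`, or None if no element matches."""
    def walk(elem, path):
        path = path + [local(elem.tag)]
        if elem.text and needle in elem.text:
            return path
        for child in elem:
            found = walk(child, path)
            if found:
                return found
        return None
    return walk(root, [])


def common_depth(a, b):
    """Length of the shared prefix of two ancestor chains."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


root = ET.fromstring(DOC)
start_chain = chain_for_text(root, "John de")   # where the entity starts
end_chain = chain_for_text(root, "Brienne")     # where the entity ends

print(start_chain)
print(end_chain)
print(common_depth(start_chain, end_chain))
```

A tag inserted at the shared depth wraps the whole <choice> element instead of cutting through it, which is the result requested at the top of this issue.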

As a more general approach, we could use add_span here instead.

@ElectricFrogy

Code has been fixed according to your explanation.

The line

view = View(so).shrink_whitespace()

was replaced with

view = View(so).exclude_inside("{http://www.tei-c.org/ns/1.0}sic").shrink_whitespace()

And this loop is now being used for the XML annotation:

# Annotate the named entities in the XML content
for i, ent in enumerate(doc.ents):
    # Map character offsets in the plain-text view to standoff table positions
    start_ind = view.get_table_pos(ent.start_char)
    end_ind = view.get_table_pos(ent.end_char)
    label = ent.label_

    print(f'{i} {start_ind=}\t{end_ind=}\t{label=}')

    if label not in tags_dict:
        print(label, '- not in dictionary -> IGNORED')
        continue

    try:
        # Use an explicit depth to avoid breaking the XML structure
        so.add_inline(
            begin=start_ind,
            end=end_ind,
            tag=tags_dict[label]['tag'],
            depth=4,  # set to 4 as suggested; specific to this document's nesting
            attrib=tags_dict[label]['attr']
        )
    except Exception as e:
        print(e)
