Over-consolidation of Inventor ID #3

Open
laironald opened this issue Jan 30, 2013 · 1 comment

Comments

@laironald

I ran a simple check on the inventor names grouped under each inventor ID and assigned each ID a code (nstr):
0 = I detect a middle name conflict
1 = a name is matched against a name without a middle name
2 = the names contain a middle name and the middle initial matches
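
The three codes above could be assigned roughly as follows. This is only a hypothetical sketch, not the code that produced the numbers in this issue: the function names (`middle_initial`, `classify`, `parse_record`) are made up for illustration, names are assumed to be "FIRST [MIDDLE ...] LAST" separated by spaces, and a cluster in which no name has a middle name is assumed to fall under code 1.

```python
def middle_initial(name):
    """Return the first letter of the first middle token, or None if absent."""
    tokens = name.split()
    # Assumption: fewer than three tokens means there is no middle name.
    if len(tokens) < 3:
        return None
    return tokens[1][0]


def classify(names):
    """Return 0, 1, or 2 for a list of names sharing one inventor ID."""
    initials = [middle_initial(n) for n in names]
    distinct = {i for i in initials if i is not None}
    if len(distinct) > 1:
        return 0  # two different middle initials: conflict
    if None in initials:
        return 1  # at least one name has no middle name (assumed catch-all)
    return 2      # every name has a middle name and the initials agree


def parse_record(line):
    """Split an 'inv_id|#patents|NAME,NAME,...' record into its parts."""
    inv_id, patents, names = line.split("|")
    return inv_id, int(patents), names.split(",")


inv_id, patents, names = parse_record(
    "03858572-2|31|JOHN F DYE,JOHN DYE,JOHN D DYE"
)
print(inv_id, patents, classify(names))
```

Checking for a conflict first matters: a cluster such as JOHN F DYE, JOHN DYE, JOHN D DYE contains a name without a middle name, but the F/D mismatch still places it in code 0.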

nstr  invs       patents    avgpats = patents/invs
0     8,644      176,307    20.39
1     1,611,032  6,279,199  3.89
2     1,480,296  3,955,163  2.67
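
As a quick check of the arithmetic in the table, the avgpats column can be recomputed from the invs and patents counts. This is a minimal sketch: the helper names are hypothetical, and the use of truncation (rather than rounding) to two decimals is an assumption chosen because it reproduces the reported values.

```python
import math

# (nstr, invs, patents) as reported in the table above.
ROWS = [
    (0, 8_644, 176_307),
    (1, 1_611_032, 6_279_199),
    (2, 1_480_296, 3_955_163),
]


def avgpats(patents, invs):
    """Patents per inventor ID, truncated (not rounded) to two decimals."""
    return math.floor(patents / invs * 100) / 100


def share_of_inventors(nstr):
    """Fraction of all inventor IDs that fall under the given nstr code."""
    total = sum(invs for _, invs, _ in ROWS)
    invs = next(i for code, i, _ in ROWS if code == nstr)
    return invs / total


for code, invs, patents in ROWS:
    print(code, f"{avgpats(patents, invs):.2f}")

# The nstr=0 group is roughly 0.28% of all inventor IDs.
print(f"{share_of_inventors(0):.2%}")
```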

For example, the nstr=0 group includes the following (first 10 entries):

(format: inv_id|#patents|unique names clumped together)
03858572-2|31|JOHN F DYE,JOHN DYE,JOHN D DYE
03858760-1|45|ANTONIN GONCALVES,ANTONIN L GONCALVES,ANTONIN C GONCALVES
03858787-3|19|ROGER M FLOYD,ROGER N FLOYD
03859063-2|8|STEVEN I TAUB,STEVEN L TAUB
03859092-1|42|HENRY J GYSLING,HENRY L GYSLING,HENRY JAMES GYSLING
03859097-1|4|FREDRICK L HAMB,FREDERICK L HAMB,FREDERICK T HAMB,FREDERICK D HAMB
03859113-2|18|WILLIAM C STUMPHAUZER,WILLIAM S STUMPHAUZER
03859119-1|316|JAMES C FLETCHER,JAMES ADMINISTR FLETCHER,JAMES CORVIN FLETCHER,J CLINT FLETCHER,J CLINTON FLETCHER
03859298-1|72|JOHN H SELLSTEDT,JOHN H SELLSTED,JOHN M SELLSTEDT
03859356-1|109|WILLIAM J HOULIHAN,WILLIAM H HOULIHAN

As you can see from the first record:

  • John F Dye gets clumped with John D Dye, which is clearly incorrect.
  • Same idea for the remainder. The James Fletcher one (#03859119-1) is particularly concerning (316 patents) and looks to be at least 3 individuals mashed together.

While this is a relatively small percentage of all inventors identified (8,644 IDs), the avgpats for these individuals is extremely high compared to the others (20.39 versus 3.89 and 2.67). I've run into these individuals when creating networks, and they create some strange networks! That said, visually observing the data also suggests some interesting blocking mechanisms for further disambiguation, which I would love to share. I think the more we show these results visually via APIs, the more obvious some data issues may become.

@doolin
Member

doolin commented Jan 30, 2013

Please post the code which exposed the bug so we can reproduce it. Thanks.
