diff --git a/joss.05619/10.21105.joss.05619.crossref.xml b/joss.05619/10.21105.joss.05619.crossref.xml new file mode 100644 index 0000000000..de18777876 --- /dev/null +++ b/joss.05619/10.21105.joss.05619.crossref.xml @@ -0,0 +1,217 @@ + + + + 20231111T033521-f1baa724b16b67da3edc2fd44ec3831749ab952b + 20231111033521 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org + + + + + 11 + 2023 + + + 8 + + 91 + + + + ER-Evaluation: End-to-End Evaluation of Entity +Resolution Systems + + + + Olivier + Binette + https://orcid.org/0000-0001-6009-5206 + + + Jerome P. + Reiter + https://orcid.org/0000-0002-8374-3832 + + + + 11 + 11 + 2023 + + + 5619 + + + 10.21105/joss.05619 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.10086102 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/5619 + + + + 10.21105/joss.05619 + https://joss.theoj.org/papers/10.21105/joss.05619 + + + https://joss.theoj.org/papers/10.21105/joss.05619.pdf + + + + + + (Almost) all of entity +resolution + Binette + Science Advances + 12 + 8 + 10.1126/sciadv.abi8021 + 2022 + Binette, O., & Steorts, R. C. +(2022). (Almost) all of entity resolution. Science Advances, 8(12), +eabi8021. https://doi.org/10.1126/sciadv.abi8021 + + + Estimating the performance of entity +resolution algorithms: Lessons learned through +PatentsView.org + Binette + The American Statistician + 4 + 77 + 10.1080/00031305.2023.2191664 + 2023 + Binette, O., York, S. A., Hickerson, +E., Baek, Y., Madhavan, S., & Jones, C. (2023). Estimating the +performance of entity resolution algorithms: Lessons learned through +PatentsView.org. The American Statistician, 77(4), 370–380. 
+https://doi.org/10.1080/00031305.2023.2191664 + + + PatentsView-Evaluation: Evaluation datasets +and tools to advance research on inventor name +disambiguation + Binette + arXiv e-prints + 10.48550/arXiv.2301.03591 + 2023 + Binette, O., Madhavan, S., Butler, +J., Card, B. A., Melluso, E., & Jones, C. (2023). +PatentsView-Evaluation: Evaluation datasets and tools to advance +research on inventor name disambiguation. arXiv e-Prints. +https://doi.org/10.48550/arXiv.2301.03591 + + + An end-to-end evaluation framework for entity +resolution systems with application to inventor name +disambiguation + Binette + 2023 + Binette, O., Baek, Y., Melluso, E., +Jones, C., Dasylva, A., & Reiter, J. P. (2023). An end-to-end +evaluation framework for entity resolution systems with application to +inventor name disambiguation. + + + Bridging the gap between reality and ideality +of entity matching: A revisting and benchmark +re-construction + Wang + Proceedings of the Thirty-First International +Joint Conference on Artificial Intelligence, IJCAI-22 + 10.24963/ijcai.2022/552 + 2022 + Wang, T., Lin, H., Fu, C., Han, X., +Sun, L., Xiong, F., Chen, H., Lu, M., & Zhu, X. (2022). Bridging the +gap between reality and ideality of entity matching: A revisting and +benchmark re-construction. In L. D. Raedt (Ed.), Proceedings of the +Thirty-First International Joint Conference on Artificial Intelligence, +IJCAI-22 (pp. 3978–3984). International Joint Conferences on Artificial +Intelligence Organization. +https://doi.org/10.24963/ijcai.2022/552 + + + Data Matching: Concepts and Techniques for +Record Linkage, Entity Resolution, and Duplicate +Detection + Christen + 2012 + Christen, P. (2012). Data Matching: +Concepts and Techniques for Record Linkage, Entity Resolution, and +Duplicate Detection. Springer Publishing Company, +Incorporated. + + + In search of an entity resolution OASIS: +Optimal asymptotic sequential importance sampling + Marchant + Proc. VLDB Endow. 
+ 11 + 10 + 10.14778/3137628.3137642 + 2017 + Marchant, N. G., & Rubinstein, B. +I. P. (2017). In search of an entity resolution OASIS: Optimal +asymptotic sequential importance sampling. Proc. VLDB Endow., 10(11), +1322–1333. +https://doi.org/10.14778/3137628.3137642 + + + The Four Generations of Entity +Resolution + Papadakis + 2021 + Papadakis, G., Ioannou, E., Thanos, +E., & Palpanas, T. (2021). The Four Generations of Entity +Resolution. Morgan & Claypool Publishers. + + + An overview of end-to-end entity resolution +for big data + Christophides + ACM Computing Surveys + 6 + 53 + 10.1145/3418896 + 2021 + Christophides, V., Efthymiou, V., +Palpanas, T., Papadakis, G., & Stefanidis, K. (2021). An overview of +end-to-end entity resolution for big data. ACM Computing Surveys, 53(6), +1–42. https://doi.org/10.1145/3418896 + + + + + + diff --git a/joss.05619/10.21105.joss.05619.jats b/joss.05619/10.21105.joss.05619.jats new file mode 100644 index 0000000000..c60832a24c --- /dev/null +++ b/joss.05619/10.21105.joss.05619.jats @@ -0,0 +1,330 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +5619 +10.21105/joss.05619 + +ER-Evaluation: End-to-End Evaluation of Entity Resolution +Systems + + + +https://orcid.org/0000-0001-6009-5206 + +Binette +Olivier + + +* + + +https://orcid.org/0000-0002-8374-3832 + +Reiter +Jerome P. + + + + + +Duke University, USA + + + + +* E-mail: + + +6 +5 +2023 + +8 +91 +5619 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Python +Entity Resolution +Evaluation + + + + + + Summary +

Entity resolution (ER), also referred to as record linkage and + deduplication, is the process of identifying and matching distinct + representations of real-world entities across diverse data sources. It + plays a crucial role in data management, cleaning, and integration, + with applications such as assessing the accuracy of the decennial + census, detecting fraud, linking patient data in healthcare, and + extracting relationships in structured and unstructured data + (Binette + & Steorts, 2022; + Christen, + 2012; + Christophides + et al., 2021; + Papadakis + et al., 2021).

+

As ER techniques continue to evolve, an efficient and comprehensive evaluation framework is essential for measuring their performance and comparing approaches. Despite the growth of ER research, there remains a need for a unified evaluation framework that addresses the challenges of ER system evaluation, including accounting for sampling biases and managing class imbalances. Without a principled evaluation methodology, naive clustering metrics computed on toy benchmark datasets generally yield over-optimistic results, which can cause performance rank reversals and poor system design (Binette, York, et al., 2023; Wang et al., 2022).

+

ER-Evaluation is a Python 3.7+ package designed to address these challenges by implementing all components of a principled evaluation framework for ER systems. It incorporates statistical estimators for key performance metrics and summary statistics, error analysis tools, data labeling tools, and data visualizations. The package has a simple architecture, which keeps it straightforward to port to other languages and frameworks when necessary.

+

Additionally, ER-Evaluation adopts a novel entity-centric approach that uses disambiguated entity clusters as the foundation for analysis. This contrasts with traditional evaluation methods based on labeling record pairs (Marchant & Rubinstein, 2017). The entity-centric approach simplifies the use of existing benchmark datasets and the labeling of new datasets, without requiring complex sampling schemes. Furthermore, it enables the reuse of benchmark datasets at all stages of the evaluation process, including cluster-level error analysis.
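
To make the idea of cluster-level error analysis concrete, the following is a minimal sketch in plain Python. It does not use the ER-Evaluation API; the function `cluster_errors`, the dictionary layout, and the two reported quantities are hypothetical choices made for illustration only. For each disambiguated entity in a sample, it counts how many predicted clusters the entity is split across and how many records from other entities were merged into those clusters.

```python
from collections import Counter


def cluster_errors(sampled_entities, predicted):
    """Summarize, per sampled true entity, how a predicted clustering distorts it.

    sampled_entities: dict mapping an entity id to the list of its record ids
                      (hypothetical layout, assumed for this sketch).
    predicted:        dict mapping every record id to a predicted cluster label.
    """
    predicted_sizes = Counter(predicted.values())
    report = {}
    for entity, records in sampled_entities.items():
        # Predicted labels touched by this entity, with record counts per label.
        labels = Counter(predicted[record] for record in records)
        report[entity] = {
            # Number of predicted clusters the entity is spread over (1 = not split).
            "split_across": len(labels),
            # Records of other entities sharing those predicted clusters.
            "extraneous_records": sum(
                predicted_sizes[label] - count for label, count in labels.items()
            ),
        }
    return report


# Toy data: entity A is kept together but merged with r3; entity B is split.
predicted = {"r1": 1, "r2": 1, "r3": 1, "r4": 2, "r5": 3}
sampled_entities = {"A": ["r1", "r2"], "B": ["r3", "r4"]}
report = cluster_errors(sampled_entities, predicted)
```

Because each row of the report is tied to one ground-truth entity, the same labeled clusters can serve both metric estimation and error inspection, which is the reuse the paragraph above describes.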

+
+ + Statement of need +

Entity resolution is a clustering problem characterized by small + and numerous clusters (up to millions or billions of clusters). + Researchers commonly evaluate the performance of entity resolution + systems by computing performance metrics (precision, recall, cluster + metrics) on relatively small benchmark datasets. However, this process + has been shown to yield biased and over-optimistic performance + assessments in ER, potentially leading to performance rank reversals + and poor system design + (Binette, + York, et al., 2023; + Wang + et al., 2022).
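
As an illustration of the naive benchmark computation described above, here is a minimal sketch of pairwise precision and recall in plain Python. It is not the ER-Evaluation implementation; the helper names `links` and `pairwise_precision_recall` and the membership-dictionary layout are assumptions made for this example. The numbers it returns describe the benchmark records only, which is why they can be over-optimistic for the full population.

```python
from itertools import combinations


def links(membership):
    """Return the set of unordered record pairs that share a cluster label."""
    clusters = {}
    for record, label in membership.items():
        clusters.setdefault(label, []).append(record)
    return {
        frozenset(pair)
        for records in clusters.values()
        for pair in combinations(sorted(records), 2)
    }


def pairwise_precision_recall(predicted, reference):
    """Naive pairwise precision and recall on the benchmark records only."""
    # Restrict the prediction to records that appear in the benchmark.
    predicted = {r: c for r, c in predicted.items() if r in reference}
    predicted_links, reference_links = links(predicted), links(reference)
    correct_links = predicted_links & reference_links
    precision = len(correct_links) / len(predicted_links) if predicted_links else 1.0
    recall = len(correct_links) / len(reference_links) if reference_links else 1.0
    return precision, recall


# Toy benchmark: three true entities (A, B, C) and a prediction that
# wrongly merges r3 into the cluster of r1 and r2.
reference = {"r1": "A", "r2": "A", "r3": "B", "r4": "B", "r5": "C"}
predicted = {"r1": 1, "r2": 1, "r3": 1, "r4": 2, "r5": 3}
precision, recall = pairwise_precision_recall(predicted, reference)
```

In this toy case one of the three predicted links is correct and one of the two true links is recovered, so precision is 1/3 and recall is 1/2.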

+

To address this issue, a new entity-centric methodology has been + proposed in Binette, York, et al. + (2023) + for obtaining accurate performance metric estimates based on small and + potentially biased benchmark datasets. The ER-Evaluation package + implements this methodology and numerous extensions to create a + comprehensive, end-to-end evaluation framework. It aims to streamline + the comparison of diverse ER techniques, assess their accuracy, and + ultimately accelerate the development and adoption of high-performing + ER systems. By integrating essential components such as data + preprocessing, error analysis, performance estimation, and + visualization functions, ER-Evaluation offers a user-friendly, + modular, and extensible interface for researchers and + practitioners.

+

The software is currently being used by PatentsView.org for the evaluation of patent inventor name disambiguation (Binette, Madhavan, et al., 2023). The original methodology has been published in Binette, York, et al. (2023), and an extended methodology is being developed in an upcoming article (Binette, Baek, et al., 2023).

+
+ + Acknowledgements +

We acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada and from the Fonds de Recherche du Québec - Nature et Technologies.

+
+ + + + + + + BinetteOlivier + SteortsRebecca C + + (Almost) all of entity resolution + Science Advances + 2022 + 8 + 12 + 10.1126/sciadv.abi8021 + eabi8021 + + + + + + + BinetteOlivier + YorkSokhna A + HickersonEmma + BaekYoungsoo + MadhavanSarvo + JonesChristina + + Estimating the performance of entity resolution algorithms: Lessons learned through PatentsView.org + The American Statistician + Taylor & Francis + 2023 + 77 + 4 + 10.1080/00031305.2023.2191664 + 370 + 380 + + + + + + BinetteOlivier + MadhavanSarvo + ButlerJack + CardBeth Anne + MellusoEmily + JonesChristina + + PatentsView-Evaluation: Evaluation datasets and tools to advance research on inventor name disambiguation + arXiv e-prints + 2023 + 10.48550/arXiv.2301.03591 + + + + + + BinetteOlivier + BaekYoungsoo + MellusoEmily + JonesChristina + DasylvaAbel + ReiterJerome P + + An end-to-end evaluation framework for entity resolution systems with application to inventor name disambiguation + 2023 + + + + + + WangTianshu + LinHongyu + FuCheng + HanXianpei + SunLe + XiongFeiyu + ChenHui + LuMinlong + ZhuXiuwen + + Bridging the gap between reality and ideality of entity matching: A revisting and benchmark re-construction + Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22 + + RaedtLud De + + International Joint Conferences on Artificial Intelligence Organization + 2022 + 10.24963/ijcai.2022/552 + 3978 + 3984 + + + + + + ChristenPeter + + Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection + Springer Publishing Company, Incorporated + 2012 + + + + + + MarchantNeil G. + RubinsteinBenjamin I. P. + + In search of an entity resolution OASIS: Optimal asymptotic sequential importance sampling + Proc. VLDB Endow. 
+ VLDB Endowment + 2017 + 10 + 11 + 10.14778/3137628.3137642 + 1322 + 1333 + + + + + + PapadakisGeorge + IoannouEkaterini + ThanosEmanouil + PalpanasThemis + + The Four Generations of Entity Resolution + Morgan & Claypool Publishers + 2021 + + + + + + ChristophidesVassilis + EfthymiouVasilis + PalpanasThemis + PapadakisGeorge + StefanidisKostas + + An overview of end-to-end entity resolution for big data + ACM Computing Surveys + 2021 + 53 + 6 + 10.1145/3418896 + 1 + 42 + + + + +
diff --git a/joss.05619/10.21105.joss.05619.pdf b/joss.05619/10.21105.joss.05619.pdf new file mode 100644 index 0000000000..6df094ef2e Binary files /dev/null and b/joss.05619/10.21105.joss.05619.pdf differ diff --git a/joss.05619/media/examples.png b/joss.05619/media/examples.png new file mode 100644 index 0000000000..6c10f8431b Binary files /dev/null and b/joss.05619/media/examples.png differ diff --git a/joss.05619/media/framework.drawio b/joss.05619/media/framework.drawio new file mode 100644 index 0000000000..f89a7b37d2 --- /dev/null +++ b/joss.05619/media/framework.drawio @@ -0,0 +1,61 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/joss.05619/media/framework.png b/joss.05619/media/framework.png new file mode 100644 index 0000000000..71274168d8 Binary files /dev/null and b/joss.05619/media/framework.png differ diff --git a/joss.05619/media/plot_comparison.png b/joss.05619/media/plot_comparison.png new file mode 100644 index 0000000000..2a0aae476d Binary files /dev/null and b/joss.05619/media/plot_comparison.png differ diff --git a/joss.05619/media/plot_decisiontree.png b/joss.05619/media/plot_decisiontree.png new file mode 100644 index 0000000000..22f9cf231e Binary files /dev/null and b/joss.05619/media/plot_decisiontree.png differ diff --git a/joss.05619/media/plot_estimates.png b/joss.05619/media/plot_estimates.png new file mode 100644 index 0000000000..f794e8994f Binary files /dev/null and b/joss.05619/media/plot_estimates.png differ diff --git a/joss.05619/media/plot_summaries.png b/joss.05619/media/plot_summaries.png new file mode 100644 index 0000000000..bbce36fcbb Binary files /dev/null and b/joss.05619/media/plot_summaries.png differ