PPJoin

PPJoin and P4Join Python 3 implementation.

PPJoin

PPJoin stands for Position Prefix Join which is an efficient set similarity join algorithm using the Jaccard similarity with several filtering techniques.

PPJoin is introduced in

Xiao, Chuan, et al. "Efficient similarity joins for near-duplicate detection." ACM Transactions on Database Systems (TODS) 36.3 (2011): 1-41.

This implementation is based on https://github.com/teh/ppjoin.

join function takes a list of datasets from different parties and a threshold t as input. Each dataset is a list of records and each record is formed by list of tokens.

ppjoin.join(datasets: List[List[List[str]]], t: float) -> Set[Tuple[Tuple]]

The return will be a set of tuples and each tuple contains two inner tuples:

((dataset1 index, record index), (dataset2 index, record index))

Example:

from ppjoin import ppjoin

def tokenizer(record):
    return set(ppjoin.whitespace_tokenizer(record.lower()))


ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]


result = ppjoin.join(ds, t=0.5)

for r in result:
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))

Output:

Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['h', 'k'] from dataset 0, ['h', 'k'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 1
Found pair: ['a', 'c', 'h'] from dataset 1, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b'] from dataset 1

P4Join

P4Join (Privacy-Preserving Prefix Position Join) adapts PPJoin with bit operations to solve privacy-preserving record linkage problem. It supports length, prefix and an optimized position filter.

This is introduced in

Sehili, Ziad, et al. "Privacy preserving record linkage with PPJoin." Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).

First step of P4Join is to encode original record into bit vector. encode_record takes a list of records each contains a list of tokens, a HMAC key, the length of the bit vector and k which indicates applying how many rounds of combined hash functions. The return of it is a list of encoded record vectors.

p4join.encode_record(record: List[List[str]], hmac_key: str, vec_len: int, k: int = 2) -> List[int]

P4Join's join function is similar to PPJoin's but takes encoded datasets as input. The return format is also identical to PPJoin.

p4join.join(datasets: List[List[int]], t: float = 0, vec_len: int = 0) -> Set[Tuple[Tuple]]

Example:

from ppjoin import ppjoin, p4join


def tokenizer(record):
    return set(ppjoin.whitespace_tokenizer(record.lower()))


hash_key = 'key'
vec_len = 40
k = 2

ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]

ds_encoded = [
    [p4join.encode_record(w, hash_key, vec_len, k) for w in d] for d in ds
]


result = p4join.join(ds_encoded, t=0.5, vec_len=vec_len)

for r in result:
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))

Installation

pip install -e .

Test

To run all unit tests:

python -m unittest discover ppjoin/tests

Tests on real world dataset Abt-Buy is from DBGroup of Leipzig.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
examples		examples
ppjoin		ppjoin
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PPJoin

PPJoin

P4Join

Installation

Test

About

Releases 2

Packages

Languages

License

usc-isi-i2/ppjoin

Folders and files

Latest commit

History

Repository files navigation

PPJoin

PPJoin

P4Join

Installation

Test

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages