Skip to content

usc-isi-i2/ppjoin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PPJoin doi

PPJoin and P4Join Python 3 implementation.

PPJoin

PPJoin stands for Position Prefix Join which is an efficient set similarity join algorithm using the Jaccard similarity with several filtering techniques.

PPJoin is introduced in

Xiao, Chuan, et al. "Efficient similarity joins for near-duplicate detection." ACM Transactions on Database Systems (TODS) 36.3 (2011): 1-41.

This implementation is based on https://github.com/teh/ppjoin.

join function takes a list of datasets from different parties and a threshold t as input. Each dataset is a list of records and each record is formed by list of tokens.

ppjoin.join(datasets: List[List[List[str]]], t: float) -> Set[Tuple[Tuple]]

The return will be a set of tuples and each tuple contains two inner tuples:

((dataset1 index, record index), (dataset2 index, record index))

Example:

from ppjoin import ppjoin

def tokenizer(record):
    return set(ppjoin.whitespace_tokenizer(record.lower()))


ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]


result = ppjoin.join(ds, t=0.5)

for r in result:
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))

Output:

Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['h', 'k'] from dataset 0, ['h', 'k'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b', 'k'] from dataset 1
Found pair: ['a', 'b', 'd'] from dataset 0, ['a', 'b'] from dataset 1
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'c', 'h'] from dataset 1
Found pair: ['a', 'c', 'h'] from dataset 1, ['a', 'c', 'h'] from dataset 2
Found pair: ['a', 'b', 'c'] from dataset 0, ['a', 'b'] from dataset 1

P4Join

P4Join (Privacy-Preserving Prefix Position Join) adapts PPJoin with bit operations to solve privacy-preserving record linkage problem. It supports length, prefix and an optimized position filter.

This is introduced in

Sehili, Ziad, et al. "Privacy preserving record linkage with PPJoin." Datenbanksysteme für Business, Technologie und Web (BTW 2015) (2015).

First step of P4Join is to encode original record into bit vector. encode_record takes a list of records each contains a list of tokens, a HMAC key, the length of the bit vector and k which indicates applying how many rounds of combined hash functions. The return of it is a list of encoded record vectors.

p4join.encode_record(record: List[List[str]], hmac_key: str, vec_len: int, k: int = 2) -> List[int]

P4Join's join function is similar to PPJoin's but takes encoded datasets as input. The return format is also identical to PPJoin.

p4join.join(datasets: List[List[int]], t: float = 0, vec_len: int = 0) -> Set[Tuple[Tuple]]

Example:

from ppjoin import ppjoin, p4join


def tokenizer(record):
    return set(ppjoin.whitespace_tokenizer(record.lower()))


hash_key = 'key'
vec_len = 40
k = 2

ds0 = ['a b d', 'a b c', 'h k']
ds1 = ['a b k', 'a b', 'h k', 'a c h']
ds2 = ['a c h']
ds = [
    [tokenizer(w) for w in ds0],
    [tokenizer(w) for w in ds1],
    [tokenizer(w) for w in ds2]
]

ds_encoded = [
    [p4join.encode_record(w, hash_key, vec_len, k) for w in d] for d in ds
]


result = p4join.join(ds_encoded, t=0.5, vec_len=vec_len)

for r in result:
    ds1_id, r1id = r[0]
    ds2_id, r2id = r[1]
    print('Found pair: {} from dataset {}, {} from dataset {}'.format(
        ds[ds1_id][r1id], ds1_id, ds[ds2_id][r2id], ds2_id
    ))

Installation

pip install -e .

Test

To run all unit tests:

python -m unittest discover ppjoin/tests

Tests on real world dataset Abt-Buy is from DBGroup of Leipzig.