- Cleaned up block learning
- Improved performance of connected components algorithm with very large components
- Fixed pickling/unpickling bug in Index predicate classes
- Implemented a disagreement based active labeler to improve blocking recall
- removed shelve-backed persistence in blocking data in favor of an improved in-memory implementation
- matchBlocks is not a generator; match is now optionally a generator. If the generator option is turned on for the Gazetteer, match is lazy
- Speed up blocking, on our way to 3-predicates
- Significantly reduced memory footprint during connected_components
- Significantly reduced memory footprint during scoreDuplicates
- Improper release
- TempShelve class that addresses various bugs related to cleaning up temporary shelves
- Added `target` argument to blocker and predicates for changing the behavior of the predicates for the target and source dataset if we are linking.
- Use file-backed blocking with dbm, which dramatically increases the size of data that can be handled without special programming
- Reduce memory footprint of matching
- Simplify .train method
- Levenshtein search based index predicates thanks to @mattandahalfew
- simplified the sample API, this might be a breaking change for some
- the active learner interface is now more modular to allow for a different learner
- random sampling of pairs has been improved for linking case and dedupe case, h/t to @MarkusShepherd
- frozendicts have finally been removed
- first N char predicates now return the entire string if its length is less than N, instead of nothing
- crossvalidation is skipped in active learning if using default rlr learner
- Block indexes can now be persisted by using the index=True argument in the writeSettings method
- Now uses C version of double metaphone for speed
- Much faster compounding of blocks in block learning
- Block learning now tries to minimize the total number of comparisons, not just the comparisons of distinct records. This decouples block learning from classifier learning. This change requires new, different arguments to the train method.
- Console labeler now shows fields in the order they are defined in the data model. The labeler also reports number of labeled examples
- `pud` argument added to the `train` method. Proportion of uncovered dupes. This deprecates the `uncovered_dupes` argument
- If we have enough training data, consider Compound predicates of length 3 in addition to predicates of length 2
- None now treated as missing data indicator. Warnings for deprecations of older types of missing data indicators
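
The change to the first N char predicates noted above can be illustrated with a minimal sketch. The function names here are hypothetical and are not taken from the library; the sketch only contrasts the two behaviors the entry describes.

```python
def first_n_chars_old(field, n):
    """Hypothetical sketch of the earlier behavior: no block key for short fields."""
    # A field shorter than n produced nothing, so short values were never blocked.
    return (field[:n],) if len(field) >= n else ()


def first_n_chars_new(field, n):
    """Hypothetical sketch of the new behavior: short fields yield themselves."""
    if not field:
        return ()
    # Slicing past the end of a string is safe, so a short field is returned whole.
    return (field[:n],)


short_old = first_n_chars_old("abc", 5)
short_new = first_n_chars_new("abc", 5)
long_new = first_n_chars_new("abcdefg", 5)
```

Under this reading, short values such as abbreviations still receive a block key and can be compared with each other.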
Features
- Handle FuzzyCategoricalType in datamodel
Features
- Speed up learning
- Parallelize sampling
- Optional CRF Edit Distance
Support for Python 3.4 added. Support for Python 2.6 dropped.
Features
- Windows OS supported
- train method has argument for not considering index predicates
- TfIDFNGram Index Predicate added (for shorter strings)
- SuffixArray Predicate
- Double Metaphone Predicates
- Predicates for numbers, OrderOfMagnitude, Round
- Set Predicate OrderOfCardinality
- Final, learned predicates list will now often be smaller without loss of coverage
- Variables refactored to support external extensions like https://github.com/datamade/dedupe-variable-address
- Categorical distance, regularized logistic regression, affine gap distance, canonicalization have been turned into separate libraries.
- Simplejson is now a dependency
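
To give a sense of what the number predicates in this list might compute, here is a minimal sketch. The function names and the exact rounding rules are assumptions made for illustration, not the library's implementation; the sketch is aimed at positive values of 1 or more.

```python
import math


def order_of_magnitude(value):
    """Hypothetical predicate: block numbers by their rounded base-10 logarithm."""
    if value <= 0:
        return ()  # log10 is undefined here, so emit no block key
    return (str(int(round(math.log10(value)))),)


def round_to_one_digit(value):
    """Hypothetical predicate: block numbers by their value rounded to one significant digit."""
    if value <= 0:
        return ()
    order = math.floor(math.log10(value))  # position of the leading digit
    scale = 10 ** order
    return (str(round(value / scale) * scale),)


magnitude_key = order_of_magnitude(1234)
rounded_key = round_to_one_digit(1234)
```

Both functions return a tuple of string keys, so two records land in the same block when their numbers are of similar size (for example 1234 and 1400 under `round_to_one_digit`).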
Features
- Individual record cluster membership scores
- New predicates
- New Exists Variable Type
Bug Fixes
- Latlong predicate fixed
- Set TFIDF canopy working properly
Features
- Sampling methods now use blocked sampling
Version 0.7.0 is backwards compatible, except for the match method of Gazetteer class
Features
- new index, unindex, and match methods in Gazetteer matching. Useful for streaming matching
Version 0.6.0 is not backwards compatible.
Features :
- new Text, ShortString, and exact string types
- multiple variables can be defined on same field
- new Gazetteer linker for matching dirty records against a master list
- performance improvements, particularly in memory usage
- canonicalize function in dedupe.convenience for creating a canonical representation of a cluster of records
- tons of bugfixes
API breaks
- when initializing an ActiveMatching object, `variable_definition` replaces `field_definition` and is a list of dictionaries instead of a dictionary. See the documentation for details
- also when initializing a Matching object, `num_processes` has been replaced by `num_cores`, which now defaults to the number of cpus on the machine
- when initializing a StaticMatching object, `settings_file` is now expected to be a file object not a string. The `readTraining`, `writeTraining`, `writeSettings` methods also all now expect file objects
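
The move from a dictionary to a list of dictionaries can be sketched as a small conversion. The helper below is hypothetical and not part of the library, and the `field` and `type` key names are an assumption about the shape of a definition; check the documentation for the exact format.

```python
def to_variable_definition(field_definition):
    """Hypothetical helper: turn a dict keyed by field name into a list of dicts."""
    variables = []
    for field, spec in field_definition.items():
        variable = {"field": field}  # the field name moves inside each entry
        variable.update(spec)        # keep the remaining settings, e.g. the type
        variables.append(variable)
    return variables


# Assumed old-style definition: one entry per field name.
old_style = {"name": {"type": "String"}, "address": {"type": "String"}}
new_style = to_variable_definition(old_style)
```

A list has no unique-key constraint, which fits the feature noted above that multiple variables can be defined on the same field.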
Version 0.5 is not backwards compatible.
Features :
- Special case code for linking two datasets that are individually unique
- Parallel processing using python standard library multiprocessing
- Much faster canopy creation using zope.index
- Asynchronous active learning methods
API breaks :
- `duplicateClusters` has been removed, it has been replaced by `match` and `matchBlocks`
- `goodThreshold` has been removed, it has been replaced by `threshold` and `thresholdBlocks`
- the meaning of `train` has changed. To train from training file use `readTraining`. To use console labeling, pass a dedupe instance to the `consoleLabel` function
- The convenience function dataSample has been removed. It has been replaced by the `sample` methods
- It is no longer necessary to pass `frozendicts` to `Matching` classes
- `blockingFunction` has been removed and been replaced by the `blocker` method