BUG: Queen and Rook from_dataframe do not match docs #184

sjsrey · 2019-09-03T00:27:22Z

This address #183

jGaboardi

Nice. I was wondering about the Rook.from_dataframe and Queen.from_dataframe not matching.

As a side note, would it be possible to source the internals for the Rook and Queen classes from one set of methods? As is stands, aren't the Rook and Queen class methods identical?

sjsrey · 2019-09-03T03:02:41Z

Nice. I was wondering about the Rook.from_dataframe and Queen.from_dataframe not matching.

As a side note, would it be possible to source the internals for the Rook and Queen classes from one set of methods? As is stands, aren't the Rook and Queen class methods identical?

We need to hear from @ljwolf before doing much more as he was the original author and I'm not sure the changes reflect his original intent.

ljwolf · 2019-09-03T09:17:35Z

Doesn't this fail if idVariable is none but ids is passed?

ljwolf · 2019-09-03T09:30:45Z

In general, this works better. The behavior I was going for was (0) use the index if nothing else is provided, (1) use idvariable to build IDs if it was provided, (2) use IDs if provided and, (3) keep the order of the now-resolved IDs if id_order was not provided

This was awkward because of the id_order behavior and pysal's silent sorting of the input weights if you don't force their order. Id_order basically always needs to be set to either range(0,n) (to avoid silent lexicographic sorting of the IDs) or the user-specified order that's different from their input dataframe.

Splitting id_order from the order in which the inputs are provided adds so much complexity... Sorry I missed that twice...

jGaboardi

👍

ljwolf · 2019-09-03T12:17:19Z

Here's a bit of a test matrix.

I still think the main issue is that PySAL separates the ids from how the ids are ordered. We talked about this back in GSOC 2017. It should be the responsibility of stuff outside the class to make sure the input data is ordered correctly, because the worst surprise is that your weights are silently incorrect w.r.t. the ordering of your input data. By default, PySAL does this because (at the lowest level) weight keys are sorted lexicographically unless forced to retain the original ordering (docstring, location in code)

Sorting weight keys by default will make the weights out of alignment with the input dataframe unless the user knows this and presorts their data or happens to have it sorted by default. This will be a silent error, since the weights will still be of the right shape, just not representing the right linkages.

I raised this back in my GSOC, but we decided to keep it around, rather than remove it. I understand how this arose, but I still think the "right" fix is to just stop sorting things on input. Things become dramatically simpler, then:

take the ids from the index of the dataframe. If a user wants to use a variable, make them set that as the index on the dataframe explicitly
for collections without indices, use the ids in the order it's provided or default to range(0,n).

Quite a few PRs have attempted to address this inconsistency (#137, pysal#988, pysal#951 (refresh of pysal#922), pysal#946), and this solution, while a lot better than current, is still inconsistent, just within the Rook class:

import geopandas, numpy
from libpysal import weights, examples

full = geopandas.read_file(examples.get_path('south.shp'))
numpy.random.seed(11121)
part = full.sample(frac=.1)

nothing = weights.Rook.from_dataframe(part)
nothing.id_order == part.index # makes sense

ids_from_seq = weights.Rook.from_iterable(part.geometry, 
                                          ids=part.index.tolist())
print((ids_from_seq.full()[0] == nothing.full()[0]).all())
# False, because of sorting in seq, surprising.

fips_ids_from_df = weights.Rook.from_dataframe(part, 
                                               ids=part.FIPS.tolist())
fips_ids_from_seq = weights.Rook.from_iterable(part.geometry, 
                                               ids=part.FIPS.tolist())
fips_ids_from_seq.id_order # is a list of strings, makes sense
fips_ids_from_df.id_order # is part.index, surprising if you assume these will be *exactly* the input ids, rather than a sanitized id
(fips_ids_from_seq.full()[0] == fips_ids_from_df.full()[0]).all() 
# False, because of sorting in seq, surprising.


fips_ido_from_df = weights.Rook.from_dataframe(part, 
                                               id_order=part.FIPS.tolist())
fips_from_reindex = weights.Rook.from_dataframe(part.set_index('FIPS'))
fips_idvar_from_df = weights.Rook.from_dataframe(part,
                                                 idVariable='FIPS')
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                               id_order=part.FIPS.tolist())
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                              ids=part.FIPS.tolist(), 
#                                              id_order = list(range(part.shape[0])))
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                               ids=part.FIPS.tolist(), 
#                                               id_order = part.index.tolist())
# fails! 
#fips_idvar_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                                 idVariable='FIPS')

fips_idvar_from_seq = weights.Rook.from_iterable(part.set_index('FIPS').geometry)
fips_idvar_from_seq.id_order # is a list of integers, makes sense
fips_ido_from_df.id_order # is a list of integers, surprising
fips_idvar_from_df.id_order # integers in the order of the sorted idVariable!

## We never set id_order, so all string-keyed weights would be lexsorted according to docs

(fips_idvar_from_seq.full()[0] == fips_ido_from_df.full()[0]).all() 
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_idvar_from_seq.full()[0] == fips_idvar_from_df.full()[0]).all() 
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_from_reindex.full()[0] == fips_ido_from_df.full()[0]).all()
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_ido_from_df.full()[0] == nothing.full()[0]).all()
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!

The issue is the surprise caused by sorting that forces the complexity up. now, you've always gotta set ids and id_order correctly and propagate them always.

ljwolf · 2020-01-03T17:17:04Z

I still think the "correct" solution here is to stop sorting string indices by default. When provided, take the index in the input order.

github-actions · 2020-03-04T00:03:15Z

Stale pull request message

BUG: Queen and Rook from_dataframe do not match docs

2fb85dd

sjsrey added the WIP Work in progress, do not merge. Discussion only. label Sep 3, 2019

sjsrey requested review from darribas, jGaboardi and ljwolf September 3, 2019 00:27

jGaboardi approved these changes Sep 3, 2019

View reviewed changes

jGaboardi added this to the next milestone Sep 3, 2019

jGaboardi mentioned this pull request Sep 3, 2019

Weights constructor behaving unexpectedly with dataframe IDs #183

Open

jGaboardi added the bug functionality that: returns invalid, erroneous, or meaningless results; or doesn't work at all. label Sep 3, 2019

BUG: DRY when setting ids from dataframe

7e6d37d

jGaboardi approved these changes Sep 3, 2019

View reviewed changes

ljwolf mentioned this pull request Jan 4, 2020

Rook & KNN take different id lists from the dataframe #102

Open

github-actions bot added the no-pr-activity label Mar 4, 2020

ljwolf mentioned this pull request Sep 29, 2021

Local Geary loses order pysal/esda#192

Closed

ljwolf added the weights label Oct 19, 2021

martinfleis mentioned this pull request Sep 3, 2022

API: Change the handling of IDs in from_dataframe constructors #477

Merged

martinfleis changed the base branch from master to main February 27, 2023 08:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Queen and Rook from_dataframe do not match docs #184

BUG: Queen and Rook from_dataframe do not match docs #184

sjsrey commented Sep 3, 2019

jGaboardi left a comment

sjsrey commented Sep 3, 2019

ljwolf commented Sep 3, 2019

ljwolf commented Sep 3, 2019

jGaboardi left a comment

ljwolf commented Sep 3, 2019 •

edited

Loading

ljwolf commented Jan 3, 2020

github-actions bot commented Mar 4, 2020

BUG: Queen and Rook from_dataframe do not match docs #184

Are you sure you want to change the base?

BUG: Queen and Rook from_dataframe do not match docs #184

Conversation

sjsrey commented Sep 3, 2019

jGaboardi left a comment

Choose a reason for hiding this comment

sjsrey commented Sep 3, 2019

ljwolf commented Sep 3, 2019

ljwolf commented Sep 3, 2019

jGaboardi left a comment

Choose a reason for hiding this comment

ljwolf commented Sep 3, 2019 • edited Loading

ljwolf commented Jan 3, 2020

github-actions bot commented Mar 4, 2020

ljwolf commented Sep 3, 2019 •

edited

Loading