Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Queen and Rook from_dataframe do not match docs #184

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

sjsrey
Copy link
Member

@sjsrey sjsrey commented Sep 3, 2019

This address #183

@sjsrey sjsrey added the WIP Work in progress, do not merge. Discussion only. label Sep 3, 2019
Copy link
Member

@jGaboardi jGaboardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I was wondering about the Rook.from_dataframe and Queen.from_dataframe not matching.

As a side note, would it be possible to source the internals for the Rook and Queen classes from one set of methods? As is stands, aren't the Rook and Queen class methods identical?

@jGaboardi jGaboardi added this to the next milestone Sep 3, 2019
@jGaboardi jGaboardi added the bug functionality that: returns invalid, erroneous, or meaningless results; or doesn't work at all. label Sep 3, 2019
@sjsrey
Copy link
Member Author

sjsrey commented Sep 3, 2019

Nice. I was wondering about the Rook.from_dataframe and Queen.from_dataframe not matching.

As a side note, would it be possible to source the internals for the Rook and Queen classes from one set of methods? As is stands, aren't the Rook and Queen class methods identical?

We need to hear from @ljwolf before doing much more as he was the original author and I'm not sure the changes reflect his original intent.

@ljwolf
Copy link
Member

ljwolf commented Sep 3, 2019

Doesn't this fail if idVariable is none but ids is passed?

@ljwolf
Copy link
Member

ljwolf commented Sep 3, 2019

In general, this works better. The behavior I was going for was (0) use the index if nothing else is provided, (1) use idvariable to build IDs if it was provided, (2) use IDs if provided and, (3) keep the order of the now-resolved IDs if id_order was not provided

This was awkward because of the id_order behavior and pysal's silent sorting of the input weights if you don't force their order. Id_order basically always needs to be set to either range(0,n) (to avoid silent lexicographic sorting of the IDs) or the user-specified order that's different from their input dataframe.

Splitting id_order from the order in which the inputs are provided adds so much complexity... Sorry I missed that twice...

Copy link
Member

@jGaboardi jGaboardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ljwolf
Copy link
Member

ljwolf commented Sep 3, 2019

Here's a bit of a test matrix.

I still think the main issue is that PySAL separates the ids from how the ids are ordered. We talked about this back in GSOC 2017. It should be the responsibility of stuff outside the class to make sure the input data is ordered correctly, because the worst surprise is that your weights are silently incorrect w.r.t. the ordering of your input data. By default, PySAL does this because (at the lowest level) weight keys are sorted lexicographically unless forced to retain the original ordering (docstring, location in code)

Sorting weight keys by default will make the weights out of alignment with the input dataframe unless the user knows this and presorts their data or happens to have it sorted by default. This will be a silent error, since the weights will still be of the right shape, just not representing the right linkages.

I raised this back in my GSOC, but we decided to keep it around, rather than remove it. I understand how this arose, but I still think the "right" fix is to just stop sorting things on input. Things become dramatically simpler, then:

  1. take the ids from the index of the dataframe. If a user wants to use a variable, make them set that as the index on the dataframe explicitly
  2. for collections without indices, use the ids in the order it's provided or default to range(0,n).

Quite a few PRs have attempted to address this inconsistency (#137, pysal#988, pysal#951 (refresh of pysal#922), pysal#946), and this solution, while a lot better than current, is still inconsistent, just within the Rook class:

import geopandas, numpy
from libpysal import weights, examples

full = geopandas.read_file(examples.get_path('south.shp'))
numpy.random.seed(11121)
part = full.sample(frac=.1)

nothing = weights.Rook.from_dataframe(part)
nothing.id_order == part.index # makes sense

ids_from_seq = weights.Rook.from_iterable(part.geometry, 
                                          ids=part.index.tolist())
print((ids_from_seq.full()[0] == nothing.full()[0]).all())
# False, because of sorting in seq, surprising.

fips_ids_from_df = weights.Rook.from_dataframe(part, 
                                               ids=part.FIPS.tolist())
fips_ids_from_seq = weights.Rook.from_iterable(part.geometry, 
                                               ids=part.FIPS.tolist())
fips_ids_from_seq.id_order # is a list of strings, makes sense
fips_ids_from_df.id_order # is part.index, surprising if you assume these will be *exactly* the input ids, rather than a sanitized id
(fips_ids_from_seq.full()[0] == fips_ids_from_df.full()[0]).all() 
# False, because of sorting in seq, surprising.


fips_ido_from_df = weights.Rook.from_dataframe(part, 
                                               id_order=part.FIPS.tolist())
fips_from_reindex = weights.Rook.from_dataframe(part.set_index('FIPS'))
fips_idvar_from_df = weights.Rook.from_dataframe(part,
                                                 idVariable='FIPS')
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                               id_order=part.FIPS.tolist())
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                              ids=part.FIPS.tolist(), 
#                                              id_order = list(range(part.shape[0])))
# fails!
#fips_ido_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                               ids=part.FIPS.tolist(), 
#                                               id_order = part.index.tolist())
# fails! 
#fips_idvar_from_seq = weights.Rook.from_iterable(part.geometry, 
#                                                 idVariable='FIPS')

fips_idvar_from_seq = weights.Rook.from_iterable(part.set_index('FIPS').geometry)
fips_idvar_from_seq.id_order # is a list of integers, makes sense
fips_ido_from_df.id_order # is a list of integers, surprising
fips_idvar_from_df.id_order # integers in the order of the sorted idVariable!

## We never set id_order, so all string-keyed weights would be lexsorted according to docs

(fips_idvar_from_seq.full()[0] == fips_ido_from_df.full()[0]).all() 
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_idvar_from_seq.full()[0] == fips_idvar_from_df.full()[0]).all() 
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_from_reindex.full()[0] == fips_ido_from_df.full()[0]).all()
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!
(fips_ido_from_df.full()[0] == nothing.full()[0]).all()
# True, because now the order is respected, surprising iff you assume keys are always lexsorted!

The issue is the surprise caused by sorting that forces the complexity up. now, you've always gotta set ids and id_order correctly and propagate them always.

@ljwolf
Copy link
Member

ljwolf commented Jan 3, 2020

I still think the "correct" solution here is to stop sorting string indices by default. When provided, take the index in the input order.

@github-actions
Copy link

github-actions bot commented Mar 4, 2020

Stale pull request message

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug functionality that: returns invalid, erroneous, or meaningless results; or doesn't work at all. no-pr-activity weights WIP Work in progress, do not merge. Discussion only.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants