
Proposal for Groundrules

eli knaap edited this page Mar 1, 2019 · 1 revision

One way that Python projects smooth the user experience is to identify methods and conventions that all classes or estimators must implement. The best-known example is scikit-learn, but similar conventions exist in other packages, such as statsmodels or even spreg.

Usually, this behavior is achieved through class inheritance, but it can also be achieved more generally by ensuring semantic consistency across the library. What I hope to do here is provide a set of conventions, attributes, and methods that I think would keep our user experience consistent, simple to understand, and easy to integrate with other software packages. I'd also like to make sklearn a dependency.

  1. Vocabulary & ground rules
    • spatially-determined clusters are regions, which are composed of collections of observations.
    • a polygon that encloses all observations within a region is the hull for that region.
    • the individual numbers that map observations into regions are labels.
      • all labels are int types and start at zero.
      • observations that are disconnected or that are not assigned to regions should be labeled -1
    • estimated quantities that are stored on the object should end with a _ character.
      • for instance, the assignments made by the regionalizer should be called labels_
      • the data used for the regionalization should not be cached on the object, but derived properties (like an affinity matrix) can be.
    • return types should be flat numpy arrays.
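A minimal illustration of these label conventions; the array values here are invented purely for demonstration:

```python
import numpy as np

# A hypothetical fitted result: six observations grouped into two
# regions, plus one disconnected observation labeled -1.
labels_ = np.array([0, 0, 1, 1, 0, -1])

# labels are a flat array of int type
assert labels_.ndim == 1
assert labels_.dtype.kind == "i"

# region labels start at zero; -1 marks unassigned observations
regions = np.unique(labels_[labels_ >= 0])
assert regions.min() == 0
```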
  2. Borrowing from sklearn, the methods for all classes should probably be class Regionalizer(sklearn.base.BaseEstimator, sklearn.base.ClusterMixin) and follow a similar pattern:
    • __init__: the initialization of the estimator
      • only set attributes or configuration flags.
      • the number of regions to find should be called n_regions
        • when wanting a max-p-type solution (the largest number of feasible regions), n_regions=np.inf should be used.
        • when wanting an optimal number of regions given a fit metric (i.e., when the number of regions should be learned from the data), n_regions=None should be used.
    • the connectivity matrix should be given as a connectivity argument and should focus on scipy sparse matrices. We can build the W behind the scenes, but accepting sparse matrices lowers the barrier for folks outside of PySAL (e.g. networkx/osmnx users)
    • fit(X,y=None):
      • this should ignore y. This is a convention in sklearn, but I'm open to just taking X.
      • this might clean up data and then pass it along to a function designed to regionalize on clean data. See dbscan for examples.
      • this should return self
    • fit_predict(X, y=None)
      • this should be implemented as return self.fit(X).labels_ (this matches sklearn's ClusterMixin, which provides exactly this method; fit_transform is reserved for transformers that return transformed data)
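Assuming the conventions above, a skeleton of the proposed interface might look like the following. The round-robin assignment inside fit is only a placeholder so the sketch runs; it is not a real regionalization algorithm, and the connectivity argument is stored but unused here:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class Regionalizer(BaseEstimator, ClusterMixin):
    """Sketch of the proposed estimator interface (not a real algorithm)."""

    def __init__(self, n_regions=2, connectivity=None):
        # __init__ only stores configuration; no computation happens here
        self.n_regions = n_regions
        self.connectivity = connectivity  # intended to be a scipy sparse matrix

    def fit(self, X, y=None):
        # y is ignored by sklearn convention. A real implementation would
        # regionalize X subject to self.connectivity; as a stand-in we
        # assign observations to regions round-robin.
        X = np.asarray(X)
        self.labels_ = np.arange(len(X)) % self.n_regions
        return self

    def fit_predict(self, X, y=None):
        # the convention proposed above: fit, then return the labels
        return self.fit(X).labels_
```

Because the estimator inherits from BaseEstimator and only sets attributes in __init__, utilities like get_params/set_params and cloning work for free.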
  3. Possible "new" concepts for regionalization, open for discussion (do not implement on classes yet):
    • update(new_X) (not needed)
      • modify existing partition to accommodate new data.
      • return labels for new data
    • assign(new_X): implement as a utility function, because this does not pertain to the clustering algorithm directly.
      • assign observations to regions using the region hull
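A sketch of what such an assign utility could look like. For simplicity this stand-in represents each region's hull by an axis-aligned bounding box (xmin, ymin, xmax, ymax); a real implementation would test containment in the actual polygon hull (e.g. with shapely):

```python
import numpy as np


def assign(new_X, hulls):
    """Assign each row of new_X (n x 2 coordinates) to the first region
    whose hull contains it; points outside every hull get -1."""
    new_X = np.asarray(new_X)
    labels = np.full(len(new_X), -1, dtype=int)  # -1 = unassigned
    for region, (xmin, ymin, xmax, ymax) in enumerate(hulls):
        inside = (
            (new_X[:, 0] >= xmin) & (new_X[:, 0] <= xmax)
            & (new_X[:, 1] >= ymin) & (new_X[:, 1] <= ymax)
        )
        # only fill points that are still unassigned
        labels[(labels == -1) & inside] = region
    return labels
```

This keeps the return type consistent with the ground rules above: a flat integer array with -1 for observations that fall outside every region.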