Proposal for Groundrules
eli knaap edited this page Mar 1, 2019 · 1 revision
One way that Python projects help smooth the user experience is to identify methods/conventions that all classes or estimators must have. For instance, the most well-known version of this is done by scikit-learn, but others exist in other packages as well, such as statsmodels or even spreg.
Usually, this behavior is achieved through class inheritance, but it can be achieved much more generally by ensuring semantic consistency across the library. What I hope to do here is to provide a set of conventions, attributes, and methods that I think would ensure our user experience is consistent, simple to understand, and easy to integrate with other software packages. I'd also like to make `sklearn` a dependency.
- Vocabulary & ground rules
  - spatially-determined clusters are `regions`, which are composed of collections of `observations`.
  - a polygon that encloses all observations within a region is the `hull` for that region.
  - the individual numbers that map observations into regions are `labels`.
    - all `labels` are `int` types and start at zero.
    - observations that are disconnected or that are not assigned to regions should be labeled `-1`
  - estimated quantities that are stored on the object should end with a `_` character.
    - for instance, the assignments made by the regionalizer should be called `labels_`
    - the data used for the regionalization should not be cached on the object, but derived properties (like an affinity matrix) can be.
  - return types should be flat numpy arrays.
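The label conventions above can be illustrated with a small sketch. The helper `to_labels` and the region names are hypothetical, made up only for this example; the point is the shape of the result: a flat integer numpy array, starting at zero, with `-1` for anything unassigned, stored under a name ending in `_`.

```python
import numpy as np


def to_labels(raw):
    """Map arbitrary region identifiers to int labels starting at zero.

    Hypothetical helper: entries that are None (unassigned or disconnected
    observations) are labeled -1.
    """
    names = sorted({r for r in raw if r is not None})
    lookup = {name: i for i, name in enumerate(names)}
    return np.array([lookup.get(r, -1) for r in raw], dtype=int)


# Made-up assignments for five observations; the third one is unassigned.
raw = ["north", "south", None, "north", "east"]

# Estimated quantities carry a trailing underscore.
labels_ = to_labels(raw)
```

Here `labels_` is a one-dimensional array with one entry per observation, so it can be attached directly as a column of a dataframe.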
- Borrowing from `sklearn`, the methods for all classes should probably be `class Regionalizer(sklearn.base.BaseEstimator, sklearn.base.ClusterMixin)` and follow a similar pattern:
  - `__init__`: the initialization of the estimator
    - only set attributes or configuration flags.
    - the number of regions to find should be called `n_regions`
      - when wanting a max-p-type solution--the largest number of feasible regions-- `n_clusters=np.inf` should be used.
      - when wanting an optimal number of clusters given a fit metric--when the number of clusters should be learned from the data-- `n_clusters=None` should be used.
    - the connectivity matrix should be given as a `connectivity` argument, and should focus on scipy sparse matrices. We can build the `W` behind the scenes, but this lowers the barrier to folks outside of PySAL (e.g. networkx/osmnx)
  - `fit(X,y=None)`:
    - this should ignore `y`. This is a convention in `sklearn`, but I'm open to just taking `X`.
    - this might clean up data and then pass it along to a function designed to regionalize on clean data. See dbscan for examples.
    - this should return `self`
  - `fit_transform(X)`
    - this should be implemented as `return self.fit(X).labels_`
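A minimal sketch of what an estimator following this pattern might look like is given below. Several things in it are assumptions rather than part of the proposal: `_placeholder_regions` is a hypothetical stand-in for a real regionalization algorithm (it simply treats each connected component of the connectivity graph as a region and ignores `n_clusters`), the count parameter is spelled `n_clusters` as in the sub-bullets even though the list also proposes `n_regions`, and a non-sparse `connectivity` is assumed to be a `W`-like object exposing a `.sparse` attribute.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components
from sklearn.base import BaseEstimator, ClusterMixin


def _placeholder_regions(adjacency):
    """Hypothetical stand-in for a real regionalization algorithm.

    Each connected component with more than one member becomes a region;
    isolated observations are labeled -1.
    """
    n_components, component = connected_components(adjacency, directed=False)
    sizes = np.bincount(component, minlength=n_components)
    keep = sizes[component] > 1
    labels = np.full(component.shape[0], -1, dtype=int)
    # Renumber the surviving components so labels start at zero.
    _, renumbered = np.unique(component[keep], return_inverse=True)
    labels[keep] = renumbered
    return labels


class Regionalizer(BaseEstimator, ClusterMixin):
    def __init__(self, n_clusters=None, connectivity=None):
        # Only store configuration here; no validation or computation.
        self.n_clusters = n_clusters
        self.connectivity = connectivity

    def fit(self, X, y=None):
        # y is accepted for sklearn compatibility and ignored.
        X = np.asarray(X)
        conn = self.connectivity
        if not sparse.issparse(conn):
            # Assumption: a W-like object that exposes a `.sparse` attribute.
            conn = conn.sparse
        conn = sparse.csr_matrix(conn)
        if conn.shape[0] != X.shape[0]:
            raise ValueError("connectivity and X disagree on the number of observations")
        # Store only estimated quantities (trailing underscore), not X itself.
        self.labels_ = _placeholder_regions(conn)
        return self

    def fit_transform(self, X):
        return self.fit(X).labels_


# Five observations: 0-1 and 2-3 are neighbors, 4 is disconnected.
rows = np.array([0, 1, 2, 3])
cols = np.array([1, 0, 3, 2])
adjacency = sparse.csr_matrix((np.ones(4), (rows, cols)), shape=(5, 5))
X = np.zeros((5, 2))  # attribute values are unused by the placeholder

model = Regionalizer(connectivity=adjacency)
labels = model.fit_transform(X)
```

Because `__init__` only stores its arguments, `get_params` and `set_params` from `BaseEstimator` work without extra code, which is the main practical payoff of following the `sklearn` convention.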
- Possible "new" concepts for regionalization? open for definition (do not implement on classes)
  - not needed: `update(new_X)`
    - modify existing partition to accommodate new data.
    - return labels for new data
  - `assign(new_X)`: implement in a utility, because this does not pertain to the clustering algorithm directly.
    - assign observations to regions using the region hull
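One way the `assign` utility could work is sketched below, under some loudly stated assumptions: observations are represented as 2-D points (for example polygon centroids), the hull is taken to be the convex hull of each region's points, and the signature `assign(new_X, X, labels)` is hypothetical. Point-in-hull testing is done with `scipy.spatial.Delaunay.find_simplex`, which returns `-1` for points outside the triangulated hull.

```python
import numpy as np
from scipy.spatial import Delaunay


def assign(new_X, X, labels):
    """Label each row of new_X with the region whose convex hull contains it.

    Hypothetical utility: rows of new_X that fall inside no hull get -1.
    If hulls overlap, the lowest-numbered region wins.
    """
    new_X = np.asarray(new_X, dtype=float)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = np.full(new_X.shape[0], -1, dtype=int)
    for region in np.unique(labels[labels >= 0]):
        members = X[labels == region]
        if members.shape[0] < 3:
            continue  # too few points to form a 2-D hull
        # Note: Delaunay raises an error if the members are all collinear.
        hull = Delaunay(members)
        inside = hull.find_simplex(new_X) >= 0
        out[inside & (out == -1)] = region
    return out


# Two square regions plus one unassigned observation.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [5, 5], [6, 5], [5, 6], [6, 6],
              [9, 0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, -1])

new_X = np.array([[0.2, 0.5], [5.7, 5.4], [3.0, 3.0]])
new_labels = assign(new_X, X, labels)
```

Keeping this as a free function rather than a method matches the note above: it needs only the fitted labels and the geometry, not the clustering algorithm itself.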