-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src/visions/types/url.py passes non URLs #185
Comments
Hey @leapingllamas - thanks for your issue report. I'm still testing but this looks like both of those issues are a regression introduced here when we switched to backend specific implementations (pandas, numpy, python stdlib, etc...). If this is a blocker for you, try wrapping your test values into a pandas series and everything will work correctly again - i.e. >>> import pandas as pd
>>> x = pd.Series(urlparse(url) for url in urls)
>>> x in visions.URL
True The second issue >>> urlparse('junk') in visions.URL
True
>>> Is more interesting (and two separate questions). The first: technically this isn't a supported operation as urlparse('junk') isn't a sequence, however, the multidispatch library we use has dispatch resolution logic which looks for alternative methods which succeed on the input type.
It's basically finding the bugged python implementation above and caching it as an implementation on ParseResults. Probably something we should be aware of @sbrugman. The second question is philosophical - what is a URL? That's going to vary based on individual use-case, we are using ParseResult objects as our standard representation. There's no way for us to know whether you're using relative URLs without explicit netloc or scheme so we take your word for it by default, you gave us a ParseResult and we believe you. That being said, visions will handle raw strings and perform the parsing for you if you ask for it. The implementation we use actually ends up being basically the same as you proposed (with a few parallelization improvements for pandas) - you can find it here. In order to get there you have to go through a typeset though. This sequence urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling'] Is formally a sequence of strings which can be coerced to a URL (again, a URL is a ParseResult). Collections of types are combined into TypeSets connected through relations. >>> vis.URL.get_relations()
[IdentityRelation(related_type=Object, type=None, relationship=lambda, transformer=identity_transform, inferential=False), InferenceRelation(related_type=String, type=None, relationship=default_relation, transformer=identity_transform, inferential=True)] Although you can call those transformers directly from a type, it can be a little cumbersome as they are intended to be used from a typeset like this: >>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
>>> typeset = vis.CompleteSet()
>>> typeset.infer(urls)
((ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment=''), ParseResult(scheme='https', netloc='github.com', path='/pandas-profiling/pandas-profiling', params='', query='', fragment='')), [Generic, Object, String, URL], {}) The first result is the inferred representation for your URLs, the second is the sequence of Types used to infer the representation, and the third is a cached state you mostly won't need. |
I meant to add, one of the nice things about from typing import Any, Sequence
from multimethod import multimethod
from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType
class MyURL(VisionsBaseType):
"""**Url** implementation of :class:`visions.types.type.VisionsBaseType`.
Examples:
>>> from urllib.parse import urlparse
>>> urls = ['http://www.cwi.nl:80/%7Eguido/Python.html', 'https://github.com/pandas-profiling/pandas-profiling']
>>> x = [urlparse(url) for url in urls]
>>> x in visions.URL
True
"""
@staticmethod
def get_relations() -> Sequence[TypeRelation]:
relations = [
IdentityRelation(String),
]
return relations
@staticmethod
@multimethod
def contains_op(item: Any, state: dict) -> bool:
try:
result = (urlparse(x) for x in item)
return all(all([res.scheme, res.netloc]) for res in result)
except:
return False Then, if you want to use it in a typeset all you have to do is typeset = vis.StandardSet() + MyUrl |
Bug fix is available at #187 @leapingllamas. If you are interested in contributing your own type implementations let me know and I'd be happy to get you started. |
src/visions/types/url.py does not correctly validate URLs.
First, the example code (lines 14--19) from the docs do not return True:
Second, non URLs are passing:
The code should probably check something like the following for each element of x:
Finally, and this is a suggested enhancement, I think the behavior would be more useful if it handled raw strings and did the parsing internally without the caller having to supply a parser:
The text was updated successfully, but these errors were encountered: