Nans in generated address data #112
Comments
This isn't on purpose. One of the generators (probably Faker?) sometimes generates NaNs. Could you create a reproducible example? We'll look into it.
Thanks for your quick response! Sure: for me, NaNs show up in the generated dataset when I run generate_data.ipynb exactly as it exists in this repository. I also have a general question about location detection. It seems to me that achieving high recall on complete addresses is quite hard. Besides street names and cities in various languages, addresses often include building numbers and ZIP or postal codes, which are difficult to identify because their formats vary so much.
Address detection is quite challenging. Some NER models do a good job of detecting addresses as LOCATION. Because address formats are so irregular, rule-based approaches are hard to get right. If you know in advance which types of addresses to expect in your data (e.g. which countries they come from), you can narrow the formats by creating custom recognizers, for example one that detects only 5-digit ZIP codes and ignores the postal code formats common in other countries.
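To make the 5-digit ZIP code idea concrete, here is a minimal sketch using only Python's standard `re` module. It is not the recognizer API of any particular library; the names `US_ZIP` and `find_zip_codes` are made up for illustration, and the pattern assumes plain 5-digit codes with no ZIP+4 suffix.

```python
import re

# Exactly five digits that are not part of a longer run of digits.
# Assumption: only plain 5-digit ZIP codes are expected in the data.
US_ZIP = re.compile(r"(?<!\d)\d{5}(?!\d)")


def find_zip_codes(text):
    """Return (match, start, end) for every 5-digit candidate in text."""
    return [(m.group(), m.start(), m.end()) for m in US_ZIP.finditer(text)]


print(find_zip_codes("KNIVSTA, nan 18237"))
print(find_zip_codes("25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237"))
```

On the full address from this issue, the pattern also picks up the 5-digit building number 25615, which shows why a pattern like this usually needs surrounding context (or a country restriction) to be useful.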
Hello,
When I follow the generate_data notebook to create synthetic data, the generated fake values sometimes contain "nan" in parts of STREET_ADDRESS entities.
NaN values also appear in the example data file synth_dataset_v2.json, e.g.:
{
"full_text": "I'll meet you at 25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237 after the concert.",
"masked": "I'll meet you at {{STREET_ADDRESS}} after the concert.",
"spans": [
{
"entity_type": "STREET_ADDRESS",
"entity_value": "25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237",
"start_position": 17,
"end_position": 65
}
],
"template_id": 46,
"metadata": null
}
Is this on purpose?
Thanks in advance :)
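For anyone who wants to check a generated dataset for the same problem, a minimal sketch follows. It assumes the records are already loaded as a list of dicts shaped like the sample above; `find_nan_spans` is a hypothetical helper, not part of this repository.

```python
import re

# "nan" as a standalone token, case-insensitive. The word boundaries keep
# place names such as "Nanaimo" from being flagged.
NAN_TOKEN = re.compile(r"\bnan\b", re.IGNORECASE)


def find_nan_spans(records):
    """Return (record_index, entity_type, entity_value) for affected spans."""
    hits = []
    for i, record in enumerate(records):
        for span in record.get("spans") or []:
            value = span.get("entity_value")
            if isinstance(value, str) and NAN_TOKEN.search(value):
                hits.append((i, span.get("entity_type"), value))
    return hits


records = [
    {
        "full_text": "I'll meet you at 25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237 after the concert.",
        "spans": [
            {
                "entity_type": "STREET_ADDRESS",
                "entity_value": "25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237",
                "start_position": 17,
                "end_position": 65,
            }
        ],
    },
    # Made-up clean record, included only to show that it is not flagged.
    {
        "full_text": "She moved to Nanaimo.",
        "spans": [
            {
                "entity_type": "CITY",
                "entity_value": "Nanaimo",
                "start_position": 13,
                "end_position": 20,
            }
        ],
    },
]

for index, entity_type, value in find_nan_spans(records):
    print(index, entity_type, repr(value))
```

Only the first record is reported. The same function could be run over the contents of synth_dataset_v2.json after loading it with `json.load`, assuming the file holds a list of such records.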