Nans in generated address data #112

Open
taimans-git opened this issue Dec 9, 2024 · 3 comments

Comments


taimans-git commented Dec 9, 2024

Hello,

When I follow the generate_data notebook and create synthetic data, the generated fake values sometimes contain nan in parts of STREET_ADDRESS entities.

Nan values also appear in the example data file synth_dataset_v2.json, e.g.:

{
  "full_text": "I'll meet you at 25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237 after the concert.",
  "masked": "I'll meet you at {{STREET_ADDRESS}} after the concert.",
  "spans": [
    {
      "entity_type": "STREET_ADDRESS",
      "entity_value": "25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237",
      "start_position": 17,
      "end_position": 65
    }
  ],
  "template_id": 46,
  "metadata": null
}

Is this on purpose?

Thanks in advance :)

Contributor

omri374 commented Dec 9, 2024

This isn't on purpose. One of the generators (probably Faker?) generates nans sometimes.
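
The root cause is not confirmed in this thread. As a minimal sketch of one plausible mechanism, the snippet below shows how a missing field stored as a float NaN becomes the literal text "nan" when formatted into an address string, and how a small guard avoids it. The `record` dict and the `clean` helper are hypothetical and are not taken from the repository.

```python
import math

# Hypothetical record with a missing state/region field stored as float NaN,
# which is how many tabular loaders represent an empty cell.
record = {"city": "KNIVSTA", "state": float("nan"), "postcode": "18237"}

# Naive formatting turns the missing value into the literal text "nan".
naive = "{city}, {state} {postcode}".format(**record)


def clean(value):
    """Return '' for None or float NaN, otherwise the value as a string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)


# Guarded formatting drops the missing part instead of printing "nan".
parts = [clean(record["city"]) + ",", clean(record["state"]), clean(record["postcode"])]
guarded = " ".join(p for p in parts if p)

print(repr(naive))
print(repr(guarded))
```

If the generator does behave this way, the fix would belong wherever the address parts are joined, not in the templates themselves.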

Could you share a reproducible example? We'll look into it.

Author

taimans-git commented Dec 10, 2024

Thanks for your quick response!

Sure. For me, nans appear in the generated dataset when I run generate_data.ipynb unmodified, exactly as it exists in this repository.
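
To see which records are affected, a small scan over the dataset can help. This is a rough sketch that assumes the file is a JSON list of records shaped like the example quoted earlier in this issue (each with a `spans` list containing `entity_type` and `entity_value`). The function names and the file path are hypothetical.

```python
import json
import re

# Matches "nan" only as a standalone token, case-insensitively, so that
# words such as "Nantes" or "Fernando" are not flagged.
NAN_TOKEN = re.compile(r"\bnan\b", re.IGNORECASE)


def find_nan_spans(records):
    """Yield (record_index, entity_type, entity_value) for spans with a 'nan' token."""
    for i, rec in enumerate(records):
        for span in rec.get("spans") or []:
            value = str(span.get("entity_value", ""))
            if NAN_TOKEN.search(value):
                yield i, span.get("entity_type"), value


def scan_file(path):
    """Load a JSON list of records from `path` and return all flagged spans."""
    with open(path, encoding="utf-8") as f:
        return list(find_nan_spans(json.load(f)))


# The record quoted in this issue, reduced to the fields the scan needs.
sample = [{
    "full_text": "I'll meet you at 25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237 after the concert.",
    "spans": [{
        "entity_type": "STREET_ADDRESS",
        "entity_value": "25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237",
        "start_position": 17,
        "end_position": 65,
    }],
}]
hits = list(find_nan_spans(sample))
```

Calling `scan_file("synth_dataset_v2.json")` (path assumed) would apply the same check to the whole file. Note that a token match can also flag a legitimate value such as the given name "Nan", so the hits are candidates to review, not confirmed errors.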

Additionally, I have a general question regarding location detection. It seems to me that achieving a high recall rate in identifying complete address formats is quite complex. In addition to street names and cities in various languages, addresses often include building numbers, ZIP codes, and postcodes, which can be challenging to identify due to their frequently inconsistent formats.
Do you have any recommendations on how to improve the detection score for these address components?

Contributor

omri374 commented Dec 19, 2024

Address detection is quite challenging. Some NER models do a good job of detecting addresses as LOCATION. Because address formats are so irregular, rule-based approaches are hard to apply in general.

If you have a priori knowledge of the types of addresses you expect in your data (e.g., which countries they come from), you could limit the formats by creating custom recognizers: for example, one that detects only 5-digit ZIP codes and ignores other postcode formats that are common in other countries.
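
As a minimal sketch of the narrow rule described above, the snippet below uses only Python's standard `re` module to match standalone 5-digit codes. The names `ZIP_5` and `find_zip_codes` are hypothetical, and the same regex could be supplied to a pattern-based custom recognizer.

```python
import re

# Exactly five digits; the lookarounds reject matches that are part of a
# longer digit run (e.g. a phone number or a 7-digit identifier).
ZIP_5 = re.compile(r"(?<!\d)\d{5}(?!\d)")


def find_zip_codes(text):
    """Return (start, end, value) for each standalone 5-digit code in `text`."""
    return [(m.start(), m.end(), m.group()) for m in ZIP_5.finditer(text)]


print(find_zip_codes("KNIVSTA, 18237"))
print(find_zip_codes("London SW1A 1AA"))
```

A bare 5-digit rule is deliberately strict about format but not about meaning: in the address quoted in this issue it would also match the building number 25615. Combining the pattern with context, such as a preceding city name or its position at the end of the address, is one way to reduce such false positives.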
