Nans in generated address data #112

taimans-git · 2024-12-09T14:07:48Z

Hello,

When I follow the generate_data notebook and create synthetic data, the generated fake values sometimes contain nan values in parts of STREET_ADDRESS entities.

Nan values also appear in the example data file synth_dataset_v2.json, e.g.

{
  "full_text": "I'll meet you at 25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237 after the concert.",
  "masked": "I'll meet you at {{STREET_ADDRESS}} after the concert.",
  "spans": [
    {
      "entity_type": "STREET_ADDRESS",
      "entity_value": "25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237",
      "start_position": 17,
      "end_position": 65
    }
  ],
  "template_id": 46,
  "metadata": null
}

Is this on purpose?

Thanks in advance :)

omri374 · 2024-12-09T15:20:58Z

This isn't on purpose. One of the generators (probably Faker?) sometimes generates nans.

Is it possible to create a reproducible example? We'll look into it.

taimans-git · 2024-12-10T16:20:59Z

Thanks for your quick response!

Sure. For me, nans appear in the generated dataset when I run generate_data.ipynb exactly as it exists in this repository, without modifications.
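
To make the report easier to reproduce, the affected records can be located with a small check like the one below. This is only a sketch: the helper name `find_nan_spans` is hypothetical, and it assumes each sample is a dict shaped like the record quoted above (a `spans` list whose items carry `entity_type` and `entity_value`).

```python
import re

# Match "nan" only as a standalone token (case-insensitive), so that
# words such as "Nantes" or "banana" are not flagged.
NAN_TOKEN = re.compile(r"\bnan\b", re.IGNORECASE)


def find_nan_spans(samples):
    """Return (sample_index, entity_type, entity_value) for every span
    whose value contains a standalone 'nan' token.

    Hypothetical helper; assumes the sample layout quoted in this thread.
    """
    hits = []
    for i, sample in enumerate(samples):
        # "spans" may be missing or null, so fall back to an empty list.
        for span in sample.get("spans") or []:
            value = str(span.get("entity_value", ""))
            if NAN_TOKEN.search(value):
                hits.append((i, span.get("entity_type"), value))
    return hits


# The record quoted earlier in this thread, used as a tiny in-memory dataset.
samples = [
    {
        "full_text": "I'll meet you at 25615 Tawastintie 72 Apt. 004\n"
                     "KNIVSTA, nan 18237 after the concert.",
        "masked": "I'll meet you at {{STREET_ADDRESS}} after the concert.",
        "spans": [
            {
                "entity_type": "STREET_ADDRESS",
                "entity_value": "25615 Tawastintie 72 Apt. 004\nKNIVSTA, nan 18237",
                "start_position": 17,
                "end_position": 65,
            }
        ],
        "template_id": 46,
        "metadata": None,
    }
]

hits = find_nan_spans(samples)
for index, entity_type, value in hits:
    print(index, entity_type, repr(value))
```

If synth_dataset_v2.json is a JSON array of such records (an assumption, not verified here), the same function could be applied to the result of `json.load` on that file.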

Additionally, I have a general question regarding location detection. It seems to me that achieving a high recall rate in identifying complete address formats is quite complex. In addition to street names and cities in various languages, addresses often include building numbers, ZIP codes, and postcodes, which can be challenging to identify due to their frequently inconsistent formats.
Do you have any recommendations on how to improve the detection score for these address components?

omri374 · 2024-12-19T08:13:10Z

Address detection is quite challenging. Some NER models do a good job of detecting addresses as LOCATION. Rule-based approaches, on the other hand, are hard to get right because address formats are so irregular.

If you have a priori knowledge of the types of addresses you expect in your data (e.g. which countries they come from), you could limit the formats by creating custom recognizers, for example one that detects only 5-digit ZIP codes and ignores the other formats that are common in other countries.
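
As a rough illustration of the 5-digit ZIP code idea, here is a minimal sketch using only Python's `re` module. The pattern and the function name `find_zip_codes` are assumptions made for this example. If the recognizers in question are Presidio's, a regex like this would typically be wrapped in a pattern-based recognizer (`PatternRecognizer` in presidio_analyzer) rather than used directly.

```python
import re

# Exactly five digits that are not part of a longer digit run.
# Hypothetical pattern for illustration; real data may need more context.
FIVE_DIGIT_ZIP = re.compile(r"(?<!\d)\d{5}(?!\d)")


def find_zip_codes(text):
    """Return (start, end, value) for every 5-digit candidate in the text."""
    return [(m.start(), m.end(), m.group()) for m in FIVE_DIGIT_ZIP.finditer(text)]


text = (
    "I'll meet you at 25615 Tawastintie 72 Apt. 004\n"
    "KNIVSTA, nan 18237 after the concert."
)

for start, end, value in find_zip_codes(text):
    print(start, end, value)
```

On the sample from this thread, the pattern matches both 18237 and the leading 25615, which is a building number rather than a postcode. A bare digit pattern is therefore only a starting point; context words or a low base score that is raised by nearby address cues would likely be needed to keep precision acceptable.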
