parser: Ignore UTF-8 BOM characters #700

Merged: 1 commit, Aug 14, 2024

@vimpostor (Contributor) commented Aug 14, 2024

Some particularly badly written ics files start with the UTF-8 BOM character.

Such files do not strictly conform to the spec, but they are easy to support.
Before this patch, icalendar would choke on these files with:

Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/icalendar/parser.py", line 339, in parts
    validate_token(name)
  File "/usr/lib/python3.12/site-packages/icalendar/parser.py", line 128, in validate_token
    raise ValueError(name)
ValueError: BEGIN

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/icalendar", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/icalendar/cli.py", line 82, in main
    calendar = Calendar.from_ical(f.read())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/icalendar/cal.py", line 331, in from_ical
    name, params, vals = line.parts()
                         ^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/icalendar/parser.py", line 353, in parts
    raise ValueError(
ValueError: Content line could not be parsed into parts: 'BEGIN:VCALENDAR': BEGIN
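
The rejected name in the message above looks like a plain BEGIN because the BOM is invisible when printed. A minimal sketch of the effect, using only the standard library: `TOKEN` and `is_valid_token` are hypothetical stand-ins, not icalendar's actual validator, and only assume that a property name may contain word characters, dots, and hyphens.

```python
import re

# Hypothetical token pattern for illustration; icalendar's real check may differ.
TOKEN = re.compile(r"[\w.-]+")


def is_valid_token(name: str) -> bool:
    """Return True if the whole name consists of allowed token characters."""
    return TOKEN.fullmatch(name) is not None


# First line of an ics file that was written with a UTF-8 BOM.
raw = b"\xef\xbb\xbfBEGIN:VCALENDAR\r\n"

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start of the text.
line = raw.decode("utf-8").rstrip("\r\n")
name = line.split(":", 1)[0]

print(name)                  # displays like BEGIN, the BOM is invisible
print(repr(name))            # the escaped form reveals the leading U+FEFF
print(is_valid_token(name))  # the BOM is not a word character
```

Printing `repr()` of a suspicious value is a quick way to spot an invisible leading character when an error message looks self-contradictory.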

After this patch, icalendar is able to parse these ics files.

Let me know if you would prefer a different solution.


📚 Documentation preview 📚: https://icalendar--700.org.readthedocs.build/

The zero-width no-break space U+FEFF (also known as the BOM) is not
supposed to occur in ics files. However, some particularly bad ics file
writers generate calendars with a BOM at the beginning.

Decoding with the encoding set to "utf-8-sig" is the easiest way to deal
with this [0], particularly because:
- If the file is UTF-8 with a BOM, then "utf-8" leaves the BOM in the
  content, while "utf-8-sig" transparently removes the BOM
- If the file is standard UTF-8 without a BOM, then "utf-8-sig" decodes
  the same as "utf-8"

[0] https://stackoverflow.com/a/44573867
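
The two cases listed above can be checked directly with Python's built-in codecs. This is a minimal sketch, not the code from this patch: `decode_ics` is a hypothetical helper name, and the sample calendar bytes are made up for illustration.

```python
import codecs

# Made-up sample data: the same calendar with and without a leading BOM.
plain = b"BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n"
with_bom = codecs.BOM_UTF8 + plain


def decode_ics(data: bytes) -> str:
    """Hypothetical helper: decode ics bytes, dropping a leading BOM if present."""
    return data.decode("utf-8-sig")


# Case 1: with a BOM, "utf-8" keeps U+FEFF while "utf-8-sig" removes it.
kept = with_bom.decode("utf-8")
stripped = decode_ics(with_bom)
print(repr(kept[:6]))      # starts with the escaped BOM character
print(repr(stripped[:5]))  # starts directly with BEGIN

# Case 2: without a BOM, both encodings give the same text.
print(decode_ics(plain) == plain.decode("utf-8"))
```

When reading from disk in text mode, the same encoding name can be passed to the built-in `open`, as in `open(path, encoding="utf-8-sig")`, where `path` is whatever file is being read.
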
@coveralls

Pull Request Test Coverage Report for Build 10394696884

Details

  • 4 of 5 (80.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.001%) to 97.493%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/icalendar/cli.py | 0 | 1 | 0.0% |

| Totals | |
|---|---|
| Change from base Build 10134522684 | 0.001% |
| Covered Lines | 3178 |
| Relevant Lines | 3256 |

💛 - Coveralls

@niccokunzmann niccokunzmann merged commit 37af7af into collective:main Aug 14, 2024
18 checks passed
@niccokunzmann (Member) commented Aug 14, 2024

Nice! I merge this right away! Thanks!

@vimpostor (Contributor, Author)

> Nice! I merge this right away! Thanks!

Thanks, that was fast! :)

@vimpostor vimpostor deleted the unicode-bom branch August 14, 2024 20:49