parser: Ignore UTF-8 BOM characters
The zero-width no-break space U+FEFF (also known as the BOM) is not
really supposed to occur in ics files. However, some particularly bad
ics file writers generate calendars with a BOM at the beginning.

Decoding with the encoding set to "utf-8-sig" is the easiest way to deal
with this [0], particularly because:
- If the file is UTF-8 with a BOM, then "utf-8" leaves the BOM in the
  content, while "utf-8-sig" transparently removes the BOM
- If the file is standard UTF-8, then "utf-8-sig" decodes the same as
  "utf-8"

[0] https://stackoverflow.com/a/44573867
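
The two cases above can be checked with a minimal sketch using only the
Python standard library. The byte strings and variable names below are
made up for illustration and are not part of this commit:

```python
import codecs

# Hypothetical sample payloads: the same calendar with and without a BOM.
raw_plain = b"BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n"
raw_with_bom = codecs.BOM_UTF8 + raw_plain

# "utf-8" keeps the BOM as a leading U+FEFF character in the text.
with_bom_utf8 = raw_with_bom.decode("utf-8")

# "utf-8-sig" strips a leading BOM if one is present ...
with_bom_sig = raw_with_bom.decode("utf-8-sig")

# ... and behaves like "utf-8" when there is none.
plain_utf8 = raw_plain.decode("utf-8")
plain_sig = raw_plain.decode("utf-8-sig")

print(repr(with_bom_utf8[:6]))
print(repr(with_bom_sig[:6]))
```

Note that "utf-8-sig" only strips a BOM at the very start of the input,
which is where the problematic writers put it.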
vimpostor committed Aug 14, 2024
1 parent 4afb2b2 commit 3e038df
Showing 5 changed files with 8 additions and 3 deletions.
1 change: 1 addition & 0 deletions CHANGES.rst
@@ -24,6 +24,7 @@ Bug fixes:
 - Fix link to stable release of tox in documentation.
 - Fix a bad bytes replace in unescape_char.
 - Handle ``ValueError`` in ``vBinary.from_ical``.
+- Ignore the BOM character in incorrectly encoded ics files.

6.0.0a0 (2024-07-03)
--------------------
2 changes: 1 addition & 1 deletion src/icalendar/cli.py
@@ -78,7 +78,7 @@ def main():
     argv = parser.parse_args()

     for calendar_file in argv.calendar_files:
-        with open(calendar_file) as f:
+        with open(calendar_file, encoding='utf-8-sig') as f:
             calendar = Calendar.from_ical(f.read())
             for event in calendar.walk('vevent'):
                 argv.output.write(view(event) + '\n\n')
4 changes: 2 additions & 2 deletions src/icalendar/parser_tools.py
@@ -22,7 +22,7 @@ def from_unicode(value: ICAL_TYPE, encoding='utf-8') -> bytes:
     return value


-def to_unicode(value: ICAL_TYPE, encoding='utf-8') -> str:
+def to_unicode(value: ICAL_TYPE, encoding='utf-8-sig') -> str:
     """Converts a value to unicode, even if it is already a unicode string.
     """
     if isinstance(value, str):
@@ -31,7 +31,7 @@ def to_unicode(value: ICAL_TYPE, encoding='utf-8') -> str:
         try:
             value = value.decode(encoding)
         except UnicodeDecodeError:
-            value = value.decode('utf-8', 'replace')
+            value = value.decode('utf-8-sig', 'replace')
     return value


2 changes: 2 additions & 0 deletions src/icalendar/tests/calendars/bom_calendar.ics
@@ -0,0 +1,2 @@
+BEGIN:VCALENDAR
+END:VCALENDAR
2 changes: 2 additions & 0 deletions src/icalendar/tests/test_bom_calendar.py
@@ -0,0 +1,2 @@
+def test_bom_calendar(calendars):
+    assert calendars.bom_calendar.walk('VCALENDAR'), "Unable to parse a calendar starting with an Unicode BOM"
