added ZnLossyUTF8Encoder
svenvc authored and svenvc committed May 10, 2024
1 parent 62cdf5e commit 33b049b
Showing 16 changed files with 105 additions and 2 deletions.
@@ -0,0 +1,15 @@
I am ZnLossyUTF8Encoder.
I am a ZnUTF8Encoder.

I behave like my superclass but do not signal errors when I see illegal UTF-8 encoded input;
instead, I output a Unicode Replacement Character (U+FFFD) for each error.

In contrast to my superclass I can read any random byte sequence, decoding both legal and illegal UTF-8 sequences.

Due to my stream based design and usage as well as my stateless implementation,
I will output multiple replacement characters when multiple illegal sequences occur.

My convenience method #decodeBytesSingleReplacement: shows how to decode bytes so that
only a single replacement character stands for any amount of illegal encoding between legal encodings.

Part of Zinc HTTP Components.
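
The difference between the two decoding modes described in the class comment can be sketched as follows. This is an illustrative playground snippet, not part of the commit. It uses only selectors that appear in this commit's code and tests, and the byte values are the ones used in the commit's own test.

```smalltalk
| encoder bytes replacement perError collapsed |
encoder := ZnLossyUTF8Encoder new.
replacement := encoder replacementCodePoint asCharacter.
"$A, then three stray continuation bytes, then $B"
bytes := #[16r41 16rA1 16rA2 16rA3 16r42].
"Default behavior: one replacement character per illegal byte"
perError := encoder decodeBytes: bytes.
"Convenience variant: the whole illegal run collapses into one replacement character"
collapsed := encoder decodeBytesSingleReplacement: bytes.
```

Going by the assertions in testLossyUTF8, perError should hold five characters and collapsed three.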
@@ -0,0 +1,5 @@
accessing
handlesEncoding: string
"Return true when my instances handle the encoding described by string"

^ (self canonicalEncodingIdentifier: string) = 'utf8lossy'
@@ -0,0 +1,3 @@
accessing
knownEncodingIdentifiers
^ #( utf8lossy )
@@ -0,0 +1,22 @@
convenience
decodeBytesSingleReplacement: bytes
	"Decode bytes and return the resulting string.
	This variant of #decodeBytes: emits only a single replacement character
	for each run of consecutive illegal UTF-8 sequences"

	| byteStream replaced replacement char |
	byteStream := bytes readStream.
	replaced := false.
	replacement := self replacementCodePoint asCharacter.
	^ String streamContents: [ :stream |
		[ byteStream atEnd ] whileFalse: [
			char := self nextFromStream: byteStream.
			char = replacement
				ifTrue: [
					replaced
						ifFalse: [
							replaced := true.
							stream nextPut: replacement ] ]
				ifFalse: [
					replaced := false.
					stream nextPut: char ] ] ]
@@ -0,0 +1,3 @@
error handling
errorIllegalContinuationByte
^ self replacementCodePoint
@@ -0,0 +1,3 @@
error handling
errorIllegalLeadingByte
^ self replacementCodePoint
@@ -0,0 +1,3 @@
error handling
errorIncomplete
^ self replacementCodePoint
@@ -0,0 +1,3 @@
error handling
errorOutsideRange
^ self replacementCodePoint
@@ -0,0 +1,3 @@
error handling
errorOverlong
^ self replacementCodePoint
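
The five error handling overrides above all answer the replacement code point where, according to the class comment, the superclass signals an error. A minimal sketch of that contrast follows. It assumes that ZnUTF8Encoder signals some subclass of Error on the stray byte 160, since the exact exception class is not shown in this commit.

```smalltalk
| bytes strictResult lossyResult |
"$A, a stray continuation byte, $B"
bytes := #[65 160 66].
"Strict superclass: the illegal byte is assumed to raise an error, caught here and mapped to nil"
strictResult := [ ZnUTF8Encoder new decodeBytes: bytes ]
	on: Error
	do: [ :exception | exception return: nil ].
"Lossy subclass: the same input decodes, with U+FFFD standing in for the illegal byte"
lossyResult := ZnLossyUTF8Encoder new decodeBytes: bytes.
```

Catching the general Error class keeps the sketch independent of the concrete exception class used by the superclass.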
@@ -0,0 +1,3 @@
accessing
identifier
^ #utf8lossy
@@ -0,0 +1,5 @@
accessing
replacementCodePoint
"Return the code point for the Unicode Replacement Character U+FFFD"

^ 16rFFFD
@@ -0,0 +1,11 @@
{
"commentStamp" : "<historical>",
"super" : "ZnUTF8Encoder",
"category" : "Zinc-Character-Encoding-Core",
"classinstvars" : [ ],
"pools" : [ ],
"classvars" : [ ],
"instvars" : [ ],
"name" : "ZnLossyUTF8Encoder",
"type" : "normal"
}
@@ -1 +1 @@
self packageOrganizer ensurePackage: #'Zinc-Character-Encoding-Core' withTags: #()!
SystemOrganization addCategory: #'Zinc-Character-Encoding-Core'!
@@ -0,0 +1,18 @@
testing
testLossyUTF8
	| encoder replacement |
	encoder := ZnLossyUTF8Encoder new.
	self assert: #utf8lossy asZnCharacterEncoder equals: encoder.
	replacement := encoder replacementCodePoint asCharacter.
	self
		assert: (#[65 160 66] decodeWith: encoder)
		equals: ({ $A. replacement . $B } as: String).
	self
		assert: (#[16rE1 16rA0 16rC0] decodeWith: encoder)
		equals: replacement asString.
	self
		assert: (encoder decodeBytes: #[16r41 16rA1 16rA2 16rA3 16r42])
		equals: ({ $A. replacement . replacement . replacement . $B } as: String).
	self
		assert: (encoder decodeBytesSingleReplacement: #[16r41 16rA1 16rA2 16rA3 16r42])
		equals: ({ $A. replacement . $B } as: String).
@@ -0,0 +1,6 @@
testing
testLossyUTF8Random
| bytes string |
bytes := ((1 to: 10000) collect: [ :_ | 256 atRandom - 1 ]) asByteArray.
string := bytes decodeWith: ZnLossyUTF8Encoder new.
self assert: string isString
@@ -1 +1 @@
self packageOrganizer ensurePackage: #'Zinc-Character-Encoding-Tests' withTags: #()!
SystemOrganization addCategory: #'Zinc-Character-Encoding-Tests'!
