My Favorite Bugs: Invalid Surrogate Pairs

(george.mand.is)

26 points | by meysamazad 2 hours ago

5 comments

  • BobbyTables2 6 minutes ago
    Damn, I’ve never really had to deal with Unicode all that much.

    Was already bad enough that instead of bytes, we have to worry about code points. Now even that isn’t enough?

    It would have been expensive, but all characters should have been fixed-size 64-bit values.

  • georgemandis 20 minutes ago
    Just noticed this is getting some traffic! It's a little buried in the post, but I made an interactive tool for exploring surrogate pairs as part of this: https://george.mand.is/invalid-surrogate-pairs/

    I thought it was something that's easier to play with and get a feel for than to just read about.

  • jonhohle 35 minutes ago
    Once I ran into this it became hard to treat strings “normally” in any situation or, alternatively, I’d force hard encoding requirements in the domain. Regardless, handling grapheme clusters properly is hard and easy to get wrong.

    I recently ported a program from Python to Rust and the original author used string regexes. Input and output document encoding mattered but the characters that needed to be matched were always lower ASCII. The Python program could have used binary regexes, but instead forced an input encoding (UTF-8) and made the user choose an output encoding. When the input comes from an unknown process or legacy data, however, you don’t always get the luxury of assuming the encoding. Switching to binary regexes and ignoring encoding altogether simplified logic, eliminated classes of errors, and made the program work in scenarios it couldn’t earlier. Getting rid of the last decoding/encoding code gave me so much relief, especially when all of the wacky encoding tests I had already written continued to work.
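
    A minimal Python sketch of the bytes-regex idea, for illustration only. The pattern and the `bump_values` helper are hypothetical (the actual program's regexes aren't shown here), and it assumes an encoding in which ASCII bytes never occur inside a multi-byte sequence, which holds for UTF-8 and Latin-1 but not for UTF-16.

```python
import re

# Hypothetical pattern: matches ASCII `key=number` tokens in raw bytes.
KEY_VALUE = re.compile(rb"([a-z_]+)=([0-9]+)")


def bump_values(data: bytes) -> bytes:
    """Increment every ASCII `key=number` value; all other bytes pass through untouched."""

    def repl(m):
        new_value = str(int(m.group(2)) + 1).encode("ascii")
        return m.group(1) + b"=" + new_value

    return KEY_VALUE.sub(repl, data)


# The same function handles different encodings, and even bytes that are
# not valid text in any encoding, because nothing is ever decoded.
utf8_doc = "na\u00efve count=41 \u2603".encode("utf-8")
latin1_doc = "na\u00efve count=41".encode("latin-1")
broken_doc = b"\xff\xfe count=41 \xed\xa0\xbd"
```

    Since no decode or encode step exists, there is nothing that can raise a `UnicodeDecodeError` on legacy input.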

  • skybrian 7 minutes ago
    Writing property tests on functions that work with strings is a good way to find lots of Unicode issues.
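
    A rough sketch of what such a property test can look like, using only the standard library's `random` instead of a dedicated property-testing library. The function under test, `truncate_utf16`, and the generator `random_text` are hypothetical examples: the property is that truncating to N UTF-16 code units never splits a surrogate pair and always returns a prefix of the input.

```python
import random


def truncate_utf16(text: str, max_units: int) -> str:
    """Keep at most max_units UTF-16 code units without splitting a surrogate pair."""
    raw = text.encode("utf-16-le")
    cut = min(max_units * 2, len(raw))
    # In little-endian order the high byte of the last kept unit is raw[cut - 1].
    # 0xD8..0xDB marks a high surrogate, which must not be left dangling.
    if cut >= 2 and 0xD8 <= raw[cut - 1] <= 0xDB:
        cut -= 2
    return raw[:cut].decode("utf-16-le")


def random_text(rng: random.Random, max_len: int = 20) -> str:
    """Mix ASCII, other BMP characters, and astral characters (which need surrogate pairs)."""
    ranges = [(0x20, 0x7E), (0xA0, 0xD7FF), (0x1F300, 0x1FAFF)]
    chars = []
    for _ in range(rng.randint(0, max_len)):
        low, high = rng.choice(ranges)
        chars.append(chr(rng.randint(low, high)))
    return "".join(chars)


def check_truncate(trials: int = 2000, seed: int = 0) -> int:
    """Run the property on random inputs; returns the number of trials that passed."""
    rng = random.Random(seed)
    for _ in range(trials):
        text = random_text(rng)
        limit = rng.randint(0, 45)
        out = truncate_utf16(text, limit)  # raises if a lone surrogate were produced
        assert text.startswith(out)
        assert len(out.encode("utf-16-le")) // 2 <= limit
    return trials
```

    Removing the surrogate check from `truncate_utf16` makes the decode step raise on inputs containing astral characters, which is the kind of failure this style of test tends to surface quickly.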

  • wupatz 28 minutes ago
    It's good to know about surrogate pairs in Unicode. It was new to me too when I was part of tracking down incomplete Unicode flags in the (excellent) Phanpy Mastodon client.

    Author went for Intl.Segmenter too: https://github.com/cheeaun/phanpy/issues/1491
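
    To illustrate why flags are fragile, here is a small Python sketch (not taken from the linked issue) that counts the units in a flag emoji. A flag is two regional-indicator code points, and each one needs a surrogate pair in UTF-16. The `surrogate_pair` helper is a hypothetical name; the sketch only counts units and does not attempt grapheme segmentation the way `Intl.Segmenter` does.

```python
# Regional indicators D + E, typically rendered as a single flag glyph.
flag = "\U0001F1E9\U0001F1EA"

code_points = len(flag)                           # Python strings count code points
utf16_units = len(flag.encode("utf-16-le")) // 2  # what JavaScript's .length would count
utf8_bytes = len(flag.encode("utf-8"))


def surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = surrogate_pair(ord(flag[0]))
print(code_points, utf16_units, utf8_bytes, hex(high), hex(low))
```

    One visible flag is therefore 2 code points, 4 UTF-16 code units, and 8 UTF-8 bytes, so slicing by code units can cut it in three different wrong places.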