• m_‮f@discuss.online
    English
    arrow-up
    57
    ·
    6 months ago

    ; and ; respectively, in case anyone wants to see how it renders on their machine and is also lazy.

    • anton
      arrow-up
      4
      ·
      6 months ago

      As if a whitespace-sensitive language protects from this fuckery.

      • How many thin spaces are one level of indentation?
      • Will anyone notice a hair space?
      • Who can tell the difference between a space and a figure space? They are the same width in a monospaced font.
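A minimal Python sketch of how one might catch the spaces listed above before they reach a reviewer. The function name `suspicious_indent` is made up for illustration. It assumes only an ASCII space or a tab counts as legitimate indentation, and it leans on the standard `unicodedata` module to name whatever else it finds.

```python
import unicodedata

def suspicious_indent(line: str) -> list[tuple[int, str]]:
    """Return (column, Unicode name) for every whitespace character in the
    leading indentation that is not a plain ASCII space or a tab."""
    found = []
    for col, ch in enumerate(line):
        if not ch.isspace():
            break  # first non-whitespace character ends the indentation
        if ch not in (" ", "\t"):
            # Fall back to the code point if the character has no name.
            found.append((col, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return found

# Escapes are used so the invisible characters survive copy and paste.
print(suspicious_indent("  \u2009x = 1"))   # thin space after two real spaces
print(suspicious_indent("\u2007\u200ay"))   # figure space, then hair space
```

Run over every line of a file, something like this would at least answer "will anyone notice a hair space?" with a yes.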
  • anton@piefed.blahaj.zone
    English
    arrow-up
    24
    ·
    6 months ago

    My IDE says: '(', '+', '-', '.', ';', <operator>, '[' or '}' expected, got ';'
    But the Rust compiler explains:

    ```
    error: unknown start of token: \u{37e}
    help: Unicode character ';' (Greek Question Mark) looks like ';' (Semicolon), but it is not
    ```

    What a killjoy.
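For anyone whose toolchain is less helpful than rustc, here is a rough Python sketch of the same check: report every non-ASCII character in a piece of source text along with its Unicode name. The helper names `describe` and `find_non_ascii` are invented for this example, and treating everything above U+007F as suspicious is an assumption that will be too strict for code with legitimate non-ASCII strings or comments.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

def find_non_ascii(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, description) for each non-ASCII character.
    Lines are 1-based, columns are 0-based."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if ord(ch) > 127:
                hits.append((lineno, col, describe(ch)))
    return hits

# The first statement ends in U+037E, the second in a real semicolon.
sample = "let x = 1\u037e\nlet y = 2;"
for hit in find_non_ascii(sample):
    print(hit)
```

One thing to keep in mind: U+037E canonically decomposes to U+003B, so `unicodedata.normalize("NFC", "\u037e")` returns a plain `;`. Any tool that normalizes text on the way through will quietly undo the prank.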
    • Frezik
      English
      arrow-up
      2
      ·
      6 months ago

      Since they’re fundamentally predicting the next token, and there isn’t a lot of training data out there that actually does this, I wouldn’t expect LLMs to start putting in lookalike characters. The characters only look alike to humans.

      That said, you could probably poison their training datasets this way.

      • lordnikon@lemmy.world
        English
        arrow-up
        1
        ·
        6 months ago

        Yeah, that was the idea: get the LLMs to start using lookalike characters to poison their outputs.

  • Infernal_pizza@lemmy.dbzer0.com
    English
    arrow-up
    2
    ·
    6 months ago

    Why do characters like this even exist? I’ve run into this before, when I couldn’t find a file I’d downloaded by searching for it. I remembered which folder it was in and checked that it was still there. After playing around with the name for a bit, I realised the “a” in the file name wasn’t actually an a.
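A file name like that can be spotted with a rough mixed-script check. The Python sketch below guesses each letter's script from the first word of its Unicode name, which is a crude heuristic rather than a real script lookup, and the names `scripts_in` and `looks_mixed` are made up for this example. Which lookalike was in the original file name isn't stated, so the Cyrillic letters here are only an assumption for the demo.

```python
import unicodedata

def scripts_in(name: str) -> set[str]:
    """Rough script guess: the first word of each letter's Unicode name,
    e.g. 'LATIN' for 'a' and 'CYRILLIC' for U+0430."""
    scripts = set()
    for ch in name:
        if ch.isalpha():  # skip digits, dots, underscores and so on
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(name: str) -> bool:
    """True when the letters in a name appear to come from more than one script."""
    return len(scripts_in(name)) > 1

# U+043E is CYRILLIC SMALL LETTER O, visually close to a Latin 'o'.
print(looks_mixed("inv\u043eice.pdf"))
print(looks_mixed("invoice.pdf"))
```

Names written entirely in one non-Latin script pass this check, which is the point: only a mix gets flagged.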

    • Frezik
      English
      arrow-up
      6
      ·
      6 months ago

      The simple answer is that Unicode is a design by committee that attempts to make every single human written language work. It’s more complicated than it needs to be, but we also don’t want to redo all the work it would take to replace it with something saner. That goes especially for the CJK languages (Chinese, Japanese, Korean). Trying to get those three to agree on anything is for people who deal with frustration better than me.