Search: [unicode] - Les liens de la mer numérique

Signal has open-sourced a SQLite extension that provides better support for non-Latin languages (Chinese, Japanese, etc.) in the Full-Text Search (FTS) virtual table.

sqlite · project · rust · programming · unicode

December 17, 2023 02:11:49 PM GMT+01:00 · permalink

·

https://darksi.de/13.sqlite-fts5-structure/

·

The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me

UTF-8 is an encoding. Encoding is how we store code points [of Unicode] in memory.

The simplest possible encoding for Unicode is UTF-32. It simply stores code points as 32-bit integers.

UTF-8 is a variable-length encoding. A code point might be encoded as a sequence of one to four bytes. One or more code points can build a character.

Side effects of UTF-8:

You CAN’T determine the length of the string by counting bytes.
You CAN’T randomly jump into the middle of the string and start reading.
You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.
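
A minimal sketch of these three side effects in Python, assuming the sample string "héllo" (my own example, not from the article). The helper names `cut_is_valid` and `is_continuation_byte` are hypothetical.

```python
# "héllo": é is U+00E9, which takes two bytes in UTF-8.
text = "h\u00e9llo"
data = text.encode("utf-8")

code_points = len(text)   # number of code points
byte_count = len(data)    # number of bytes: not the same thing


def cut_is_valid(raw: bytes, offset: int) -> bool:
    """Return True if raw[:offset] is still valid UTF-8."""
    try:
        raw[:offset].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_continuation_byte(byte: int) -> bool:
    """UTF-8 continuation bytes look like 10xxxxxx: you cannot start reading there."""
    return (byte & 0xC0) == 0x80


print(code_points, byte_count)
print(cut_is_valid(data, 2))          # the cut lands in the middle of é
print(is_continuation_byte(data[2]))  # second byte of é
```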

If you want to compare or count characters, iterate over "extended grapheme clusters", or graphemes for short. A grapheme is a minimally distinctive writing unit in the context of a particular writing system. ö is one grapheme. é is one too. And so is 각.

Is Unicode hard only because of emojis?
No. For example, ö (German) is a single character, but it can be made of multiple code points (U+006F U+0308).

What is 🤦🏼‍♂️ length?
It depends on the encoding used: 5 in Python (code points), 7 in JavaScript / Java / C# (UTF-16 code units), and 17 in Rust (UTF-8 bytes). That’s what extended grapheme clusters are all about: what humans perceive as a single character. And in this case, 🤦🏼‍♂️ is undoubtedly a single character.
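
The three numbers can be reproduced from Python alone, as a rough sketch: the emoji is written with escape sequences so the five code points are visible, and the variable names are mine.

```python
# U+1F926 face palm, U+1F3FC skin tone, U+200D zero-width joiner,
# U+2642 male sign, U+FE0F variation selector-16
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(facepalm)                            # what Python's len() counts
utf16_units = len(facepalm.encode("utf-16-le")) // 2   # what JavaScript / Java / C# count
utf8_bytes = len(facepalm.encode("utf-8"))             # what Rust's str::len() counts

print(code_points, utf16_units, utf8_bytes)
```

None of the three is 1, which is the length a human would expect. Counting graphemes needs a segmentation library, since the Python standard library does not provide one.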

Before comparing strings or searching for a substring, normalize!

Because the same grapheme can be written with different sequences of code points (precomposed or decomposed, or with combining marks in a different order). Also, we want to be able to search for 2 in 𝕏².
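
A small sketch with the standard `unicodedata` module. The helpers `same_text` and `loose_contains` are hypothetical names. Using NFC for equality and NFKC for loose search is one reasonable choice, not the only one.

```python
import unicodedata

composed = "\u00f6"      # ö as a single code point
decomposed = "o\u0308"   # o followed by a combining diaeresis


def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def loose_contains(haystack: str, needle: str) -> bool:
    """Substring search after compatibility normalization (NFKC)."""
    nfkc = lambda s: unicodedata.normalize("NFKC", s)
    return nfkc(needle) in nfkc(haystack)


x_squared = "\U0001D54F\u00b2"   # 𝕏²

print(composed == decomposed)           # raw comparison
print(same_text(composed, decomposed))  # normalized comparison
print(loose_contains(x_squared, "2"))
```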

Unicode is locale-dependent, because two graphemes with the same code points can look different in two languages.

So no, you can’t convert string to lowercase without knowing what language that string is written in. [...] I live in the US/UK, should I even care?

Yes.

What are surrogate pairs?

Unicode decided to allocate some of these 65,536 characters to encode higher code points, essentially converting fixed-width UCS-2 into variable-width UTF-16.

A surrogate pair is two UTF-16 units used to encode a single Unicode code point. For example, D83D DCA9 (two 16-bit units) encodes one code point, U+1F4A9.
The top 6 bits of each unit in a surrogate pair are used for the mask, leaving 2×10 free bits to spare: 1101 10?? ???? ???? (high surrogate) followed by 1101 11?? ???? ???? (low surrogate).
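
The bit layout above can be sketched as two small functions. The names `to_surrogates` and `from_surrogates` are mine. The arithmetic is the standard UTF-16 one: subtract 0x10000, then split the remaining 20 bits into two groups of 10.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits
    high = 0xD800 | (offset >> 10)      # 1101 10?? ???? ????
    low = 0xDC00 | (offset & 0x3FF)     # 1101 11?? ???? ????
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Rebuild the code point from a (high, low) surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F4A9)
print(f"{high:04X} {low:04X}")
```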

Is UTF-16 still alive?

Yes. The only downside of UTF-16 is that everything else is UTF-8, so it requires a conversion every time a string is read from the network or from disk.

unicode · programming · guide

October 7, 2023 08:27:48 PM GMT+02:00 * · permalink

·

https://tonsky.me/blog/unicode/

·

Trojan Source: Invisible Vulnerabilities | Light Blue Touchpaper

Oh boy... that's pretty scary.
To deliberately introduce security holes, sometimes minor changes are enough. For example, replacing "==" (comparison) with "=" (assignment). These "attacks" are visible to a trained eye.

But what happens if the eye can't see them anymore? With Unicode, it is possible to use characters that look like letters of our Latin alphabet but are not, or, worse, to change the writing direction (left-to-right vs. right-to-left) so that the text is displayed one way in the text editor while the compiler interprets it differently. This opens up the possibility of inserting security holes that are almost impossible to see, even if you have the source code in front of you in your text editor.
(For an example of direction inversion, go to this page: https://sebsauvage.net/wiki/ and look for my email address in the page: it shows up normally, but if you look at the HTML source, it shows up as a different text.)

I think it would be interesting if text editors had an option to display in a particular color everything that is not purely "Latin text" (0000-024F), as well as Unicode characters that cause changes (backspace, change of direction).

Proof-of-concept of this attack in different languages can be seen here: https://github.com/nickboucher/trojan-source

(from https://sebsauvage.net/links/?QRVnDw)

We can easily develop an extension for each editor that highlights these characters!
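
A rough sketch of what such a highlighter would have to detect, assuming the rule suggested above: flag everything outside 0000-024F, plus control characters and direction changes. The function name `suspicious_chars` and the exact set of bidirectional controls are my own choices; a real extension would likely need a more careful list.

```python
import unicodedata

# Bidirectional control characters: embeddings, overrides, isolates and their terminators.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}
LATIN_MAX = 0x024F  # upper bound of the "Latin text" range


def suspicious_chars(source: str):
    """Yield (line, column, code point, reason) for each character worth highlighting."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            label = f"U+{ord(ch):04X}"
            if ch in BIDI_CONTROLS:
                yield (line_no, col, label, "bidi control")
            elif unicodedata.category(ch) == "Cc" and ch != "\t":
                yield (line_no, col, label, "control character")
            elif ord(ch) > LATIN_MAX:
                yield (line_no, col, label, "outside 0000-024F")


# Hypothetical input: a right-to-left override, then a Cyrillic "а" posing as a Latin "a".
sample = 'if access == "user\u202e":\n    p\u0430ss'
for finding in suspicious_chars(sample):
    print(finding)
```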

security · unicode

February 1, 2022 08:42:25 AM GMT+01:00 * · permalink

·

https://www.lightbluetouchpaper.org/2021/11/01/trojan-source-invisible-vulnerabilities/

·

Unicode sorting is hard & why browsers added special emoji matching to regexp

unicode · javascript

July 9, 2021 12:21:47 PM GMT+02:00 * · permalink

·

https://devlog.hexops.com/2021/unicode-sorting-why-browsers-added-special-emoji-matching

·

Font Generator & Font Changer - Cool Fancy Text Generator

Transforms text into another style using only Unicode characters 👍

#idea #project: a CLI tool that takes text as input and returns the transformed string. It sounds easy at first, but it is definitely hard if you want to support Unicode!
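
A first sketch of the easy part of that idea, assuming a single style (Mathematical Bold) and ASCII letters and digits only. The name `to_bold` is hypothetical. Accented letters, other scripts and already-styled input are left untouched here, which is exactly where it gets hard.

```python
import unicodedata

# First code points of the Mathematical Bold ranges.
BOLD_UPPER_A = 0x1D400
BOLD_LOWER_A = 0x1D41A
BOLD_DIGIT_0 = 0x1D7CE


def to_bold(text: str) -> str:
    """Map ASCII letters and digits to Mathematical Bold; leave everything else as-is."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER_A + ord(ch) - ord("a")))
        elif "0" <= ch <= "9":
            out.append(chr(BOLD_DIGIT_0 + ord(ch) - ord("0")))
        else:
            out.append(ch)
    return "".join(out)


styled = to_bold("Hello 42")
print(styled)
# NFKC normalization maps the styled text back to plain ASCII.
print(unicodedata.normalize("NFKC", styled))
```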

font · generator · unicode

June 27, 2021 01:24:18 PM GMT+02:00 * · permalink

·

https://coolsymbol.com/cool-fancy-text-generator.html

·

Exprimez vos émotions en regex !

Using regex to match emojis and flags 👍

Links about unicode are available at the end of the article too.

Tip: country flags are encoded with "Regional Indicator Symbol Letters".
They run from A, encoded as U+1F1E6 (🇦), to Z, encoded as U+1F1FF (🇿).
The French flag is encoded as "FR", the German one as "DE".
To write the French flag, you therefore need U+1F1EB (🇫) and U+1F1F7 (🇷). Written side by side, they give 🇫🇷.
Likewise for the German flag (U+1F1E9 U+1F1EA): 🇩🇪
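
The mapping above can be sketched in a few lines of Python. The `flag` helper and the `FLAG_RE` pattern are my own names, and the pattern is a simplification: it matches any pair of regional indicators, whether or not the pair is a real country code.

```python
import re

REGIONAL_INDICATOR_A = 0x1F1E6  # regional indicator symbol letter A


def flag(country_code: str) -> str:
    """Turn a two-letter country code such as "FR" into its flag sequence."""
    if len(country_code) != 2 or not (country_code.isascii() and country_code.isalpha()):
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


# Two consecutive regional indicator symbols.
FLAG_RE = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

text = "Paris " + flag("FR") + " Berlin " + flag("DE")
print(FLAG_RE.findall(text))
```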

EDIT: old codes are available in a previous shaare

unicode · regex · emoji

November 4, 2019 09:24:20 PM GMT+01:00 * · permalink

·

https://www.synbioz.com/blog/tech/exprimez-vos-%C3%A9motions-en-regex

·