beenull/ladybird

mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2024-09-29 16:21:29 +00:00

Author	SHA1	Message	Date
Shannon Booth	0b864bef60	LibTextCodec: Implement UTF8Decoder::to_utf8 using AK::String. String::from_utf8_with_replacement_character is equivalent to https://encoding.spec.whatwg.org/#utf-8-decode from the encoding spec, so we can simply call through to it.	2024-08-12 06:38:58 -04:00
BenJilks	0ca5675d59	LibTextCodec: Implement `iso-2022-jp` encoder. Implements the `iso-2022-jp` encoder, as specified by https://encoding.spec.whatwg.org/#iso-2022-jp-encoder	2024-08-08 17:49:58 +01:00
BenJilks	08a8d67a5b	LibTextCodec: Implement `shift_jis` encoder. Implements the `shift_jis` encoder, as specified by https://encoding.spec.whatwg.org/#shift_jis-encoder	2024-08-08 17:49:58 +01:00
BenJilks	d80575a410	LibTextCodec: Implement `gb18030` and `gbk` encoders. Implements the `gb18030` and `gbk` encoders, as specified by https://encoding.spec.whatwg.org/#gb18030-encoder and https://encoding.spec.whatwg.org/#gbk-encoder	2024-08-08 17:49:58 +01:00
BenJilks	34c8c559c1	LibTextCodec: Implement `big5` encoder. Implements the `big5` encoder, as specified by https://encoding.spec.whatwg.org/#big5-encoder	2024-08-08 17:49:58 +01:00
BenJilks	826292536c	LibTextCodec: Implement `euc-kr` encoder. Implements the `euc-kr` encoder, as specified by https://encoding.spec.whatwg.org/#euc-kr-encoder	2024-08-08 17:49:58 +01:00
BenJilks	72d0e3284b	LibTextCodec+LibURL: Implement `utf-8` and `euc-jp` encoders. Implements the corresponding encoders and selects the appropriate one when encoding URL search params. If an encoder for the given encoding cannot be found, fall back to utf-8.	2024-08-08 17:49:58 +01:00
Andreas Kling	1a46d8df5f	LibTextCodec: Use String::from_utf8() when decoding UTF-8 to UTF-8. This way, we still perform UTF-8 validation, but don't go through the slow generic code path that rebuilds the decoded string one code point at a time. This was a bottleneck when loading a canned copy of reddit.com, which ended up being ~120 MiB large. Time spent decoding UTF-8 before this change: 1192 ms. Time spent decoding UTF-8 after this change: 154 ms. That's still a long time, but 7.7x faster is nothing to sneeze at! :^) Note that if the input fails UTF-8 validation, we still fall back to the slow path and insert replacement characters per the WHATWG Encoding spec: https://encoding.spec.whatwg.org/#utf-8-decode	2024-07-20 14:29:37 +02:00
Timothy Flynn	368dad54ef	LibTextCodec: Use AK facilities to validate and convert UTF-16 to UTF-8. This allows LibTextCodec to make use of simdutf, and also reduces the number of places with manual UTF-16 implementations.	2024-07-18 19:43:57 +02:00
Simon Wanner	0ab4722cee	LibTextCodec: Use generated lookup tables for all single byte decoders	2024-06-04 10:21:07 +02:00
Simon Wanner	6b2c459901	LibTextCodec: Fix ISO-8859-1 vs. windows-1252 handling in web contexts. The Encoding specification maps ISO-8859-1 to windows-1252 and expects the windows-1252 translation table to be used, which differs from ISO-8859-1 for 0x80-0x9F. Other contexts expect to get the actual ISO-8859-1 encoding, with 1-to-1 mapping to U+0000-U+00FF, when requesting it. `decoder_for_exact_name` is introduced, which skips the mapping from aliases to the encoding name done by `get_standardized_encoding`.	2024-06-04 10:21:07 +02:00
Simon Wanner	46d5cf0443	LibTextCodec: Fix some incorrect encoding aliases	2024-06-04 10:21:07 +02:00
Simon Wanner	09f2d79cb1	LibTextCodec: Bring TextCodec::get_standardized_encoding closer to spec	2024-06-04 10:21:07 +02:00
Simon Wanner	11bb216912	LibTextCodec: Add replacement decoder	2024-05-31 07:56:26 +02:00
Simon Wanner	7f3b457e62	LibTextCodec: Add EUC-KR decoder	2024-05-31 07:56:26 +02:00
Simon Wanner	ded6512ca8	LibTextCodec: Add Shift_JIS decoder	2024-05-31 07:56:26 +02:00
Simon Wanner	06f7c393b2	LibTextCodec: Add ISO-2022-JP decoder	2024-05-31 07:56:26 +02:00
Simon Wanner	45f0ae52be	LibTextCodec: Add EUC-JP decoder	2024-05-31 07:56:26 +02:00
Simon Wanner	9943bb1d8e	LibTextCodec: Add Big5 decoder	2024-05-31 07:56:26 +02:00
Simon Wanner	2ce61fe6ea	LibTextCodec: Add GBK/GB18030 decoder. Includes changes from GB-18030-2022, which are not yet included in the Encoding Specification, but WebKit, Blink and WPT are already updated.	2024-05-31 07:56:26 +02:00
Simon Wanner	9ed52504ab	LibTextCodec: Delegate to process() in default validate() implementation	2024-05-31 07:56:26 +02:00
Simon Wanner	88c2586f25	LibTextCodec: Remove unused decoder classes	2024-05-31 07:56:26 +02:00
Simon Wanner	b79815c5a5	LibTextCodec: Add x-mac-cyrillic decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	07a9435da5	LibTextCodec: Add windows-1258 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	275b89720b	LibTextCodec: Add windows-1257 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	c76308c7e6	LibTextCodec: Add windows-1256 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	eb9ed10573	LibTextCodec: Add windows-1253 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	2d35687db0	LibTextCodec: Add windows-874 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	1b6878b6ca	LibTextCodec: Add KOI8-U decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	1fd3a6f48c	LibTextCodec: Add ISO-8859-16 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	3e882f26db	LibTextCodec: Sort checks in decoder_for mostly alphabetically. Keeps checks for common encodings (Latin1 & UTF-*) at the top.	2024-05-27 20:50:50 +02:00
Simon Wanner	56241df604	LibTextCodec: Add ISO-8859-14 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	4188e328ac	LibTextCodec: Add ISO-8859-13 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	cc640f4363	LibTextCodec: Add ISO-8859-10 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	d73220837e	LibTextCodec: Add ISO-8859-8(-I) decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	24028e353e	LibTextCodec: Add ISO-8859-7 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	01c3b8091a	LibTextCodec: Add ISO-8859-6 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	763d904ad5	LibTextCodec: Add ISO-8859-5 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	c6b17320db	LibTextCodec: Add ISO-8859-4 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	6c84edaaa2	LibTextCodec: Add ISO-8859-3 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	fc783199f1	LibTextCodec: Add IBM866 decoder	2024-05-27 20:50:50 +02:00
Simon Wanner	96b3c35358	LibTextCodec: Implement table based decoders as SingleByteDecoder. Instead of copy-pasting the implementation, let's use a single class. This "Single Byte Decoder" concept even exists in the Encoding Spec :^)	2024-05-27 20:50:50 +02:00
Michal Grich	7a6d84d036	LibTextCodec: Add Windows-1250 text decoder. This commit adds Windows-1250 decoding based on the unicode.org mapping table.	2024-04-23 16:26:16 +02:00
Andreas Kling	3c039903fb	LibTextCodec+AK: Don't validate UTF-8 strings twice. UTF8Decoder was already converting invalid data into replacement characters while converting, so we know for sure we have valid UTF-8 by the time conversion is finished. This patch adds a new StringBuilder::to_string_without_validation() and uses it to make UTF8Decoder avoid half the work it was doing.	2023-12-30 13:49:50 +01:00
Nico Weber	8f47acee6a	LibTextCodec: Add PDFDocEncoding decoder	2023-11-22 09:08:06 -07:00
Idan Horowitz	079c96376c	LibTextCodec: Support validating encoded inputs	2023-11-17 16:02:36 +01:00
Luke Wilde	eaa4048870	LibTextCodec: Add "get output encoding" from the Encoding specification	2023-06-19 06:12:26 +02:00
Timothy Flynn	00fa23237a	LibTextCodec: Change UTF-8's decoder to replace invalid code points. The UTF-8 decoder will currently crash if it is provided invalid UTF-8 input. Instead, change its behavior to match that of all other decoders to replace invalid code points with U+FFFD. This is required by the web.	2023-05-12 05:47:36 +02:00
Andreas Kling	a504ac3e2a	Everywhere: Rename equals_ignoring_case => equals_ignoring_ascii_case. Let's make it clear that these functions deal with ASCII case only.	2023-03-10 13:15:44 +01:00
Luke Wilde	e864444fe3	LibTextCodec/Latin1: Iterate over input string with u8 instead of char. Using char causes bytes equal to or over 0x80 to be treated as a negative value and produce incorrect results when implicitly casting to u32. For example, `atob` in LibWeb uses this decoder to convert non-ASCII values to UTF-8, but non-ASCII values are >= 0x80 and thus produce incorrect results in such cases: `Uint8Array.from(atob("u660"), c => c.charCodeAt(0));` This used to produce [253, 253, 253] instead of [187, 174, 180]. Required by Cloudflare's IUAM challenges.	2023-02-28 08:46:06 +00:00
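
The last entry above (e864444fe3) describes a sign-extension problem: a byte >= 0x80 held in a signed `char` becomes a huge value when widened to a 32-bit unsigned integer. The following is a minimal sketch of that effect only, not the LibTextCodec code. The function names `widen_via_signed_char` and `widen_via_u8` are hypothetical, and it uses the C++ standard library rather than AK types. `signed char` is spelled out because whether plain `char` is signed is implementation-defined.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper reproducing the bug pattern: each byte is read as a
// signed char and then widened to a 32-bit unsigned value. Bytes >= 0x80 are
// negative as signed char, so they sign-extend (0xBB becomes 0xFFFFFFBB on
// two's-complement platforms).
std::vector<std::uint32_t> widen_via_signed_char(std::string const& input)
{
    std::vector<std::uint32_t> code_points;
    for (std::size_t i = 0; i < input.size(); ++i) {
        signed char ch = static_cast<signed char>(input[i]);
        code_points.push_back(static_cast<std::uint32_t>(ch));
    }
    return code_points;
}

// Hypothetical helper showing the fix pattern: each byte is read as an
// unsigned 8-bit value first, so widening keeps it in the range 0..255.
// For Latin-1, that byte value is the code point.
std::vector<std::uint32_t> widen_via_u8(std::string const& input)
{
    std::vector<std::uint32_t> code_points;
    for (std::size_t i = 0; i < input.size(); ++i) {
        std::uint8_t byte = static_cast<std::uint8_t>(input[i]);
        code_points.push_back(byte);
    }
    return code_points;
}
```

Calling `widen_via_u8` on the three bytes 0xBB 0xAE 0xB4 (the base64 decoding of "u660") yields 187, 174, 180, matching the expected values quoted in the commit message. `widen_via_signed_char` yields values far outside the Latin-1 range for the same input.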


87 commits