There is no good, flexible, backwards-compatible solution for text in C++. Choosing a library means either taking on an entire framework (Qt), getting involved in a very complex interface (ICU), dealing with sometimes limiting API choices (CopperSpice), or accepting a well-done but very opinionated design (the proposed-for-Boost Boost.Text library).

Where is the standard library-friendly, maximum performance solution for handling text encoding and decoding in C++ and C?

This project is the push to reach that goal.

Current Funding

The project has the goal of being fully funded by mid 2020 so that all users can have a high quality solution for text that is not kept within one company or ecosystem, but ported to the Standard Library for use by everyone. If you want the proposal, please e-mail me.

Funding goes toward:

  • Funding development;
  • Targeting specific features;
  • Covering general library support;
  • Covering specific company or vendor support;
  • and, Attending WG14 (C Committee) and WG21 (C++ Committee) meetings.

Specialized solutions for C++11 (or C++03) can be made. If you, your company, or your organization are interested in helping, or need special features or early access to the features listed below, please get in touch.

Funding Goals and Progress

Below are the published funding goals. Sponsors may pay into specific goals or, if given a large enough donation, create a new goal entirely; otherwise, funding falls into the categories in a top-to-bottom, linear fashion. Goals marked (Stretch) are not quite bare-minimum necessary, but would be absolutely wonderful to accomplish!

  • Bootstrap Initial Development, to get library tested and released;
  • Reach Full-Time Text Development to reach 2020 Goal;
  • Cover C Standard Library development to reach the maximum number of users with basic functionality.

Current Goal: Bootstrap Initial Development

Current Goal Total: $4,375.64 USD / $24,000.00 USD

[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Technical Details

The work is ongoing.

The C++ library includes the C library as a submodule and builds on top of it for fast-path functions. Internally, the C library is implemented with C++ and, hopefully soon, vectorized by hand or with SIMD/std::experimental::simd.

The principles and inner workings of the implementation are detailed in a series of talks, slides and posts:

  1. C++ on Sea 2020
    🤿 Deep C Diving - Fast and Scalable Text Interfaces at the Bottom 🤿
    July 16th, 2020
    Virtual Conference
  2. C++ Russia Moscow 2020
    🏎 Burning Silicon - Speed for Transcoding in C++23
    June 30th, 2020
    Virtual Conference
  3. Pure Virtual C++ 2020
    Lucky 7 - Designing Text Encodings for C++
    April 30th, 2020
    Virtual Conference
    • Abstract: Text handling in the C and C++ Standards is a tale of legacy encodings and a demonstration of decisions that work in the moment but don’t scale up to the needs of tomorrow. With Unicode on the horizon, C++20 prepared fundamental changes such as char8_t and polished a few things to make it easier to catch bad conversions and logical program errors when working with encoded text. Still, the landscape has poor support for transcoding from one encoding to another, let alone for higher level algorithms such as comparing two text forms which render identically to the user but have different bit patterns. This talk explores the fundamental design space behind Encoding, Decoding and Transcoding text. It describes the benefits of the API under active consideration for text, potential speed gains from such an API, and how it enables better handling of complex tasks such as normalization.
    • Video
    • Slides
  4. Meeting C++ 2019
    Catching ⬆️: Unicode for C++ in Greater Detail - 2 of 5
    Saturday, November 16th, 2019
    Berlin, Germany
  5. CppCon 2019
    Catching ⬆️: The (Baseline) Unicode Plan for C++23
    Friday, September 20th, 2019
    Aurora, Colorado
  6. Study Group 16 - Text and Unicode
    A Rudimentary Unicode Abstraction
    Wednesday, March 7th, 2018
    Boston, Massachusetts

The current spread of goals is as follows.

Ⅰ: Core Text Utilities [ 36% ]

  • Encoding objects for one-by-one encoding and decoding.
    • utf8, utf16, utf32, narrow_execution and wide_execution Encoding Object types;
    • and, basic_utf8<char_type>, basic_utf16<char_type>, and basic_utf32<char_type> types.
  • decode(...), encode(...), and transcode(...) functions.
  • decode_view<encoding, ...>, encode_view<encoding, ...>, and transcode_view<encoding, ...> range types.
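As a rough illustration of what "one-by-one" decoding means, the sketch below pulls a single code point off the front of a UTF-8 sequence using only the standard library. The names decode_result and decode_one_utf8 are hypothetical and are not this library's API; the actual encoding objects may be shaped quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type: the decoded code point plus the number of
// code units consumed. consumed == 0 signals ill-formed or truncated input.
struct decode_result {
    char32_t code_point;
    std::size_t consumed;
};

// Decode exactly one code point from the front of `input`.
// Rejects overlong forms, surrogates, and values above U+10FFFF.
inline decode_result decode_one_utf8(std::string_view input) {
    if (input.empty()) {
        return {0, 0};
    }
    const auto b0 = static_cast<unsigned char>(input[0]);
    if (b0 < 0x80) {
        return {static_cast<char32_t>(b0), 1};
    }
    std::size_t length = 0;
    char32_t cp = 0;
    char32_t minimum = 0; // smallest value allowed for this sequence length
    if ((b0 & 0xE0) == 0xC0) {
        length = 2; cp = b0 & 0x1F; minimum = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        length = 3; cp = b0 & 0x0F; minimum = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        length = 4; cp = b0 & 0x07; minimum = 0x10000;
    } else {
        return {0, 0}; // stray continuation byte or invalid lead byte
    }
    if (input.size() < length) {
        return {0, 0}; // truncated sequence
    }
    for (std::size_t i = 1; i < length; ++i) {
        const auto b = static_cast<unsigned char>(input[i]);
        if ((b & 0xC0) != 0x80) {
            return {0, 0}; // not a continuation byte
        }
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return {0, 0};
    }
    return {cp, length};
}
```

A caller would loop, advancing by `consumed` each time, which is roughly the shape a decode range or bulk decode function could be layered on top of.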

[ ⣿⣿⣿⣿⣿⣿⣿⣿⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅱ: Normalization Forms [ 0% ]

  • All four Unicode Normalization Forms, as specified in UAX #15.
    • Canonical Form nfc
    • Canonical Form nfd
    • Compatibility Form nfkc
    • Compatibility Form nfkd
  • text_view<Encoding, NormalizationForm, Container>
  • text<Encoding, NormalizationForm, Container>
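To show why normalization matters, here is a toy sketch in which two code point sequences that render the same ("é" as U+00E9, and "e" followed by U+0301) compare unequal until they are composed. The three-entry table and the toy_compose function are assumptions made purely for illustration; real NFC per UAX #15 needs the full Unicode data, canonical reordering, and Hangul handling, none of which appear here.

```cpp
#include <cassert>
#include <string>

// One canonical composition pair: base + combining mark -> composed form.
struct composition {
    char32_t base;
    char32_t mark;
    char32_t composed;
};

// Toy table with three well-known pairs only.
constexpr composition toy_table[] = {
    {U'e', 0x0301, 0x00E9}, // e + combining acute     -> é
    {U'a', 0x0301, 0x00E1}, // a + combining acute     -> á
    {U'o', 0x0308, 0x00F6}, // o + combining diaeresis -> ö
};

// Merge each (base, mark) pair found in the toy table into its composed form.
inline std::u32string toy_compose(const std::u32string& input) {
    std::u32string output;
    for (char32_t c : input) {
        bool merged = false;
        if (!output.empty()) {
            for (const composition& entry : toy_table) {
                if (output.back() == entry.base && c == entry.mark) {
                    output.back() = entry.composed;
                    merged = true;
                    break;
                }
            }
        }
        if (!merged) {
            output.push_back(c);
        }
    }
    return output;
}
```

A text type parameterized on a normalization form could, in principle, maintain this kind of invariant on every mutation so that comparisons operate on a single canonical representation.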

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅲ: User Extensibility Hooks for (User) Encodings [ 0% ]

  • text_encode, text_decode, text_transcode, and text_transcode_one free function ADL hooks
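The general mechanism behind a free function ADL hook can be sketched as follows. The hook name text_decode comes from the list above, but its signature here, the lib::decode entry point, and the user::latin1 encoding are all assumptions for illustration and should not be read as the library's actual interface.

```cpp
#include <cassert>
#include <string>
#include <string_view>

namespace user {
    // A user-defined encoding tag: ISO-8859-1, where each byte value is
    // the code point of the same value.
    struct latin1 {};

    // Hypothetical hook signature. It is found through argument-dependent
    // lookup because `latin1` is declared in namespace `user`.
    inline std::u32string text_decode(latin1, std::string_view input) {
        std::u32string output;
        output.reserve(input.size());
        for (char c : input) {
            output.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
        }
        return output;
    }
}

namespace lib {
    // Library-side entry point. The unqualified call is what allows ADL to
    // pick up an overload from the encoding's own namespace.
    template <typename Encoding>
    std::u32string decode(Encoding encoding, std::string_view input) {
        return text_decode(encoding, input);
    }
}
```

The appeal of this design is that a user can plug in a bulk, hand-optimized conversion for their own encoding without specializing anything inside the library's namespace.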

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅳ: Byte Buffers and Streaming [ 0% ]

  • encoding_scheme<Encoding, endian, Byte> - transformative encoding that always presents its code unit type as Byte (defaults to std::byte).
  • incomplete_handler<Handler> - finish incomplete sequences and execute underlying handler if sequence is incomplete.
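The idea of presenting an encoding's code units as bytes with a fixed endianness can be sketched with plain standard C++. The byte_order enumeration and utf16_to_bytes function below are hypothetical stand-ins, assuming UTF-16 as the wrapped encoding; they only illustrate the serialization step, not the encoding_scheme type itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the endianness parameter.
enum class byte_order { big, little };

// Serialize UTF-16 code units into bytes in the requested byte order,
// independent of the host machine's own endianness.
inline std::vector<std::byte> utf16_to_bytes(std::u16string_view units, byte_order order) {
    std::vector<std::byte> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const auto high = static_cast<std::byte>((unit >> 8) & 0xFF);
        const auto low = static_cast<std::byte>(unit & 0xFF);
        if (order == byte_order::big) {
            bytes.push_back(high);
            bytes.push_back(low);
        } else {
            bytes.push_back(low);
            bytes.push_back(high);
        }
    }
    return bytes;
}
```

Working through shifts rather than reinterpreting memory keeps the result the same on every platform, which is the property a byte-oriented scheme needs when writing to files or network buffers.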

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅴ: CJK Encoding Tests [ 0% ]

  • Implement gb18030, the official government Unicode Transformation Format encoding of the PRC.
  • Implement the legacy shift_jis/euc_jp/iso2022_jp encodings.
    • Priority goes to shift_jis/euc_jp, as they encode more traffic.

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅵ: (Stretch) Enhanced Execution Encoding [ 0% ]

  • Reach into platform-specific functions to rip out guts of platform’s current encoding to ensure preservation of Unicode in:
    • narrow_execution
    • wide_execution

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅶ: (Stretch) Hyper-Scrutinized Vectorization Implementation [ 0% ]

  • Apply vectorization techniques to conversions between pairs of encodings among ascii, utf8, utf16, and utf32.
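One common building block for fast conversions is an ASCII fast path that inspects several bytes per step. The sketch below uses a portable word-at-a-time check (eight bytes per 64-bit load) rather than real SIMD intrinsics or std::experimental::simd; the function name is_ascii is an assumption and does not describe how this project's implementation works.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Returns true when every byte in `input` is below 0x80.
// Checks eight bytes at a time, then finishes the tail byte by byte.
inline bool is_ascii(std::string_view input) {
    const char* data = input.data();
    const std::size_t size = input.size();
    std::size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        std::uint64_t word;
        // memcpy avoids alignment and aliasing problems; compilers
        // typically lower it to a single load.
        std::memcpy(&word, data + i, sizeof(word));
        if ((word & 0x8080808080808080ULL) != 0) {
            return false; // at least one byte has its high bit set
        }
    }
    for (; i < size; ++i) {
        if ((static_cast<unsigned char>(data[i]) & 0x80) != 0) {
            return false;
        }
    }
    return true;
}
```

When a block passes this check, a UTF-8 to UTF-16 or UTF-32 conversion can widen each byte directly and skip per-code-point decoding for that block.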

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅷ: (Stretch) C Library for Span-Based Conversions [ 0% ]

  • As detailed in proposal N2440: C functions for fast conversions.
  • Cover INCITS/ANSI fees.
  • Take functionality through all of WG14, put into C Libraries such as:
    • musl.
    • glibc.
    • Potentially: new LLVM libc.

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅸ: (Stretch) WHATWG Encoding Functionality [ 0% ]

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅹ: (Stretch) Strong Exception Guarantee [ 0% ]

  • std::text<Encoding, NormalizationForm, Container>: strong exception guarantee on all applicable operations.
  • noexcept container support for std::text and std::text_view
    • noexcept allocator support.
    • Container operations are made conditionally noexcept where possible, based on the allocator and the movability of the inserted types.
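Conditional noexcept can be sketched with a small wrapper whose operations inherit their exception specification from the underlying container. The basic_text_sketch and throwing_container types are hypothetical and exist only for this illustration; they are not the proposed std::text.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical wrapper: each operation is noexcept only when the
// corresponding container operation is.
template <typename Container>
class basic_text_sketch {
public:
    basic_text_sketch() = default;

    explicit basic_text_sketch(Container storage)
        noexcept(std::is_nothrow_move_constructible_v<Container>)
        : storage_(std::move(storage)) {}

    void swap(basic_text_sketch& other)
        noexcept(std::is_nothrow_swappable_v<Container>) {
        using std::swap;
        swap(storage_, other.storage_);
    }

    const Container& storage() const noexcept { return storage_; }

private:
    Container storage_;
};

// A container-like type whose moves are allowed to throw.
struct throwing_container {
    throwing_container() = default;
    throwing_container(throwing_container&&) noexcept(false) {}
    throwing_container& operator=(throwing_container&&) noexcept(false) { return *this; }
};

using nothrow_text = basic_text_sketch<std::u32string>;
using throwing_text = basic_text_sketch<throwing_container>;

// The wrapper's guarantee follows the container's guarantee.
static_assert(noexcept(std::declval<nothrow_text&>().swap(std::declval<nothrow_text&>())),
              "swap should be noexcept for std::u32string storage");
static_assert(!noexcept(std::declval<throwing_text&>().swap(std::declval<throwing_text&>())),
              "swap should not be noexcept when the container's moves may throw");
```

Generic code can then query these specifications at compile time and pick a move-based or copy-based strategy accordingly.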

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]