There is no good, flexible, backwards-compatible solution for text in C++. Choosing a library means either taking on an entire framework (Qt), working through a very complex interface (ICU), dealing with sometimes-limiting API choices (CopperSpice), or accepting a well-done but very opinionated design (the proposed-for-Boost Boost.Text library).

Where is the standard library-friendly, maximum performance solution for handling text encoding and decoding in C++ and C?

This project is the push to reach that goal.

Publicly Available Implementation

The publicly available implementation is here: https://ztdtext.rtfd.io. You can track progress on this page, through the documentation’s “Progress & Future Work” section, or at the GitHub Repository.

Liberate your text using the ztd.text library.

The C Library implementation — Cuneicode — will be made publicly available as funding, scholarship, and sponsorship goals are reached.

Current Funding

Funding goes toward:

  • Funding development;
  • Targeting specific features;
  • Covering general library support;
  • Covering specific company or vendor support;
  • and attending WG14 (C Committee) and WG21 (C++ Committee) meetings.

Specialized solutions for C++11 (or C++03) can be made. If you, your company, or your organization are interested in helping, or need special features or early access to the features listed below, please get in touch with these folk through their website or by e-mail.

Funding Goals and Progress

Below are the published funding goals. Sponsors may pay into specific goals or, with a large enough donation, create a new goal entirely; otherwise, funding falls into the categories in a top-to-bottom, linear fashion. Goals marked (Stretch) are not quite bare-minimum necessary, but would be absolutely wonderful to accomplish!

  • [🎊 Accomplished!] Bootstrap Initial Development, to get library tested and released;
  • Normalization Forms and C-based Span Implementation (Cuneicode C Library);
  • WHATWG and CJK Encoding Tests;
  • Cover C Standard Library development to reach the maximum number of users with basic functionality;
  • Reach Full-Time Text Development to reach 2022 Goal.

Current Goal: Normalization Forms and C-based Span Implementation

Current Goal Total: $1,275.55 USD / $20,000.00 USD

[ ⣿⣿⣤⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Technical Details

The work is ongoing. The latest public documentation for the released library can be found here: https://ztdtext.rtfd.io.

The C++ library includes the C library as a submodule and builds on top of it for fast-path functions. Internally, the C library is implemented with C++ and, hopefully soon, will be vectorized by hand or with SIMD/std::experimental::simd.

The principles and inner workings of the implementation are detailed in a series of talks, slides and posts:

  1. !!Con 2021 😱 Oh, No! 😱 The Lowest-level‡ Programming Language is Unicode-aware and I have no excuses?!
    May 20th, 2021
    Virtual Conference
  2. C++ on Sea 2020 🤿 Deep C Diving - Fast and Scalable Text Interfaces at the Bottom 🤿
    July 16th, 2020
    Virtual Conference
  3. C++ Russia Moscow 2020
    🏎 Burning Silicon - Speed for Transcoding in C++23
    June 30th, 2020
    Virtual Conference
  4. Pure Virtual C++ 2020
    Lucky 7 - Designing Text Encodings for C++
    April 30th, 2020
    Virtual Conference
    • Abstract: Text handling in the C and C++ Standards is a tale of legacy encodings and a demonstration of decisions that work in the moment but don’t scale up to the needs of tomorrow. With Unicode on the horizon, C++20 prepared fundamental changes such as char8_t and polished a few things to make it easier to catch bad conversions and logical program errors when working with encoded text. Still, the landscape has poor support for transcoding from one encoding to another, let alone higher-level algorithms such as comparing two pieces of text which render identically to the user but have different bit patterns. This talk explores the fundamental design space behind Encoding, Decoding and Transcoding text. It describes the benefits of the text API under active consideration, potential speed gains from such an API, and how it enables better handling of complex tasks such as normalization.
    • Video
    • Slides
  5. Meeting C++ 2019
    Catching ⬆️: Unicode for C++ in Greater Detail - 2 of 5
    Saturday, November 16th, 2019
    Berlin, Germany
  6. CppCon 2019
    Catching ⬆️: The (Baseline) Unicode Plan for C++23
    Friday, September 20th, 2019
    Aurora, Colorado
  7. Study Group 16 - Text and Unicode
    A Rudimentary Unicode Abstraction
    Wednesday, March 7th, 2018
    Boston, Massachusetts

The current spread of goals is as follows.

Ⅰ: Core Text Utilities [ 🎉 COMPLETE 🎉 ]

Finished and documented here:

[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]

Ⅱ: User Extensibility Hooks for (User) Encodings [ 🎉 COMPLETE 🎉 ]

Finished and documented here: https://ztdtext.readthedocs.io/en/latest/design/lucky%207%20extensions/speed.html

[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]

Ⅲ: Byte Buffers and Streaming [ 🎉 COMPLETE 🎉 ]

Finished and documented here: https://ztdtext.readthedocs.io/en/latest/api/encodings/encoding_scheme.html

[ ⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ]
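To give a feel for what a byte-buffer layer has to do, here is a minimal sketch of serializing UTF-16 code units into bytes with an explicit byte order. It is not the library's API: `toy_endian` and `toy_utf16_to_bytes` are hypothetical names made up for this illustration, and the real `encoding_scheme` documented at the link above is a far more general design.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical byte-order tag for this sketch only.
enum class toy_endian { little, big };

// Hypothetical helper: writes each 16-bit code unit as two bytes,
// in the byte order the caller asks for.
std::vector<unsigned char> toy_utf16_to_bytes(const std::u16string& units, toy_endian order) {
    std::vector<unsigned char> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t unit : units) {
        unsigned char high = static_cast<unsigned char>((unit >> 8) & 0xFF);
        unsigned char low = static_cast<unsigned char>(unit & 0xFF);
        if (order == toy_endian::big) {
            bytes.push_back(high);
            bytes.push_back(low);
        } else {
            bytes.push_back(low);
            bytes.push_back(high);
        }
    }
    return bytes;
}
```

Because the byte order is spelled out in shifts and masks, the result does not depend on the endianness of the machine running the code.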

Ⅳ: Normalization Forms [ 5% ]

  • All four Unicode Normalization Forms, as specified in UAX #15.
    • Canonical Form nfc
    • Canonical Form nfd
    • Compatibility Form nfkc
    • Compatibility Form nfkd
  • text_view<Encoding, NormalizationForm, Container>
  • text<Encoding, NormalizationForm, Container>

[ ⣿⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
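Why normalization matters can be shown with a deliberately tiny sketch: two strings that render the same but hold different code points only compare equal once both are brought into the same form. Everything below is an assumption for illustration: the two-entry table and the `toy_nfd` and `toy_canonically_equal` helpers are hypothetical, and real NFD as specified in UAX #15 also needs the full Unicode data, recursive decomposition, canonical reordering, and Hangul handling.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy canonical decomposition table (ASSUMPTION: two entries only,
// nowhere near the real Unicode Character Database).
static const std::unordered_map<char32_t, std::u32string>& toy_decompositions() {
    static const std::unordered_map<char32_t, std::u32string> table = {
        { U'\u00E9', U"e\u0301" }, // e-acute -> e + COMBINING ACUTE ACCENT
        { U'\u00F1', U"n\u0303" }, // n-tilde -> n + COMBINING TILDE
    };
    return table;
}

// Hypothetical helper: decomposes using only the toy table above.
std::u32string toy_nfd(const std::u32string& input) {
    std::u32string output;
    for (char32_t code_point : input) {
        auto it = toy_decompositions().find(code_point);
        if (it != toy_decompositions().end()) {
            output += it->second; // replace with its decomposition
        } else {
            output += code_point; // no entry in the toy table: keep as-is
        }
    }
    return output;
}

// Two strings are treated as equivalent when their toy decompositions match.
bool toy_canonically_equal(const std::u32string& a, const std::u32string& b) {
    return toy_nfd(a) == toy_nfd(b);
}
```

With this, the precomposed `U"caf\u00E9"` and the decomposed `U"cafe\u0301"` differ under `==` but are reported as equivalent by `toy_canonically_equal`.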

Ⅴ: CJK Encoding Tests [ 0% ]

  • Implement gb18030, the official government Unicode Transformation Format encoding of the PRC.
  • Implement the legacy shift_jis/euc_jp/iso2022_jp encodings.
    • Priority goes to shift_jis/euc_jp, as they encode more traffic.

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅵ: (Stretch) Enhanced Execution Encoding [ 0% ]

  • Reach into platform-specific functions to rip out the guts of the platform’s current encoding to ensure preservation of Unicode in:
    • narrow_execution
    • wide_execution

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅶ: (Stretch) Hyper-Scrutinized Vectorization Implementation [ 0% ]

  • Apply vectorization techniques to conversions between pairs of encodings among ascii, utf8, utf16, and utf32.

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
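One common starting point for this kind of work is an ASCII fast path that inspects several bytes per step. The sketch below is not the planned implementation and uses no real SIMD intrinsics; it is a plain "SIMD within a register" trick on a 64-bit integer, and `ascii_prefix_to_utf32` is a hypothetical name chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: widens the leading ASCII run of `input` to UTF-32.
// Returns the number of bytes consumed and stops at the first non-ASCII
// byte, where a fully general decoder (not shown) would take over.
std::size_t ascii_prefix_to_utf32(const char* input, std::size_t size, std::u32string& output) {
    constexpr std::uint64_t high_bits = 0x8080808080808080ULL;
    std::size_t i = 0;
    // Fast path: test 8 bytes at once for any byte with its high bit set.
    while (size - i >= 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, input + i, sizeof(chunk)); // safe unaligned load
        if ((chunk & high_bits) != 0) {
            break; // at least one byte in this chunk is >= 0x80
        }
        for (std::size_t j = 0; j < 8; ++j) {
            output.push_back(static_cast<char32_t>(static_cast<unsigned char>(input[i + j])));
        }
        i += 8;
    }
    // Slow path: one byte at a time, until the end or a non-ASCII byte.
    while (i < size && static_cast<unsigned char>(input[i]) < 0x80) {
        output.push_back(static_cast<char32_t>(static_cast<unsigned char>(input[i])));
        ++i;
    }
    return i;
}
```

The mask has the same value in every byte, so the check gives the same answer on little-endian and big-endian machines.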

Ⅷ: (Stretch) C Library for Span-Based Conversions [ 0% ]

  • As detailed in proposal N2440: C functions for fast conversions.
  • Cover INCITS/ANSI fees.
  • Take the functionality through all of WG14 and put it into C libraries such as:
    • musl.
    • glibc.
    • Potentially: new LLVM libc.

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
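The general shape of a span-based conversion can be sketched as follows: every buffer is a pointer plus a count, and both are advanced in place so the caller can resume after running out of room. This is only an illustration written in C-style C++. The names `toy_status` and `toy_utf8_to_utf32_n` are hypothetical, and the actual function names and signatures from N2440 are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result codes for this sketch.
enum class toy_status { ok, invalid_input, incomplete_input, insufficient_output };

// Hypothetical span-style decoder: UTF-8 bytes in, UTF-32 code points out.
// Pointers and sizes only move past sequences that were fully converted,
// so on any non-ok return they describe exactly where to resume.
toy_status toy_utf8_to_utf32_n(const unsigned char** in, std::size_t* in_size,
                               char32_t** out, std::size_t* out_size) {
    while (*in_size > 0) {
        const unsigned char* p = *in;
        std::size_t length;
        char32_t code_point;
        char32_t minimum; // smallest value allowed for this length (rejects overlongs)
        if (p[0] < 0x80) { length = 1; code_point = p[0]; minimum = 0; }
        else if ((p[0] & 0xE0) == 0xC0) { length = 2; code_point = p[0] & 0x1F; minimum = 0x80; }
        else if ((p[0] & 0xF0) == 0xE0) { length = 3; code_point = p[0] & 0x0F; minimum = 0x800; }
        else if ((p[0] & 0xF8) == 0xF0) { length = 4; code_point = p[0] & 0x07; minimum = 0x10000; }
        else { return toy_status::invalid_input; }
        if (*in_size < length) {
            return toy_status::incomplete_input; // sequence is cut off
        }
        for (std::size_t i = 1; i < length; ++i) {
            if ((p[i] & 0xC0) != 0x80) {
                return toy_status::invalid_input; // not a continuation byte
            }
            code_point = (code_point << 6) | (p[i] & 0x3F);
        }
        if (code_point < minimum || code_point > 0x10FFFF
            || (code_point >= 0xD800 && code_point <= 0xDFFF)) {
            return toy_status::invalid_input; // overlong, too large, or surrogate
        }
        if (*out_size == 0) {
            return toy_status::insufficient_output; // caller can supply more room
        }
        **out = code_point;
        ++*out;
        --*out_size;
        *in += length;
        *in_size -= length;
    }
    return toy_status::ok;
}
```

A caller with a small fixed buffer can loop: convert, drain the buffer on `insufficient_output`, reset the output pointer and size, and call again with the same input pointer and size.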

Ⅸ: (Stretch) WHATWG Encoding Functionality [ 0% ]

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]

Ⅹ: (Stretch) Strong Exception Guarantee [ 0% ]

  • std::text<Encoding, NormalizationForm, Container>: strong exception guarantee on all applicable operations.
  • noexcept container support for std::text and std::text_view
    • noexcept allocator support.
    • Container operations are made conditionally noexcept if possible based on the allocator and movability of the inserted types.

[ ⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀ ]
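As a rough illustration of what these guarantees mean in practice, the sketch below uses a hypothetical `toy_text` wrapper (not the proposed std::text) whose `append` does all of its throwing work on a copy and commits with a swap. It also shows noexcept-ness following from the underlying container. The validation rule and the copy-then-swap approach are assumptions chosen for the example, not a description of the planned design.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a text type; NOT the proposed std::text.
template <typename Container = std::u32string>
class toy_text {
public:
    // Appends code points, validating each one.
    // Strong guarantee: if anything throws, *this is left exactly as it was.
    void append(const std::u32string& code_points) {
        Container scratch = storage_; // may throw; *this untouched
        for (char32_t code_point : code_points) {
            if (code_point > 0x10FFFF || (code_point >= 0xD800 && code_point <= 0xDFFF)) {
                throw std::invalid_argument("not a Unicode scalar value");
            }
            scratch.push_back(code_point); // may throw; *this untouched
        }
        using std::swap;
        swap(storage_, scratch); // commit step; does not throw for std::u32string
    }

    const Container& storage() const noexcept { return storage_; }

private:
    Container storage_;
};

// The implicit move constructor inherits noexcept-ness from the container.
static_assert(std::is_nothrow_move_constructible<toy_text<>>::value,
              "toy_text<std::u32string> should be nothrow-movable");
```

The price of this simple approach is a full copy per `append`; a real implementation would likely look for cheaper ways to roll back.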