                Replacing invalid UTF-8 characters with question marks: mbstring.substitute_character seems to be ignored

                Date: 2023-10-03
                This article explains how to replace invalid UTF-8 characters with question marks when mbstring.substitute_character seems to be ignored.

                Problem description

                I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).

                So far I have this solution, but invalid characters are removed instead of being replaced by '?'.

                function replace_invalid_utf8($str)
                {
                  return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
                }
                
                echo mb_substitute_character()."\n";

                echo replace_invalid_utf8('éééaaaàààeeé')."\n";
                echo replace_invalid_utf8('eeeaaaaaaeeé')."\n";
                

                Expected output:

                63 // ASCII code for '?' character
                ???aaa???eé // or ??aa??eé
                eeeaaaaaaeeé
                

                Actual output:

                63
                aaaee // removed invalid characters
                eeeaaaaaaeeé
                

                Any suggestions?

                Would you do it another way (using preg_replace(), for example)?

                Thanks.

                Recommended answer

                You can use mb_convert_encoding(), or the ENT_SUBSTITUTE option of htmlspecialchars() (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter (available since PHP 5.5).

                The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

                When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).

                // REPLACEMENT CHARACTER (U+FFFD)
                mb_substitute_character(0xFFFD);
                
                function replace_invalid_byte_sequence($str)
                {
                    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
                }
                
                function replace_invalid_byte_sequence2($str)
                {
                    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
                }
                

                UConverter offers both a procedural and an object-oriented API.

                function replace_invalid_byte_sequence3($str)
                {
                    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
                }
                
                function replace_invalid_byte_sequence4($str)
                {
                    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
                }
                

                When using preg_match(), you need to pay attention to the byte ranges to avoid the UTF-8 non-shortest-form vulnerability. The range of valid trail bytes depends on the lead byte.

                lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
                trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)
                

                You can refer to the following resources to check the byte ranges.

                1. "Syntax of UTF-8 Byte Sequences" in RFC 3629
                2. "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
                3. "Multilingual form encoding" in W3C Internationalization

                The byte range table is as follows.

                      Code Points    First Byte Second Byte Third Byte Fourth Byte
                  U+0000 -   U+007F   00 - 7F
                  U+0080 -   U+07FF   C2 - DF    80 - BF
                  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
                  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
                  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
                  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
                 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
                 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
                U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF
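
The table above can be read as a lookup from the lead byte to the range allowed for the second byte. The following is a minimal sketch of that lookup, not part of the original answer; the function name utf8_second_byte_range() is hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical helper: returns [min, max] for the second byte that may
// follow the given lead byte (as an integer), or null when the byte is
// ASCII or can never start a multi-byte UTF-8 sequence.
function utf8_second_byte_range($lead)
{
    // Lead bytes whose second byte is restricted (see the table rows
    // for E0, ED, F0 and F4).
    switch ($lead) {
        case 0xE0: return [0xA0, 0xBF]; // excludes overlong 3-byte forms
        case 0xED: return [0x80, 0x9F]; // excludes surrogates
        case 0xF0: return [0x90, 0xBF]; // excludes overlong 4-byte forms
        case 0xF4: return [0x80, 0x8F]; // excludes values above U+10FFFF
    }

    // All remaining valid lead bytes: C2-DF, E1-EC, EE-EF, F1-F3.
    if ($lead >= 0xC2 && $lead <= 0xF3) {
        return [0x80, 0xBF];
    }

    // 00-7F (ASCII), 80-C1 and F5-FF do not start a multi-byte sequence.
    return null;
}

// Usage: print the allowed second-byte range for lead byte 0xED.
list($min, $max) = utf8_second_byte_range(ord("\xED"));
printf("%02X - %02X\n", $min, $max);
```

Third and fourth bytes are always in the range 80 - BF, so only the second byte needs this per-lead-byte treatment.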
                

                How to replace an invalid byte sequence without breaking valid characters is shown in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in The Unicode Standard.

                The Unicode Standard shows an example:

                before: <61    F1 80 80  E1 80  C2    62    80    63    80    BF    64  >
                after:  <0061  FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>
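
The before/after rows can be written down as PHP byte strings, which makes it possible to check any replacement function against the example. This is a minimal sketch rather than part of the original answer; the helper names unicode_example(), is_valid_utf8() and follows_unicode_example() are hypothetical, and the validity check relies on preg_match() with the u modifier returning false for ill-formed UTF-8.

```php
<?php
// Hypothetical helper: the "before" and "after" rows of the example
// as raw byte strings.
function unicode_example()
{
    $fffd = "\xEF\xBF\xBD"; // U+FFFD encoded as UTF-8

    $before = "\x61" . "\xF1\x80\x80" . "\xE1\x80" . "\xC2" . "\x62"
            . "\x80" . "\x63" . "\x80" . "\xBF" . "\x64";

    $after = "a" . $fffd . $fffd . $fffd . "b"
           . $fffd . "c" . $fffd . $fffd . "d";

    return [$before, $after];
}

// Hypothetical helper: true when $str is well-formed UTF-8.
function is_valid_utf8($str)
{
    return preg_match('//u', $str) === 1;
}

// Hypothetical helper: true when $replacer turns the "before" bytes
// into exactly the "after" bytes of the example.
function follows_unicode_example($replacer)
{
    list($before, $after) = unicode_example();

    return call_user_func($replacer, $before) === $after;
}

// Usage: the identity function leaves the invalid bytes in place,
// so it does not reproduce the example.
var_dump(follows_unicode_example(function ($str) {
    return $str;
}));
```

Passing the name of a replacement function as $replacer shows whether it reproduces the example byte for byte.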
                

                Here is an implementation using preg_replace_callback() that follows the above rule.

                function replace_invalid_byte_sequence5($str)
                {
                    // REPLACEMENT CHARACTER (U+FFFD)
                    $substitute = "\xEF\xBF\xBD";
                    $regex = '/
                      ([\x00-\x7F]                          #   U+0000 -   U+007F
                      |[\xC2-\xDF][\x80-\xBF]               #   U+0080 -   U+07FF
                      | \xE0[\xA0-\xBF][\x80-\xBF]          #   U+0800 -   U+0FFF
                      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    #   U+1000 -   U+CFFF, U+E000 - U+FFFF
                      | \xED[\x80-\x9F][\x80-\xBF]          #   U+D000 -   U+D7FF
                      | \xF0[\x90-\xBF][\x80-\xBF]{2}       #  U+10000 -  U+3FFFF
                      |[\xF1-\xF3][\x80-\xBF]{3}            #  U+40000 -  U+FFFFF
                      | \xF4[\x80-\x8F][\x80-\xBF]{2})      # U+100000 - U+10FFFF
                      |(\xE0[\xA0-\xBF]                     #   U+0800 -   U+0FFF (invalid)
                      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]       #   U+1000 -   U+CFFF, U+E000 - U+FFFF (invalid)
                      | \xED[\x80-\x9F]                     #   U+D000 -   U+D7FF (invalid)
                      | \xF0[\x90-\xBF][\x80-\xBF]?         #  U+10000 -  U+3FFFF (invalid)
                      |[\xF1-\xF3][\x80-\xBF]{1,2}          #  U+40000 -  U+FFFFF (invalid)
                      | \xF4[\x80-\x8F][\x80-\xBF]?)        # U+100000 - U+10FFFF (invalid)
                      |(.)                                  # invalid 1-byte
                    /xs';
                
                    // $matches[1]: valid character
                    // $matches[2]: invalid 3-byte or 4-byte character
                    // $matches[3]: invalid 1-byte
                
                    $ret = preg_replace_callback($regex, function($matches) use($substitute) {
                
                        if (isset($matches[2]) || isset($matches[3])) {
                
                            return $substitute;
                
                        }
                    
                        return $matches[1];
                
                    }, $str);
                
                    return $ret;
                }
                

                This way you can compare bytes directly and avoid preg_match's restriction on byte size.

                function replace_invalid_byte_sequence6($str) {
                
                    $size = strlen($str);
                    $substitute = "\xEF\xBF\xBD";
                    $ret = '';
                
                    // Filled in by reference on each call to utf8_get_next_char()
                    $pos = 0;
                    $char = null;
                    $char_size = null;
                    $valid = null;
                
                    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
                        $ret .= $valid ? $char : $substitute;
                    }
                
                    return $ret;
                }
                
                function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
                {
                    $valid = false;
                
                    if ($str_size <= $pos) {
                        return false;
                    }
                
                    if ($str[$pos] < "\x80") {
                        // ASCII
                        $valid = true;
                        $char_size = 1;
                    } else if ($str[$pos] < "\xC2") {
                        // Stray trail byte (0x80 - 0xBF) or overlong lead byte (0xC0, 0xC1)
                        $char_size = 1;
                    } else if ($str[$pos] < "\xE0") {
                        // 2-byte sequence
                        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
                            $char_size = 1;
                        } else {
                            $valid = true;
                            $char_size = 2;
                        }
                    } else if ($str[$pos] < "\xF0") {
                        // 3-byte sequence: the second-byte range depends on the lead byte
                        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
                        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";
                
                        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
                            $char_size = 1;
                        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
                            $char_size = 2;
                        } else {
                            $valid = true;
                            $char_size = 3;
                        }
                    } else if ($str[$pos] < "\xF5") {
                        // 4-byte sequence: the second-byte range depends on the lead byte
                        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
                        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";
                
                        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
                            $char_size = 1;
                        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
                            $char_size = 2;
                        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
                            $char_size = 3;
                        } else {
                            $valid = true;
                            $char_size = 4;
                        }
                    } else {
                        // 0xF5 - 0xFF never appear in well-formed UTF-8
                        $char_size = 1;
                    }
                
                    $char = substr($str, $pos, $char_size);
                    $pos += $char_size;
                
                    return true;
                }
                

                The test cases are here.

                function run(array $callables, array $arguments)
                {
                    return array_map(function($callable) use($arguments) {
                         return array_map($callable, $arguments);
                    }, $callables);
                }
                    
                $data = [
                    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
                    // (http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf)
                    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
                    ."\x80"."\xBF"."\x64",
                
                    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
                    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
                ];
                
                var_dump(run([
                    'replace_invalid_byte_sequence', 
                    'replace_invalid_byte_sequence2',
                    'replace_invalid_byte_sequence3',
                    'replace_invalid_byte_sequence4',
                    'replace_invalid_byte_sequence5',
                    'replace_invalid_byte_sequence6'
                ], $data));
                

                As a note, mb_convert_encoding has a bug: it breaks a valid character just after an invalid byte sequence, or removes an invalid byte sequence after valid characters without adding U+FFFD.

                $data = [
                    // U+20AC
                    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
                    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",
                
                    // U+24B62
                    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
                    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
                    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
                
                    // 'FULL MOON SYMBOL' (U+1F315)
                    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
                    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
                ];
                

                Although preg_match() can be used instead of preg_replace_callback, that function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.

                str_repeat('a', 10000)
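
One way to run that test case is to call preg_match() on the long string and inspect preg_last_error() afterwards. This is a rough sketch only: the helper name try_preg_match() and the pattern are assumptions made for illustration and are not taken from the bug report, and whether the call fails depends on the PHP and PCRE versions and settings in use.

```php
<?php
// Hypothetical helper: runs preg_match() and reports both the return
// value and the PCRE error code, so a failure is not mistaken for
// "no match".
function try_preg_match($pattern, $subject)
{
    $result = preg_match($pattern, $subject);

    return ['result' => $result, 'error' => preg_last_error()];
}

// Assumed pattern: a repeated group over 1-byte and 2-byte sequences.
$pattern = '/^(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF])*$/';
$outcome = try_preg_match($pattern, str_repeat('a', 10000));

if ($outcome['result'] === false) {
    echo 'preg_match() failed, error code: ', $outcome['error'], PHP_EOL;
} else {
    echo 'preg_match() returned: ', $outcome['result'], PHP_EOL;
}
```

A return value of false together with a non-zero error code means the engine gave up rather than that the string failed to match.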
                

                Finally, the results of my benchmark are as follows.

                mb_convert_encoding()
                0.19628190994263
                htmlspecialchars()
                0.082863092422485
                UConverter::transcode()
                0.15999984741211
                UConverter::convert()
                0.29843020439148
                preg_replace_callback()
                0.63967490196228
                direct comparision
                0.71933102607727
                

                The benchmark code is here.

                function timer(array $callables, array $arguments, $repeat = 10000)
                {
                
                    $ret = [];
                    $save = $repeat;
                
                    foreach ($callables as $key => $callable) {
                
                        $start = microtime(true);
                
                        do {
                    
                            array_map($callable, $arguments);
                
                        } while($repeat -= 1);
                
                        $stop = microtime(true);
                        $ret[$key] = $stop - $start;
                        $repeat = $save;
                
                    }
                
                    return $ret;
                }
                
                $functions = [
                    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
                    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
                    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
                    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
                    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
                    'direct comparision' => 'replace_invalid_byte_sequence6'
                ];
                
                foreach (timer($functions, $data) as $description => $time) {
                
                    echo $description, PHP_EOL,
                         $time, PHP_EOL;
                
                }
                
