I need to identify which natural language my input belongs to.
The goal is to distinguish between Arabic and English words in a mixed input, where the input is Unicode and is extracted from XML text nodes.
I have noticed the class Character.UnicodeBlock. Is it related to my problem? How can I get it to work?
The Character.UnicodeBlock approach was useful for Arabic, but apparently it does not work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printable characters as well as letters.
So now I am using the matches() method of the String object with the regex "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer/faster way.
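One possible way to speed up the regex check, offered as a sketch rather than a definitive answer: String.matches() compiles the pattern on every call, so compiling it once and reusing it avoids that cost, and a plain character loop avoids the regex engine altogether. The class name AsciiLetterCheck and both method names are made up for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class AsciiLetterCheck {
    // Compiled once and reused, instead of on every String.matches() call.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");

    // Same result as word.matches("[A-Za-z]+"), without recompiling the pattern.
    static boolean matchesPrecompiled(String word) {
        return ASCII_LETTERS.matcher(word).matches();
    }

    // Regex-free variant: true only for a non-empty string of A-Z / a-z.
    static boolean isAsciiLetters(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (!letter) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesPrecompiled("Hello"));
        System.out.println(isAsciiLetters("Hello!"));
    }
}
```

Both methods reject the empty string, digits, and punctuation, which matches the behaviour of the original "[A-Za-z]+" check.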
Yes, you can simply use Character.UnicodeBlock.of(char), which returns the Unicode block that contains the given character.
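A minimal sketch of how that might look for the mixed Arabic/English case, under a few assumptions: words are separated by whitespace, a word counts as Arabic only if every code point falls in one of three Arabic blocks, and a word counts as English only if every code point is a BASIC_LATIN letter. The class BlockClassifier, the Lang enum, and the method names are hypothetical, and the set of Arabic blocks checked here is a choice made for this example, not a complete list.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical classifier; names and the chosen block list are assumptions.
class BlockClassifier {
    enum Lang { ARABIC, ENGLISH, OTHER }

    // True if the code point lies in one of the Arabic blocks checked here.
    static boolean isArabic(int cp) {
        UnicodeBlock b = UnicodeBlock.of(cp); // may be null for unassigned ranges
        return b == UnicodeBlock.ARABIC
            || b == UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
            || b == UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // BASIC_LATIN also holds digits, symbols, and control characters,
    // so the block test is combined with Character.isLetter.
    static boolean isEnglishLetter(int cp) {
        return UnicodeBlock.of(cp) == UnicodeBlock.BASIC_LATIN
            && Character.isLetter(cp);
    }

    // Classifies one word; anything mixed, empty, or neither is OTHER.
    static Lang classify(String word) {
        int total = 0, arabic = 0, english = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (isArabic(cp)) arabic++;
            if (isEnglishLetter(cp)) english++;
            total++;
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        if (total == 0) return Lang.OTHER;
        if (arabic == total) return Lang.ARABIC;
        if (english == total) return Lang.ENGLISH;
        return Lang.OTHER;
    }

    public static void main(String[] args) {
        // The escaped word is the Arabic greeting "marhaban".
        String input = "hello \u0645\u0631\u062D\u0628\u0627 123";
        for (String word : input.trim().split("\\s+")) {
            System.out.println(word + " -> " + classify(word));
        }
    }
}
```

Treating mixed or punctuated words as OTHER is a deliberately strict choice; text taken from XML nodes may need punctuation stripped first, or a majority rule instead of the all-or-nothing test used here.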