How can I find parentheses in a string independently of the locale?

胡子哥哥 2023-05-10 13:22:02
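I need to find the first complete pair of parentheses in a Java String and, if it is non-nested, return its content. The current problem is that parentheses may be represented by different characters in different locales/languages.

My first idea was of course to use regular expressions. But with something like "\((.*)\)" it seems rather difficult (at least for me) to make sure that the match currently under consideration contains no nested parentheses, and there seems to be no parenthesis-like character class available in the Java matcher.

So I tried to solve the problem more imperatively, but stumbled over the fact that the data I need to process is in different languages, and the parenthesis characters differ depending on the locale. Western: (), Chinese (Locale "zh"): （）

package main;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class FindParentheses {

    static public Set<String> searchNames(final String string) throws IOException {
        final Set<String> foundName = new HashSet<>();
        final BufferedReader stringReader = new BufferedReader(new StringReader(string));
        for (String line = stringReader.readLine(); line != null; line = stringReader.readLine()) {
            final int indexOfFirstOpeningBrace = line.indexOf('(');
            if (indexOfFirstOpeningBrace > -1) {
                final String afterFirstOpeningParenthesis = line.substring(indexOfFirstOpeningBrace + 1);
                final int indexOfNextOpeningParenthesis = afterFirstOpeningParenthesis.indexOf('(');
                final int indexOfNextClosingParenthesis = afterFirstOpeningParenthesis.indexOf(')');
                /*
                 * If the following condition is fulfilled, there is a simple braced expression
                 * after the found product's short name. Otherwise, there may be an additional
                 * nested pair of braces, or the closing brace may be missing, in which cases the
                 * expression is rejected as a product's long name.
                 */
                ...

The second item, the one with Chinese parentheses, is not found, but it should be. Of course I could match these characters as an additional special case, but since my project uses 23 languages, including Korean and Japanese, I would prefer a solution that finds any pair of parentheses.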
3 Answers

阿晨1998

Contributed 2037 experience points, received over 6 likes

You can use the Unicode classes \p{Ps} (Punctuation, Open) and \p{Pe} (Punctuation, Close).

String par_paired_punct = "\\p{Ps}([^\\p{Ps}\\p{Pe}]*)\\p{Pe}";

They match more than just parentheses and brackets, but you can exclude the unwanted characters "manually".
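As a rough illustration of how the pattern above behaves, here is a minimal sketch. The class name PairedPunctuationDemo and the method contents are hypothetical names chosen for this example, and the sample text is made up; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern string is taken from the answer.
class PairedPunctuationDemo {
    // Opening punctuation, then a run with no opening/closing punctuation, then closing punctuation.
    static final Pattern PAIRED = Pattern.compile("\\p{Ps}([^\\p{Ps}\\p{Pe}]*)\\p{Pe}");

    // Returns the text of capture group 1 for every match, in order of appearance.
    static List<String> contents(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PAIRED.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The second name is wrapped in fullwidth parentheses (U+FF08, U+FF09).
        String sample = "shortName1 (LongName 1), shortName2 \uFF08LongName 2\uFF09";
        System.out.println(contents(sample)); // prints [LongName 1, LongName 2]
    }
}
```

Because the inner class excludes all opening and closing punctuation, a nested input such as "(a (b) c)" yields only the innermost pair ("b"); the pattern skips the outer pair rather than reporting it as nested.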


In the Punctuation, Open class, the following characters are not opening brackets or parentheses:

U+0F3A  TIBETAN MARK GUG RTAGS GYON  ༺
U+0F3C  TIBETAN MARK ANG KHANG GYON  ༼
U+169B  OGHAM FEATHER MARK  ᚛
U+201A  SINGLE LOW-9 QUOTATION MARK  ‚
U+201E  DOUBLE LOW-9 QUOTATION MARK  „
U+27C5  LEFT S-SHAPED BAG DELIMITER  ⟅
U+29D8  LEFT WIGGLY FENCE  ⧘
U+29DA  LEFT DOUBLE WIGGLY FENCE  ⧚
U+2E42  DOUBLE LOW-REVERSED-9 QUOTATION MARK  ⹂
U+301D  REVERSED DOUBLE PRIME QUOTATION MARK  〝
U+FD3F  ORNATE RIGHT PARENTHESIS  ﴿

In the Punctuation, Close class, the following characters are not closing brackets or parentheses:

U+0F3B  TIBETAN MARK GUG RTAGS GYAS  ༻
U+0F3D  TIBETAN MARK ANG KHANG GYAS  ༽
U+169C  OGHAM REVERSED FEATHER MARK  ᚜
U+27C6  RIGHT S-SHAPED BAG DELIMITER  ⟆
U+29D9  RIGHT WIGGLY FENCE  ⧙
U+29DB  RIGHT DOUBLE WIGGLY FENCE  ⧛
U+301E  DOUBLE PRIME QUOTATION MARK  〞
U+301F  LOW DOUBLE PRIME QUOTATION MARK  〟
U+FD3E  ORNATE LEFT PARENTHESIS  ﴾

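The regex would then look like this:

// Opening bracket: any Ps character except the non-bracket ones listed above.
String par_rx = "[\\p{Ps}&&[^\\u0F3A\\u0F3C\\u169B\\u201A\\u201E\\u27C5\\u29D8\\u29DA\\u2E42\\u301D\\uFD3F]]" +
                 // Content: anything that is not Ps/Pe, plus the excluded Ps/Pe characters themselves.
                 "((?:[^\\p{Ps}\\p{Pe}]|[\\u0F3A\\u0F3C\\u169B\\u201A\\u201E\\u27C5\\u29D8\\u29DA\\u2E42\\u301D\\uFD3F\\u0F3B\\u0F3D\\u169C\\u27C6\\u29D9\\u29DB\\u301E\\u301F\\uFD3E])*)" +
                 // Closing bracket: any Pe character except the non-bracket ones listed above.
                 "[\\p{Pe}&&[^\\u0F3B\\u0F3D\\u169C\\u27C6\\u29D9\\u29DB\\u301E\\u301F\\uFD3E]]";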

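To see what the exclusions change in practice, the following sketch may help. It assembles the same three-part pattern from two helper strings and assumes the first excluded code point is U+0F3A, the first entry in the Punctuation, Open list. The class name BracketRegexDemo, the constants NOT_OPEN and NOT_CLOSE, and the method firstContent are hypothetical names introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the code points are the ones listed in the answer.
class BracketRegexDemo {
    // Ps characters that the answer lists as not being opening brackets.
    static final String NOT_OPEN =
        "\\u0F3A\\u0F3C\\u169B\\u201A\\u201E\\u27C5\\u29D8\\u29DA\\u2E42\\u301D\\uFD3F";
    // Pe characters that the answer lists as not being closing brackets.
    static final String NOT_CLOSE =
        "\\u0F3B\\u0F3D\\u169C\\u27C6\\u29D9\\u29DB\\u301E\\u301F\\uFD3E";

    static final Pattern BRACKETS = Pattern.compile(
        "[\\p{Ps}&&[^" + NOT_OPEN + "]]"                                  // real opening bracket
        + "((?:[^\\p{Ps}\\p{Pe}]|[" + NOT_OPEN + NOT_CLOSE + "])*)"      // content without brackets
        + "[\\p{Pe}&&[^" + NOT_CLOSE + "]]");                            // real closing bracket

    // Returns the content of the first bracket pair, or null if there is none.
    static String firstContent(String text) {
        Matcher m = BRACKETS.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

With this pattern, a low-9 quotation mark (U+201E, category Ps) inside the parentheses no longer interrupts the match: for the input `(a „b“ c)` the method returns `a „b“ c`, whereas the plain \p{Ps}...\p{Pe} pattern would start matching at the quotation mark instead.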


2023-05-10
哆啦的时光机

Contributed 1779 experience points, received over 6 likes

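I guess you might want to design an expression that looks something like this:

[(（]\s*([^)）]*)\s*[)）]

where the parentheses you want to support go inside these character classes:

[(（]

Test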

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class re{
    public static void main(String[] args){
        // ASCII and fullwidth parentheses sit side by side in each character class.
        final String regex = "[(（]\\s*([^)）]*)\\s*[)）]";
        final String string = "Something meaningful: shortName1 (LongName 1) Localization issue here: shortName2 （保险丝2）. This one should be found, too. Easy again: shortName3 (LongName 3). Yet more random text... Something meaningful: shortName1 (LongName 1) Localization issue here: shortName2 （保险丝2）. This one should be found, too. Easy again: shortName3 (LongName 3). Yet more random text...";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        // Print each match followed by its capture group (the text between the parentheses).
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

Output

Full match: (LongName 1)
Group 1: LongName 1
Full match: （保险丝2）
Group 1: 保险丝2
Full match: (LongName 3)
Group 1: LongName 3
Full match: (LongName 1)
Group 1: LongName 1
Full match: （保险丝2）
Group 1: 保险丝2
Full match: (LongName 3)
Group 1: LongName 3

Another option would be:

(?<=[(（])[^)）]*(?=[)）])

which would output:

Full match: LongName 1
Full match: 保险丝2
Full match: LongName 3
Full match: LongName 1
Full match: 保险丝2
Full match: LongName 3
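Since the question asks only for the first pair, the lookaround variant could be wrapped roughly as follows. This is a sketch under assumptions: the class name LookaroundDemo and the method first are hypothetical, and the fullwidth parentheses are written as the escapes \uFF08 and \uFF09 so the source stays ASCII-only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the lookaround expression from the answer.
class LookaroundDemo {
    // Lookbehind for an opening parenthesis, lookahead for a closing one;
    // the match itself is only the text in between.
    static final Pattern INSIDE =
        Pattern.compile("(?<=[(\\uFF08])[^)\\uFF09]*(?=[)\\uFF09])");

    // Returns the content of the first parenthesized span, if any.
    static Optional<String> first(String text) {
        Matcher m = INSIDE.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

Two limitations are worth keeping in mind. The opening and closing classes are independent, so a mixed pair such as an ASCII "(" closed by a fullwidth "）" is accepted. And the content class only excludes closing parentheses, so for "(a (b) c)" the first match is "a (b" rather than a rejection.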


2023-05-10
翻翻过去那场雪

Contributed 2065 experience points, received over 14 likes

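I used this to enumerate all relevant characters; in my case, all parentheses are of interest.

In addition, I preferred to make sure that only matching parentheses are considered, which led me to this ugly regex in Java: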

public static final String ANY_PARENTHESES = "\\([^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+\\)|⁽[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⁾|₍[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+₎|❨[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+❩|❪[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+❫|⟮[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⟯|⦅[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⦆|⸨[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+⸩|﴾[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+﴿|︵[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+︶|﹙[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+﹚|（[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+）|｟[^\\(⁽₍❨❪⟮⦅⸨﴾︵﹙（｟\\)⁾₎❩❫⟯⦆⸩﴿︶﹚）｠]+｠";

I actually constructed it with the following code:

    // Opening parentheses, one constant per Unicode character.
    public static final char LEFT_PARENTHESIS = '\u0028', // (
        SUPERSCRIPT_LEFT_PARENTHESIS = '\u207D', // ⁽
        SUBSCRIPT_LEFT_PARENTHESIS = '\u208D', // ₍
        MEDIUM_LEFT_PARENTHESIS_ORNAMENT = '\u2768', // ❨
        MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT = '\u276A', // ❪
        MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS = '\u27EE', // ⟮
        LEFT_WHITE_PARENTHESIS = '\u2985', // ⦅
        LEFT_DOUBLE_PARENTHESIS = '\u2E28', // ⸨
        ORNATE_LEFT_PARENTHESIS = '\uFD3E', // ﴾
        PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS = '\uFE35', // ︵
        SMALL_LEFT_PARENTHESIS = '\uFE59', // ﹙
        FULLWIDTH_LEFT_PARENTHESIS = '\uFF08', // （
        FULLWIDTH_LEFT_WHITE_PARENTHESIS = '\uFF5F'; // ｟

    // Closing parentheses, in the same order as the opening ones.
    public static final char RIGHT_PARENTHESIS = '\u0029', // )
        SUPERSCRIPT_RIGHT_PARENTHESIS = '\u207E', // ⁾
        SUBSCRIPT_RIGHT_PARENTHESIS = '\u208E', // ₎
        MEDIUM_RIGHT_PARENTHESIS_ORNAMENT = '\u2769', // ❩
        MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT = '\u276B', // ❫
        MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS = '\u27EF', // ⟯
        RIGHT_WHITE_PARENTHESIS = '\u2986', // ⦆
        RIGHT_DOUBLE_PARENTHESIS = '\u2E29', // ⸩
        ORNATE_RIGHT_PARENTHESIS = '\uFD3F', // ﴿
        PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS = '\uFE36', // ︶
        SMALL_RIGHT_PARENTHESIS = '\uFE5A', // ﹚
        FULLWIDTH_RIGHT_PARENTHESIS = '\uFF09', // ）
        FULLWIDTH_RIGHT_WHITE_PARENTHESIS = '\uFF60'; // ｠


    // One or more characters that are not any of the parentheses above.
    public static final String NO_PARENTHESES = "[^\\" + LEFT_PARENTHESIS + SUPERSCRIPT_LEFT_PARENTHESIS
        + SUBSCRIPT_LEFT_PARENTHESIS + MEDIUM_LEFT_PARENTHESIS_ORNAMENT + MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT
        + MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS + LEFT_WHITE_PARENTHESIS + LEFT_DOUBLE_PARENTHESIS
        + ORNATE_LEFT_PARENTHESIS + PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS + SMALL_LEFT_PARENTHESIS
        + FULLWIDTH_LEFT_PARENTHESIS + FULLWIDTH_LEFT_WHITE_PARENTHESIS + "\\" + RIGHT_PARENTHESIS
        + SUPERSCRIPT_RIGHT_PARENTHESIS + SUBSCRIPT_RIGHT_PARENTHESIS + MEDIUM_RIGHT_PARENTHESIS_ORNAMENT
        + MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT + MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS
        + RIGHT_WHITE_PARENTHESIS + RIGHT_DOUBLE_PARENTHESIS + ORNATE_RIGHT_PARENTHESIS
        + PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS + SMALL_RIGHT_PARENTHESIS + FULLWIDTH_RIGHT_PARENTHESIS
        + FULLWIDTH_RIGHT_WHITE_PARENTHESIS + "]+";

    // One pattern per pair: opener, parenthesis-free content, matching closer.
    public static final String PARENTHESES = "\\" + LEFT_PARENTHESIS + NO_PARENTHESES + "\\" + RIGHT_PARENTHESIS;

    public static final String SUPERSCRIPT_PARENTHESES =
        "" + SUPERSCRIPT_LEFT_PARENTHESIS + NO_PARENTHESES + SUPERSCRIPT_RIGHT_PARENTHESIS;

    public static final String SUBSCRIPT_PARENTHESES =
        "" + SUBSCRIPT_LEFT_PARENTHESIS + NO_PARENTHESES + SUBSCRIPT_RIGHT_PARENTHESIS;

    public static final String MEDIUM_PARENTHESES_ORNAMENT =
        "" + MEDIUM_LEFT_PARENTHESIS_ORNAMENT + NO_PARENTHESES + MEDIUM_RIGHT_PARENTHESIS_ORNAMENT;

    public static final String MEDIUM_FLATTENED_PARENTHESES_ORNAMENT =
        "" + MEDIUM_FLATTENED_LEFT_PARENTHESIS_ORNAMENT + NO_PARENTHESES + MEDIUM_FLATTENED_RIGHT_PARENTHESIS_ORNAMENT;

    public static final String MATHEMATICAL_FLATTENED_PARENTHESES =
        "" + MATHEMATICAL_LEFT_FLATTENED_PARENTHESIS + NO_PARENTHESES + MATHEMATICAL_RIGHT_FLATTENED_PARENTHESIS;

    public static final String WHITE_PARENTHESES =
        "" + LEFT_WHITE_PARENTHESIS + NO_PARENTHESES + RIGHT_WHITE_PARENTHESIS;

    public static final String DOUBLE_PARENTHESES =
        "" + LEFT_DOUBLE_PARENTHESIS + NO_PARENTHESES + RIGHT_DOUBLE_PARENTHESIS;

    public static final String ORNATE_PARENTHESES =
        "" + ORNATE_LEFT_PARENTHESIS + NO_PARENTHESES + ORNATE_RIGHT_PARENTHESIS;

    public static final String PRESENTATION_FORM_FOR_VERTICAL_PARENTHESES =
        "" + PRESENTATION_FORM_FOR_VERTICAL_LEFT_PARENTHESIS + NO_PARENTHESES
        + PRESENTATION_FORM_FOR_VERTICAL_RIGHT_PARENTHESIS;

    public static final String SMALL_PARENTHESES =
        "" + SMALL_LEFT_PARENTHESIS + NO_PARENTHESES + SMALL_RIGHT_PARENTHESIS;

    public static final String FULLWIDTH_PARENTHESES =
        "" + FULLWIDTH_LEFT_PARENTHESIS + NO_PARENTHESES + FULLWIDTH_RIGHT_PARENTHESIS;

    public static final String FULLWIDTH_WHITE_PARENTHESES =
        "" + FULLWIDTH_LEFT_WHITE_PARENTHESIS + NO_PARENTHESES + FULLWIDTH_RIGHT_WHITE_PARENTHESIS;


    // Regex alternation operator used to join the per-pair patterns.
    public static final char XOR = '|';

    public static final String ANY_PARENTHESES = PARENTHESES
        + XOR + SUPERSCRIPT_PARENTHESES
        + XOR + SUBSCRIPT_PARENTHESES
        + XOR + MEDIUM_PARENTHESES_ORNAMENT
        + XOR + MEDIUM_FLATTENED_PARENTHESES_ORNAMENT
        + XOR + MATHEMATICAL_FLATTENED_PARENTHESES
        + XOR + WHITE_PARENTHESES
        + XOR + DOUBLE_PARENTHESES
        + XOR + ORNATE_PARENTHESES
        + XOR + PRESENTATION_FORM_FOR_VERTICAL_PARENTHESES
        + XOR + SMALL_PARENTHESES
        + XOR + FULLWIDTH_PARENTHESES
        + XOR + FULLWIDTH_WHITE_PARENTHESES;

Note, however, that it does not reject nested parentheses.
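If nested input has to be rejected outright, as the question requires, one option is a small imperative scan instead of a regex. The following is only a sketch under assumptions: the class FirstFlatParentheses, the map PAIRS and the method firstFlat are hypothetical names, and the map holds just three illustrative pairs that would need to be extended with the remaining constants above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: returns the content of the first pair only if it is flat.
class FirstFlatParentheses {
    // Illustrative subset of opener -> closer pairs (assumption: extend as needed).
    static final Map<Character, Character> PAIRS = new HashMap<>();
    static {
        PAIRS.put('(', ')');            // ASCII
        PAIRS.put('\uFF08', '\uFF09');  // fullwidth
        PAIRS.put('\uFE59', '\uFE5A');  // small
    }

    static Optional<String> firstFlat(String text) {
        for (int i = 0; i < text.length(); i++) {
            Character closer = PAIRS.get(text.charAt(i));
            if (closer == null) {
                continue; // not an opener, keep looking
            }
            // First opener found: scan until the next parenthesis of any kind.
            for (int j = i + 1; j < text.length(); j++) {
                char c = text.charAt(j);
                if (c == closer) {
                    return Optional.of(text.substring(i + 1, j)); // matching closer
                }
                if (PAIRS.containsKey(c) || PAIRS.containsValue(c)) {
                    return Optional.empty(); // nested opener or mismatched closer
                }
            }
            return Optional.empty(); // opener was never closed
        }
        return Optional.empty(); // no opener at all
    }
}
```

The scan stops at the first opener and decides based on the very next parenthesis character, so the first pair is either returned or rejected; later pairs are deliberately not considered.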


2023-05-10