
Using regular expressions to match names, dialogue, and actions in a transcript

有只小跳蛙 2021-09-25 14:37:24
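Given a dialogue string like the one below, I need to find the sentences that belong to each speaker.

text = '''CHRIS: Hello, how are you...
PETER: Great, you? 
PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

For the dialogue above, I would like to return tuples with three elements: the name of the person, the sentence in lower case, and the sentences within brackets. Something like this:

('CHRIS', 'Hello, how are you...', None)
('PETER', 'Great, you?', None)
('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')
('CHRIS', 'Are you ok?', None)
etc...

I am trying to use a regular expression to achieve this. So far I have been able to get the speakers' names with the code below, but I am struggling to identify the sentences between two speakers.

actors = re.findall(r'\w+(?=\s*:[^/])',text)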
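To make the starting point concrete, here is a minimal sketch of how far the asker's name pattern can be pushed on its own: it uses `re.finditer` to get the position of each name and slices the text between consecutive names. It assumes the transcript has one turn per line, as in the sample; `name_pattern` and `turns` are illustrative names, not part of the original question. Bracketed actions are left inside the speech at this stage.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you?\n"
    "PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

# The asker's pattern: a word followed by optional whitespace,
# a colon, and one character that is not a slash.
name_pattern = re.compile(r'\w+(?=\s*:[^/])')

matches = list(name_pattern.finditer(text))
turns = []
for current, following in zip(matches, matches[1:] + [None]):
    # The speech runs from the end of this name to the start of the next one.
    end = following.start() if following else len(text)
    speech = text[current.end():end].lstrip(": ").strip()
    turns.append((current.group(), speech))

for turn in turns:
    print(turn)
```

This only works while no other word in the dialogue is followed by a colon, which is the same assumption the name pattern already makes.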

3 answers

蛊毒传说

Contributed 1895 experience points · received more than 3 likes

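A regular expression is one way to solve this, but you can also treat it as iterating over each token in the text and applying some logic to form groups.

For example, we can first find the groups of names and text: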


from itertools import groupby

def isName(word):
    # Names end with ':'
    return word.endswith(":")

text_split = [
    " ".join(list(g)).rstrip(":")
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']

Next, you can collect pairs of consecutive elements of text_split into tuples:


print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
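As an aside, the same pairing can be written with slicing and `zip`, which some readers find easier to follow than index arithmetic. This is only a sketch of an equivalent form, not what the answer itself used; `text_split` is hard-coded here to the list printed above so the snippet runs on its own.

```python
# The alternating name/speech list produced by the groupby step.
text_split = [
    'CHRIS',
    'Hello, how are you...',
    'PETER',
    'Great, you?',
    'PAM',
    'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
    'CHRIS',
    'Are you ok?',
]

# Even positions hold names, odd positions hold the speech that follows.
pairs = list(zip(text_split[::2], text_split[1::2]))
print(pairs)
```

Like the index-based version, `zip` silently drops a trailing name that has no speech after it.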

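We are almost at the desired output. We only need to deal with the text in square brackets. You can write a simple function for that. (Admittedly, a regex is an option here, but I am deliberately avoiding it in this answer.)

Here is a quick approach I came up with: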


def isClosingBracket(word):
    return word.endswith("]")

def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".")
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]

print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]

Note that using * to unpack the result of processWords into the tuple requires Python 3.5 or later (PEP 448); it is a syntax error in Python 2.
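If the `*` unpacking is a problem, tuple concatenation gives the same result. The sketch below wraps the steps of this answer into one function and builds each tuple with `(name,) + tuple(...)`. The function name `parse_transcript` and the snake_case helper names are my own, and the sample text assumes one turn per line as in the question.

```python
from itertools import groupby


def is_name(word):
    # Speaker names are the tokens that end with ':'
    return word.endswith(":")


def is_closing_bracket(chunk):
    return chunk.endswith("]")


def process_words(words):
    # Returns [dialogue, actions]; actions is None without bracketed text.
    if "[" not in words:
        return [words, None]
    return [
        " ".join(g).replace("]", ".")
        for _, g in groupby(map(str.strip, words.split("[")), is_closing_bracket)
    ]


def parse_transcript(text):
    # Pad closing brackets so each bracketed action ends its own token.
    tokens = text.replace("]", "] ").split()
    chunks = [" ".join(g).rstrip(":") for _, g in groupby(tokens, is_name)]
    return [
        # Tuple concatenation instead of (name, *process_words(speech)).
        (name,) + tuple(process_words(speech))
        for name, speech in zip(chunks[::2], chunks[1::2])
    ]


text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you?\n"
    "PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

for row in parse_transcript(text):
    print(row)
```

This inherits the limits of the original approach: every turn must start with a name token ending in ':', no other token may end with ':', and bracketed actions are expected to follow the spoken text.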


2021-09-25
守候你守候我

Contributed 1802 experience points · received more than 10 likes

You can do this with re.findall:


>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
 ('PETER', ' Great, you? ', ''),
 ('PAM',
  ' He is resting.',
  '[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
 ('CHRIS', ' Are you ok?', '')]

You will have to work out how to remove the square brackets yourself; that cannot be done with a regex while still trying to match everything.
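One possible way to do that clean-up is a second pass over the `re.findall` results. The sketch below is my own addition rather than part of the answer: the helper name `to_tuple` is hypothetical, the sample text assumes one turn per line, and the actions are joined with '. ' to match the format the question asked for.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you?\n"
    "PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

# The pattern from the answer, unchanged.
pattern = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'


def to_tuple(match):
    name, speech, actions = match
    # Second pass: pull out the text inside each [ ... ] pair.
    action_list = re.findall(r'\[([^\]]*)\]', actions)
    joined = '. '.join(action_list) if action_list else None
    return (name, speech.strip(), joined)


result = [to_tuple(m) for m in re.findall(pattern, text)]
for row in result:
    print(row)
```

Turns without bracketed text come back with an empty third group from `re.findall`, which the helper maps to `None`.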


Regex breakdown


\b              # Word boundary
(\S+)           # First capture group, a run of non-whitespace characters
:               # Colon
(               # Second capture group
    [^          # Match anything that is not...
        :       #     a colon
        \[\]    #     or a square bracket
    ]+?         # Non-greedy match
)
\n?             # Optional newline
(               # Third capture group
    \[          # Literal opening bracket
    [^:]+?      # Similar to above - exclude colon from match
    \]          # Literal closing bracket
    \n?         # Optional newline
)?              # Third capture group is optional
(?=             # Lookahead for...
    \b          #     a word boundary, followed by
    \S+         #     one or more non-space chars, and
    :           #     a colon
    |           # Or,
    $           # end of string (no re.MULTILINE flag is used)
)
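The breakdown can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and `#` comments outside character classes. This is a sketch of that idea rather than something from the answer; the name `TURN` is hypothetical, and the sample text assumes one turn per line.

```python
import re

TURN = re.compile(r"""
    \b              # word boundary
    (\S+)           # group 1: speaker name, no whitespace
    :               # colon after the name
    ([^:\[\]]+?)    # group 2: dialogue, lazy, no colons or square brackets
    \n?             # optional newline
    (               # group 3: bracketed actions, optional
        \[
        [^:]+?      # anything except a colon, lazy
        \]
        \n?
    )?
    (?=             # lookahead: stop before
        \b\S+:      #   the next speaker name and its colon
        |
        $           #   or the end of the string
    )
""", re.VERBOSE)

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you?\n"
    "PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

for groups in TURN.findall(text):
    print(groups)
```

The verbose pattern is meant to match exactly what the one-line version matches; only the layout differs.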


2021-09-25