
Semantically detecting blocks of text with Python

桃花长相依 2021-10-19 15:09:30
I have this sample block of log text:

20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 10:05,000 number3 string1 string2 string3
20190122 10:33,960 number4 string1 string2 string3
20190122 11:00,321 number5 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 08:24,670 number2 string1 string2 string3
20190123 09:05,000 number3 string1 string2 string3
20190123 10:33,960 number4 string1 string2 string3
20190123 10:00,321 number5 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
20190124 10:00,000 ###PERFORMANCE string1 string2 string3
20190124 10:10,500 number1 string1 string2 string3
20190124 10:24,670 number2 string1 string2 string3
20190124 11:05,000 number3 string1 string2 string3
20190124 12:33,960 number4 string1 string2 string3
20190124 13:00,321 number5 string1 string2 string3
20190124 13:40,256 ###PERFORMANCE string1 string2 string3

What I want to do with Python is detect each ###PERFORMANCE block of text in this example.

As you can see, there are 3 blocks of interest, each delimited by lines containing the ###PERFORMANCE string. The first one starts at line 1 and ends at line 7. What lies between line 7 and line 10 must not be treated as a block of interest. The number of lines in each block can also vary (so going by line number is not a good idea).

So far, all I have done is read the text file line by line:

logFile = "testLog.txt"
with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
for line in content:
    print(line)

How can I accomplish this task? Is using NLTK a good idea? Is it even suited to this task? Any general advice?

2 Answers

一只萌萌小番薯


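I think you can do what you need with a simple check. Let me explain, assuming I have understood the question correctly. You can keep a flag (a True/False value) that records whether you are currently inside an interesting block. Every time you find "###PERFORMANCE", you flip this flag. You can then save the two kinds of lines in two lists, or in any structure you prefer.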


Here is the code snippet:


logFile = "logfile.txt"

with open(logFile) as f:
    # strip whitespace such as the trailing `\n` from each line
    content = [line.strip() for line in f]

# flag: True while we are between an opening and a closing ###PERFORMANCE line
are_we_in_the_interesting_block = False

# two lists to save the lines
interesting_block = []
non_interesting_block = []

for line in content:
    # `in` is True wherever the marker appears, including at index 0
    if '###PERFORMANCE' in line:
        # every marker line toggles the flag and is not stored itself
        are_we_in_the_interesting_block = not are_we_in_the_interesting_block
    elif are_we_in_the_interesting_block:
        # here I append to a list, but you can do your processing
        interesting_block.append(line)
    else:
        # here processing of the non interesting parts
        non_interesting_block.append(line)

print('Interesting blocks')
print(interesting_block)

print('\n')
print('Non interesting blocks')
print(non_interesting_block)


The resulting output will be:


Interesting blocks

['20190122 09:10,500 number1 string1 string2 string3', '20190122 09:24,670 number2 string1 string2 string3', '20190122 10:05,000 number3 string1 string2 string3', '20190122 10:33,960 number4 string1 string2 string3', '20190122 11:00,321 number5 string1 string2 string3', '20190123 08:10,500 number1 string1 string2 string3', '20190123 08:24,670 number2 string1 string2 string3', '20190123 09:05,000 number3 string1 string2 string3', '20190123 10:33,960 number4 string1 string2 string3', '20190123 10:00,321 number5 string1 string2 string3', '20190124 10:10,500 number1 string1 string2 string3', '20190124 10:24,670 number2 string1 string2 string3', '20190124 11:05,000 number3 string1 string2 string3', '20190124 12:33,960 number4 string1 string2 string3', '20190124 13:00,321 number5 string1 string2 string3']



Non interesting blocks

['20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2', '20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2']

Then, if needed, you can access interesting_block[n] to get the line at index n.
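The snippet above collects the lines of all three blocks into one flat list. If each block needs to be kept apart, the same flag idea can be extended to build one list per block. The following is only a minimal sketch of that variation: the function name `split_performance_blocks` and the shortened `sample` list are illustrative assumptions, not part of the original answer.

```python
def split_performance_blocks(lines, marker="###PERFORMANCE"):
    """Group the lines found between each opening/closing marker pair.

    Hypothetical helper: marker lines themselves are not stored, and a
    block that is opened but never closed is silently dropped.
    """
    blocks = []      # one inner list per completed block
    current = None   # None while we are outside a block
    for line in lines:
        if marker in line:
            if current is None:
                current = []            # opening marker: start a new block
            else:
                blocks.append(current)  # closing marker: store the block
                current = None
        elif current is not None:
            current.append(line)        # ordinary line inside a block
    return blocks


# A shortened version of the log from the question.
sample = [
    "20190122 09:00,000 ###PERFORMANCE string1 string2 string3",
    "20190122 09:10,500 number1 string1 string2 string3",
    "20190122 11:40,256 ###PERFORMANCE string1 string2 string3",
    "20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2",
    "20190123 08:00,000 ###PERFORMANCE string1 string2 string3",
    "20190123 08:10,500 number1 string1 string2 string3",
    "20190123 08:24,670 number2 string1 string2 string3",
    "20190123 13:40,256 ###PERFORMANCE string1 string2 string3",
]

blocks = split_performance_blocks(sample)
for number, block in enumerate(blocks, start=1):
    print("Block", number, "has", len(block), "line(s)")
```

With this layout, blocks[0] is the whole first block and blocks[0][n] is a single line inside it, so no line numbers from the file are needed.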


2021-10-19
慕后森


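Since you are only matching on the PERFORMANCE delimiter, using NLTK seems like overkill. A simple approach is to use a plain match (is the expected string in the line?) and then toggle your capture mode based on it. For example: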


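in_block = False
IDENTIFIER = 'PERFORMANCE'
logfile = "testLog.txt"  # file name taken from the question

with open(logfile) as f:
    for line in f:
        if IDENTIFIER in line:
            # Toggle the boolean
            in_block = not in_block
        if in_block:
            # rstrip() avoids a doubled newline; note that the opening
            # delimiter line is printed, the closing one is not
            print(line.rstrip())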
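If the blocks should be handed on for further processing rather than printed, the toggle can be wrapped in a generator that yields one block at a time. This is a rough sketch under a few assumptions: the name `iter_blocks` is made up for illustration, both delimiter lines are kept in each block, and `io.StringIO` stands in for the real log file so that the example runs on its own.

```python
import io


def iter_blocks(lines, identifier="PERFORMANCE"):
    """Yield each block lazily, with both delimiter lines included.

    Hypothetical helper: `lines` can be an open file or any iterable of
    strings. A block that is never closed is not yielded.
    """
    block = None  # None means we are currently outside a block
    for raw in lines:
        line = raw.rstrip("\n")
        if identifier in line:
            if block is None:
                block = [line]       # opening delimiter starts a block
            else:
                block.append(line)   # closing delimiter ends it
                yield block
                block = None
        elif block is not None:
            block.append(line)       # ordinary line inside a block


# Four lines from the question's log, used in place of a file on disk.
log = io.StringIO(
    "20190122 09:00,000 ###PERFORMANCE string1 string2 string3\n"
    "20190122 09:10,500 number1 string1 string2 string3\n"
    "20190122 11:40,256 ###PERFORMANCE string1 string2 string3\n"
    "20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2\n"
)

result = list(iter_blocks(log))
for block in result:
    print(len(block), "lines in this block")
```

Because the generator only holds the current block in memory, the same function could be given an open file object directly, which may matter for large logs.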


2021-10-19