Comparing n-grams and grouping duplicates

偶然的你 2023-08-08 16:00:15
I am writing a script that treats two lines as duplicates if three consecutive words match between them. Suppose my current data set is:

1. A Course of Pure Mathematics by G. H. Hardy
2. Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin
3. Advanced Programming in the UNIX Environment, 3rd Edition
4. Advanced Selling Strategies: Brian Tracy
5. Advanced Programming in the UNIX(R) Environment
6. Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley
7. Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising
8. Agile Software Development, Principles, Patterns, and Practices
9. A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy
10. Alex’s Adventures in Numberland
11. Advertising Secrets of the Written Word
12. Alex's Adventures in Numberland Paperback by Alex Bellos

Here, 1 and 9 are duplicates because "course pure mathematics" matches. 2 and 8 are duplicates because "agile software development" matches. 3 and 5 are duplicates because "advanced programming unix" matches. And so on ...
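
To make the matching rule concrete, here is a minimal sketch in plain Python. It assumes that case, punctuation, and common stop words are ignored, which is what the "course pure mathematics" example suggests. The stop-word list and the names `trigrams` and `is_duplicate` are illustrative choices, not part of the question.

```python
import re

# Assumed stop-word list, for illustration only; the question does not give one.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "the", "to"}


def trigrams(title):
    """Return the set of 3-word sequences in a title.

    Punctuation is replaced by spaces, the text is lower-cased,
    and stop words are dropped before the sequences are built.
    """
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    tokens = [w for w in words if w not in STOP_WORDS]
    # Titles with fewer than three tokens yield an empty set.
    return {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def is_duplicate(a, b):
    """Two titles count as duplicates when they share at least one trigram."""
    return bool(trigrams(a) & trigrams(b))
```

With these assumptions, titles 3 and 5 from the list share the trigram "advanced programming unix", while titles 3 and 4 share only the single word "advanced" and are not treated as duplicates.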

1 Answer

宝慕林4294392

OP here. The solution seems to be:


import re

from nltk.util import ngrams

# Stop words, lower case only: every line is lower-cased before filtering.
STOP_WORDS = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'com',
              'for', 'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or', 'that',
              'the', 'this', 'to', 'was'}

books = []  # one (original line, set of 3-grams) pair per usable line

with open('UnifiedBookList.txt') as fin:
    for line in fin:
        original = line.strip()
        # Replace punctuation with spaces and convert to lower case.
        cleaned = re.sub(r'[^\w\s]', ' ', original).lower()
        words = cleaned.split()  # split() also collapses repeated whitespace
        if len(words) <= 2:  # skip empty lines and lines with fewer than 3 words
            continue
        tokens = [w for w in words if w not in STOP_WORDS]  # remove stop words
        # Store the original line next to its n-grams so the two cannot get
        # out of step when a short line is skipped.
        books.append((original, {' '.join(g) for g in ngrams(tokens, 3)}))

while books:
    first_line, first_grams = books.pop(0)
    # Every remaining line that shares at least one 3-gram with the first one.
    matches = [b for b in books if first_grams & b[1]]
    if matches:
        print(first_line)
        for match_line, _ in matches:
            print(match_line)
        # Rebuild the list instead of removing items while iterating over it,
        # which would skip the element after each removal.
        books = [b for b in books if not (first_grams & b[1])]
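
The loop above compares each remaining line only with the line popped first, so a title that shares a trigram with a matched title, but not with the first one, ends up outside the group. One possible variation, sketched here and not part of the answer above, indexes trigrams and merges titles with a small union-find structure so that groups are transitive. The stop-word list and the names `trigrams` and `group_duplicates` are assumptions made for this example, and it avoids `nltk` to stay self-contained.

```python
import re
from collections import defaultdict

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "the", "to"}


def trigrams(title):
    """Set of 3-word sequences after dropping punctuation, case and stop words."""
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    tokens = [w for w in words if w not in STOP_WORDS]
    return {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def group_duplicates(titles):
    """Return groups (lists of titles) linked directly or indirectly by a shared trigram."""
    parent = list(range(len(titles)))  # union-find forest over title indices

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # trigram -> index of the first title that contained it
    for i, title in enumerate(titles):
        for gram in trigrams(title):
            if gram in first_seen:
                parent[find(i)] = find(first_seen[gram])  # merge the two groups
            else:
                first_seen[gram] = i

    groups = defaultdict(list)
    for i, title in enumerate(titles):
        groups[find(i)].append(title)
    # Only groups with more than one title are duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Each title is processed once and looked up by trigram, so the work grows with the total number of trigrams instead of the number of title pairs. Whether transitive grouping is wanted depends on the data: it can chain loosely related titles into one large group.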


Answered 2023-08-08