Comparing n-grams to group duplicate entries

Python

偶然的你 2023-08-08 16:00:15

I am writing a script that treats two lines as duplicates if they share three consecutive words. Suppose my current dataset is:

1 A Course of Pure Mathematics by G. H. Hardy
2 Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin
3 Advanced Programming in the UNIX Environment, 3rd Edition
4 Advanced Selling Strategies: Brian Tracy
5 Advanced Programming in the UNIX(R) Environment
6 Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley
7 Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising
8 Agile Software Development, Principles, Patterns, and Practices
9 A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy
10 Alex’s Adventures in Numberland
11 Advertising Secrets of the Written Word
12 Alex's Adventures in Numberland Paperback by Alex Bellos

Here, 1 and 9 are duplicates because "course pure mathematics" matches. 2 and 8 are duplicates because "agile software development" matches. 3 and 5 are duplicates because "advanced programming unix" matches. And so on.
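The matching rule described above can be sketched as a pairwise check. This is only an illustration, not a full solution: the stop-word list is an abbreviated assumption, and `trigrams` and `is_duplicate` are hypothetical helper names.

```python
import re

# Assumed, abbreviated stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "the", "to"}


def trigrams(title):
    """Return the set of 3-word sequences left after cleaning a title."""
    # Replace punctuation with spaces, lowercase, and split on whitespace.
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    tokens = [w for w in words if w not in STOP_WORDS]
    # Titles with fewer than 3 tokens yield an empty set.
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def is_duplicate(a, b):
    """Two titles count as duplicates if they share at least one trigram."""
    return bool(trigrams(a) & trigrams(b))


print(is_duplicate(
    "A Course of Pure Mathematics by G. H. Hardy",
    "A Course of Pure Mathematics (Cambridge Mathematical Library) "
    "10th Edition by G. H. Hardy",
))  # True: both contain "course pure mathematics"
print(is_duplicate(
    "Advanced Selling Strategies: Brian Tracy",
    "Advanced Programming in the UNIX(R) Environment",
))  # False: only the single word "advanced" is shared
```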


1 Answer

宝慕林4294392


OP here. The solution appears to be:

import re

from nltk.util import ngrams

# Stop words are checked after lowercasing, so lowercase entries are enough.
STOP_WORDS = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by',
              'com', 'for', 'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or',
              'that', 'the', 'this', 'to', 'was'}

# (original line, set of 3-grams) pairs. Keeping them together means a
# skipped line can never shift the titles out of step with their n-grams.
books_with_ngrams = []

with open('UnifiedBookList.txt') as fin:
    for line in fin:
        original = line.rstrip('\n')
        # Replace punctuation with a space, then lowercase.
        cleaned = re.sub(r'[^\w\s]', ' ', original).lower()
        words = cleaned.split()  # split() also collapses repeated spaces
        if len(words) > 2:  # the line must have more than 2 words
            tokens = [w for w in words if w not in STOP_WORDS]  # remove stop words
            trigrams = {' '.join(gram) for gram in ngrams(tokens, 3)}
            books_with_ngrams.append((original, trigrams))

while books_with_ngrams:
    first_title, first_trigrams = books_with_ngrams.pop(0)
    duplicates = []
    remaining = []
    # Build new lists rather than removing items from the list being
    # iterated over, which would silently skip elements.
    for title, trigrams in books_with_ngrams:
        if first_trigrams & trigrams:
            duplicates.append(title)
        else:
            remaining.append((title, trigrams))
    if duplicates:
        print(first_title)
        for title in duplicates:
            print(title)
    books_with_ngrams = remaining
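The answer's loop compares every remaining title against the first title of each group. For longer lists, one alternative is an inverted index from each 3-gram to the first title containing it, merged with a small union-find. This is a swapped-in technique and a sketch under assumptions, not the answer's method: `trigram_set` and `group_duplicates` are hypothetical names, the stop-word list is abbreviated, and the titles are the sample data from the question rather than a file.

```python
import re
from collections import defaultdict

# Assumed, abbreviated stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "by", "from", "how", "in", "of", "on", "the", "to"}


def trigram_set(title):
    """Return the 3-grams of a title after dropping punctuation and stop words."""
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    tokens = [w for w in words if w not in STOP_WORDS]
    return {" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def group_duplicates(titles):
    """Group titles that are linked, directly or indirectly, by a shared 3-gram."""
    parent = list(range(len(titles)))  # union-find forest over title indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # 3-gram -> index of the first title that contains it
    for idx, title in enumerate(titles):
        for gram in trigram_set(title):
            if gram in first_seen:
                # Attach this title's group to the earlier title's group.
                parent[find(idx)] = find(first_seen[gram])
            else:
                first_seen[gram] = idx

    groups = defaultdict(list)
    for idx, title in enumerate(titles):
        groups[find(idx)].append(title)
    return [group for group in groups.values() if len(group) > 1]


titles = [
    "A Course of Pure Mathematics by G. H. Hardy",
    "Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin",
    "Advanced Programming in the UNIX Environment, 3rd Edition",
    "Advanced Selling Strategies: Brian Tracy",
    "Advanced Programming in the UNIX(R) Environment",
    "Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley",
    "Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising",
    "Agile Software Development, Principles, Patterns, and Practices",
    "A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy",
    "Alex’s Adventures in Numberland",
    "Advertising Secrets of the Written Word",
    "Alex's Adventures in Numberland Paperback by Alex Bellos",
]

groups = group_duplicates(titles)
for group in groups:
    print("\n".join(group))
    print()
```

Note that this groups transitively: if A matches B and B matches C, all three land in one group even when A and C share no 3-gram. The loop in the answer only compares against the first title of each group, so the two approaches can differ on such chains.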

Answered 2023-08-08
