
How to remove JavaScript and other tags with Python... without importing modules

Python

神不在的星期二 2023-09-19 14:15:07

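For the first part of a school project, I'm trying to work out how to remove the JavaScript <script {...}> and </script {...}> tags, along with anything between < and >. However, we can't import any modules (not even Python's built-in ones), because apparently the marker may not have access to them. I tried this:

    text = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
    while text.find("<script") >= 0:
        script_start = text.find("<script")
        script_end = text.find(">", text.find("</script")) + 1
        text = text[:script_start] + text[script_end:]
    while text.find("<") >= 0:
        script2_start = text.find("<")
        script2_end = text.find(">") + 1
        text = text[:script2_start] + text[script2_end:]

This does work for smaller files, but the project involves large text files (the simplified test file we were given is 10.4 MB), so it never finishes and appears to hang. Does anyone have ideas for making it more efficient?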


3 Answers

大话西游666


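You don't need to delete anything. In fact, you never want to modify a string.

Strings are immutable: every time you "modify" a string, you create a new string and throw the old one away. That wastes processor time and memory.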
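To make that cost concrete, here is a minimal sketch (the variable names are made up for illustration and are not from the question). It shows that item assignment on a string fails, and that "removing" a tag by slicing produces a separate string while the original stays as it was.

```python
text = "<hi> hey <bye>"

# Strings cannot be changed in place: item assignment raises TypeError.
try:
    text[0] = "["
    assignment_failed = False
except TypeError:
    assignment_failed = True

# "Removing" the first tag by slicing builds a new string object;
# the remaining characters are copied and the original is unchanged.
end = text.find(">") + 1
stripped = text[:0] + text[end:]

print(assignment_failed)   # True
print(repr(stripped))      # ' hey <bye>'
print(repr(text))          # '<hi> hey <bye>'
```

In the loop from the question, every removed tag triggers such a copy of most of the remaining text, and every `find` call starts scanning from the beginning again, which is a plausible reason it stalls on a 10 MB input.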

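You are working with a file, so process it character by character:

- Remember whether you are inside <...>
- If you are inside, the only character that matters is >, which takes you back outside
- If you are outside and the character is <, you move inside and ignore that character
- If you are outside and the character is not <, write the character to the output (file)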

    # create file
    with open("somefile.txt", "w") as f:
        # raise the multiplier to 10000000 to create something in the megabyte range
        f.write("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

    # open file to read from and file to write to
    with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
        # starting outside
        inside = False
        # we iterate the file line by line
        for line in f:
            # and each line characterwise
            for c in line:
                if not inside and c == "<":
                    inside = True
                elif inside and c != ">":
                    continue
                elif inside and c == ">":
                    inside = False
                elif not inside:
                    # only case to write to out
                    out.write(c)

    print(open("somefile.txt").read() + "\n")
    print(open("otherfile.txt").read())

Output:

hello hello hey tata

If you are not allowed to work on files directly, read the file into a list, which takes 11+ MB of memory:

    data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)
    result = []
    inside = False
    for c in data:
        if inside:
            if c == ">":
                inside = False
            # else ignore c - because we are inside
        elif c == "<":
            inside = True
        else:
            result.append(c)

    print(''.join(result))

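This is still better than repeatedly searching the list for the first occurrence of "<", but it may need up to twice the memory of the source (if the source contains no <..> at all, the list is effectively duplicated).

Working on files is far more memory-efficient than any in-place list modification (which would be a third approach).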
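The answer does not show that third approach. One way it could look is a two-index compaction: kept characters are written back to the front of the same list and the leftover tail is cut off once at the end. The names `strip_tags_in_place` and `write` are invented for this sketch.

```python
def strip_tags_in_place(chars):
    """Remove <...> spans from a list of characters without a second list."""
    write = 0          # next position at which a kept character is stored
    inside = False     # True while between '<' and '>'
    for c in chars:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            # write never runs ahead of the read position, so no
            # unread character is overwritten
            chars[write] = c
            write += 1
    del chars[write:]  # drop the leftover tail in one operation
    return chars


data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n")
print("".join(strip_tags_in_place(data)))
```

This avoids the second list, but the source list of single characters is itself much larger than the file on disk, so streaming from file to file remains the leaner option.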

You also need to deal with some obvious problems. For example,

    <script>
    var i = 10;
    if (i < 5) {
        // some code
    }
    </script>

would leave "code" behind in the output.
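The problem can be reproduced with a small sketch. `strip_tags_naive` is a hypothetical helper that applies the plain inside/outside rule, and the sample string (including the JavaScript call `go()`) is invented for illustration.

```python
def strip_tags_naive(text):
    """Drop everything between '<' and the next '>' (no script awareness)."""
    kept = []
    inside = False
    for c in text:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            kept.append(c)
    return "".join(kept)


sample = "<script>var i = 10; if (i < 5) { go(); }</script>done"
print(repr(strip_tags_naive(sample)))  # 'var i = 10; if (i done'
```

The `<` in `i < 5` is taken for the start of a tag, so the text before it leaks into the output and everything up to the `>` of `</script>` is swallowed.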

The following may handle the simpler corner cases:

    # open file to read from and file to write to
    with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
        # starting outside
        inside = False
        insideJS = False
        jsStart = 0
        # we iterate the file line by line
        for line in f:
            # string manipulation :/ - will remove <script ...> .. </script ..>
            # even over multiple lines - probably missed some corner cases.
            while True:
                if insideJS and "</script" not in line:
                    line = ""
                    break
                if "<script" in line:
                    insideJS = True
                    jsStart = line.index("<script")
                    jsEnd = len(line)
                elif insideJS:
                    jsStart = 0
                if not insideJS:
                    break
                if "</script" in line:
                    jsEnd = line.index(">", line.index("</script", jsStart)) + 1
                    line = line[:jsStart] + line[jsEnd:]
                    insideJS = False
                else:
                    # the script continues on the next line: keep only the text
                    # before it and stop, otherwise that text would be discarded
                    line = line[:jsStart]
                    break
            # and each line characterwise (same as above)
            for c in line:
                if not inside and c == "<":
                    inside = True
                elif inside and c != ">":
                    continue
                elif inside and c == ">":
                    inside = False
                elif not inside:
                    out.write(c)

2023-09-19

偶然的你


Even with two while loops, this is still linear complexity, because the index i only ever moves forward:

    string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
    new_string = ''
    i = 0
    while i < len(string):
        if string[i] == "<":
            # skip ahead to the closing '>'; the bounds check avoids an
            # IndexError when a tag is never closed
            while i < len(string) and string[i] != ">":
                i += 1
        else:
            new_string += string[i]
        i += 1

    print(new_string)

Output:

hello hello hey
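The Python documentation notes that building a string by repeated concatenation can have quadratic cost, so whether the `+=` in the loop above stays linear depends on the interpreter. A hedged variation on the same index-based idea, still without imports, is to let `str.find` jump between tags from a moving start offset and to collect the kept slices in a list. `strip_tags_find` is a name invented for this sketch, and dropping the remainder after an unclosed `<` is an assumption, not something the question specifies.

```python
def strip_tags_find(text):
    """Strip <...> spans using str.find with a moving start offset."""
    pieces = []
    pos = 0
    while True:
        start = text.find("<", pos)      # search only from pos onward
        if start < 0:
            pieces.append(text[pos:])    # no more tags: keep the rest
            break
        pieces.append(text[pos:start])   # keep the text before the tag
        end = text.find(">", start)
        if end < 0:                      # unclosed tag: drop the remainder
            break
        pos = end + 1
    return "".join(pieces)


string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
print(repr(strip_tags_find(string)))  # ' hello  hello  hey '
```

Each character is scanned once by `find` and copied at most once into a slice, and the single `join` at the end assembles the result.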

2023-09-19

呼唤远方


Here is one approach using an FSA (finite-state automaton):

    output = ''
    NORMAL, INSIDE_TAG = range(2)  # available states
    state = NORMAL  # start with normal state
    s = '<script beep beep> hello </script boop doop woop> hello <hi id="someid" class="some class"><a> hey </a><bye>'
    for char in s:
        if char == '<':  # if we encounter '<' we enter the INSIDE_TAG state
            state = INSIDE_TAG
            continue
        elif char == '>':  # we can safely exit the INSIDE_TAG state
            state = NORMAL
            continue
        if state == NORMAL:
            output += char  # add the char to the output only if we are in normal state

    print(output)

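If you need to parse tag semantics, make sure to use a stack (a list can serve as one).

This adds complexity, but with an FSM you can implement reliable checks.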
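As a rough illustration of the stack idea, the sketch below uses a plain list to check that tags are properly nested. `tags_are_balanced` is a hypothetical helper, and treating every tag as one that needs a closing partner is a simplifying assumption (real HTML has void elements such as `<br>`).

```python
def tags_are_balanced(text):
    """Check that every <name ...> has a matching </name>, in order."""
    stack = []                 # a plain list used as a stack of open tag names
    pos = 0
    while True:
        start = text.find("<", pos)
        if start < 0:
            break              # no more tags
        end = text.find(">", start)
        if end < 0:
            return False       # a '<' that is never closed
        body = text[start + 1:end].strip()
        pos = end + 1
        if body.startswith("/"):
            parts = body[1:].split()
            name = parts[0] if parts else ""
            # a closing tag must match the most recently opened tag
            if not stack or stack.pop() != name:
                return False
        elif body:
            stack.append(body.split()[0])  # first word is the tag name
    return not stack           # leftover entries are unclosed tags


print(tags_are_balanced('<a href="x"><b>text</b></a>'))  # True
print(tags_are_balanced("<a><b>text</a></b>"))            # False
```

`list.append` and `list.pop` both work on the end of the list, which is all a stack needs here.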

See the following example:

    output = ''
    (
        NORMAL,
        TAG_ATTRIBUTE,
        INSIDE_JAVASCRIPT,
        EXITING_TAG,
        BEFORE_TAG_OPENING_OR_ENDING,
        TAG_NAME,
        ABOUT_TO_EXIT_JS
    ) = range(7)  # available states
    state = NORMAL  # start with normal state
    tag_name = ''
    s = """
    <script>
    var i = 10;
    if (i < 5) {
        // some code
    }
    </script>
    test string
    <a href="http://google.com"> another string</a>
    </sometag>
    """
    for char in s:
        # print(char, '-', state, ':', tag_name)
        if state == NORMAL:
            if char == '<':
                # reset so a name left over from an earlier tag cannot leak in
                tag_name = ''
                state = BEFORE_TAG_OPENING_OR_ENDING
            else:
                output += char
        elif state == BEFORE_TAG_OPENING_OR_ENDING:
            if char == '/':
                state = EXITING_TAG
            else:
                tag_name += char
                state = TAG_NAME
        elif state == TAG_ATTRIBUTE:
            if char == '>':
                if tag_name == 'script':
                    state = INSIDE_JAVASCRIPT
                else:
                    state = NORMAL
        elif state == TAG_NAME:
            if char == ' ':
                state = TAG_ATTRIBUTE
            elif char == '>':
                if tag_name == 'script':
                    state = INSIDE_JAVASCRIPT
                else:
                    state = NORMAL
            else:
                tag_name += char
        elif state == INSIDE_JAVASCRIPT:
            if char == '<':
                state = ABOUT_TO_EXIT_JS
            else:
                pass
                # output += char
        elif state == ABOUT_TO_EXIT_JS:
            if char == '/':
                state = EXITING_TAG
                tag_name = ''
            else:
                # output += '<'
                state = INSIDE_JAVASCRIPT
        elif state == EXITING_TAG:
            if char == '>':
                state = NORMAL

    print(output)

Output:

test string

another string

2023-09-19
