How to remove JavaScript and other tags with Python... without importing modules

神不在的星期二 2023-09-19 14:15:07
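For the first part of a school project, I am trying to figure out how to remove JavaScript <script {...} > and </script {...} > tags, as well as anything between < and >. However, we cannot import any modules (not even the ones built into Python), because apparently the marker may not have access to them. I tried this:

text = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
while text.find("<script") >= 0:
    script_start = text.find("<script")
    script_end = text.find(">", text.find("</script")) + 1
    text = text[:script_start] + text[script_end:]
while text.find("<") >= 0:
    script2_start = text.find("<")
    script2_end = text.find(">") + 1
    text = text[:script2_start] + text[script2_end:]

This does work for smaller files, but the project involves large text files (the simplified test file we were given is 10.4 MB), so it never finishes and gets stuck. Does anyone have ideas for making it more efficient?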

3 answers

大话西游666

Contributed 1817 experience points, received over 14 likes

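You don't need to delete anything. In fact, you never want to modify the string at all.

Strings are immutable: every time you "modify" a string, you create a new string and throw the old one away. That wastes processor time and memory.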
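
To illustrate the point, here is a minimal sketch (not part of the original answer) that keeps the question's `find`-based idea but never rebuilds the text. It passes a start offset to `str.find`, collects the slices it wants to keep in a list, and joins them once at the end. The name `strip_tags_with_find` is hypothetical. The sketch assumes that an unclosed `<` swallows the rest of the text, and it only strips tags: script bodies are not treated specially.

```python
def strip_tags_with_find(text):
    """Drop everything between '<' and '>' without rebuilding `text`."""
    pieces = []  # slices of text that lie outside any tag
    pos = 0      # where the next search starts
    while True:
        start = text.find("<", pos)
        if start < 0:
            pieces.append(text[pos:])  # no more tags: keep the tail
            break
        pieces.append(text[pos:start])  # keep the text before the tag
        end = text.find(">", start)
        if end < 0:
            break  # assumption: an unclosed '<' swallows the rest
        pos = end + 1
    return "".join(pieces)


sample = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
print(strip_tags_with_find(sample))
```

Each character is looked at a bounded number of times, so the work should grow roughly linearly with the size of the input instead of copying the remaining text on every removal.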

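You are working with a file, so process it character by character:

  • Remember whether you are inside <...>
  • If you are, the only character that matters is >, which takes you back outside
  • If you are outside and the character is <, you move inside and ignore that character
  • If you are outside and the character is not <, write it to the output (file)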

# create file
with open("somefile.txt", "w") as f:
    # up the multiplier to 10000000 to create something in the megabyte range
    f.write("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
    # starting outside
    inside = False
    # we iterate the file line by line
    for line in f:
        # and each line characterwise
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                # only case to write to out
                out.write(c)

print(open("somefile.txt").read() + "\n")
print(open("otherfile.txt").read())

Output:

<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata


 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata

If you are not allowed to operate on files directly, read the file into a list, which consumes 11+ MB of memory:

data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

result = []

inside = False
for c in data:
    if inside:
        if c == ">":
            inside = False
        # else ignore c - because we are inside
    elif c == "<":
        inside = True
    else:
        result.append(c)

print(''.join(result))

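This is still better than repeatedly searching the list for the first occurrence of "<", but it may need up to twice the memory of the source (if the source contains no <..> at all, the list is duplicated in full).

Operating on files is far more memory-efficient than any in-place modification of a list (which would be a third approach).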
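
The third approach is only mentioned, not shown. As a rough sketch of what it might look like, assuming the text is already held as a list of characters: keep a write index that trails the read position, overwrite kept characters in place, and truncate the list once at the end. `strip_tags_in_place` is a hypothetical name, not something from the answer.

```python
def strip_tags_in_place(chars):
    """Remove <...> spans from a list of characters, reusing the same list."""
    write = 0       # next slot to overwrite with a kept character
    inside = False  # True while between '<' and '>'
    for c in chars:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            # write never runs ahead of the read position, so this is safe
            chars[write] = c
            write += 1
    del chars[write:]  # cut off the leftover tail in one step
    return chars


data = list("<hi> hey <bye> tata\n")
strip_tags_in_place(data)
print("".join(data))
```

This avoids a second list, but the list of single characters is itself much larger than the original text, which is consistent with the answer's point that streaming through files is the more memory-friendly option.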


You also need to deal with some obvious problems, for example

<script type="text/javascript">
var i = 10;
if (i < 5) {
  // some code
}
</script>

which would leave "code" behind in the output.
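
To make the problem concrete, the sketch below applies the same inside/outside scan to that snippet held in a string and prints what survives. The helper name `strip_tags` and the in-memory string are assumptions made for the illustration; the answer itself works on files.

```python
def strip_tags(text):
    """The simple inside/outside scan, applied to a string."""
    kept = []
    inside = False
    for c in text:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            kept.append(c)
    return "".join(kept)


js = '<script type="text/javascript">\nvar i = 10;\nif (i < 5) {\n  // some code\n}\n</script>'
leftover = strip_tags(js)
print(repr(leftover))
```

The opening tag is removed, but the script body up to the `<` of the comparison is written out as ordinary text, and the comparison's `<` is then mistaken for the start of a tag that only ends at the `>` of `</script>`.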


The following may handle the simpler corner cases:


# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
    # starting outside
    inside = False
    insideJS = False
    jsStart = 0
    # we iterate the file line by line
    for line in f:

        # string manipulation :/ - will remove <script ...> .. </script ..>
        # even over multiple lines - probably missed some cornercases.
        while True:
            if insideJS:
                # the script block was opened on an earlier line
                jsStart = 0
            elif "<script" in line:
                insideJS = True
                jsStart = line.index("<script")
            else:
                break

            closeAt = line.find("</script", jsStart)
            if closeAt >= 0:
                # cut out everything up to the ">" that ends the closing tag
                jsEnd = line.index(">", closeAt) + 1
                line = line[:jsStart] + line[jsEnd:]
                insideJS = False
            else:
                # no closing tag on this line: keep only the text before the script
                line = line[:jsStart]
                break

        # and each line characterwise
        for c in line:
            # same scan as above
            if inside:
                if c == ">":
                    inside = False
            elif c == "<":
                inside = True
            else:
                out.write(c)


2023-09-19

偶然的你

Contributed 1841 experience points, received over 3 likes

Even with two while loops, this is still linear complexity:


string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
new_string = ''
i = 0
while i < len(string):
    if string[i] == "<":
        # skip ahead to the closing '>' (or to the end if the tag is never closed)
        while i < len(string) and string[i] != '>':
            i += 1
    else:
        new_string += string[i]
    i += 1

print(new_string)

Output:


 hello  hello  hey 


2023-09-19

呼唤远方

Contributed 1856 experience points, received over 11 likes

Here is an approach using an FSA (finite-state automaton):

output = ''

NORMAL, INSIDE_TAG = range(2) # available states

state = NORMAL # start with normal state

s = '<script beep beep> hello </script boop doop woop> hello <hi id="someid" class="some class"><a> hey </a><bye>'

for char in s:
  if char == '<': # if we encounter '<' we enter the INSIDE_TAG state
    state = INSIDE_TAG
    continue
  elif char == '>': # we can safely exit the INSIDE_TAG state
    state = NORMAL
    continue

  if state == NORMAL:
    output += char  # add the char to the output only if we are in normal state

print(output)

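If you need to parse tag semantics, make sure to use a stack (which can be implemented with a list).

This adds complexity, but with an FSM you can implement reliable checks.

See the following example: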


output = ''

(
  NORMAL,
  TAG_ATTRIBUTE,
  INSIDE_JAVASCRIPT,
  EXITING_TAG,
  BEFORE_TAG_OPENING_OR_ENDING,
  TAG_NAME,
  ABOUT_TO_EXIT_JS
) = range(7) # available states

state = NORMAL # start with normal state

tag_name = ''

s = """
<script type="text/javascript">
  var i = 10;
  if (i < 5) {
    // some code
  }
</script>
<sometag>
  test string
  <a href="http://google.com"> another string</a>
</sometag>
"""

for char in s:
  # print(char, '-', state, ':', tag_name)
  if state == NORMAL:
    if char == '<':
      tag_name = '' # forget the previous tag's name before reading a new one
      state = BEFORE_TAG_OPENING_OR_ENDING
    else:
      output += char
  elif state == BEFORE_TAG_OPENING_OR_ENDING:
    if char == '/':
      state = EXITING_TAG
    else:
      tag_name += char
      state = TAG_NAME
  elif state == TAG_ATTRIBUTE:
    if char == '>':
      if tag_name == 'script':
        state = INSIDE_JAVASCRIPT
      else:
        state = NORMAL
  elif state == TAG_NAME:
    if char == ' ':
      state = TAG_ATTRIBUTE
    elif char == '>':
      if tag_name == 'script':
        state = INSIDE_JAVASCRIPT
      else:
        state = NORMAL
    else:
      tag_name += char
  elif state == INSIDE_JAVASCRIPT:
    if char == '<':
      state = ABOUT_TO_EXIT_JS
    else:
      pass
      # output += char
  elif state == ABOUT_TO_EXIT_JS:
    if char == '/':
      state = EXITING_TAG
      tag_name = ''
    else:
      # output += '<'
      state = INSIDE_JAVASCRIPT
  elif state == EXITING_TAG:
    if char == '>':
      state = NORMAL

print(output)

Output:


  test string

  another string
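
The stack mentioned earlier in this answer is not shown in the examples. As a minimal sketch of the idea, a plain list can track which tags are currently open and check that they are closed in the right order. The function name `tags_balanced` is hypothetical, and the sketch assumes that script bodies have already been removed and that there are no self-closing or void tags such as `<br>`.

```python
def tags_balanced(text):
    """Return True if every opening tag is closed again in the right order."""
    stack = []  # names of the tags that are currently open
    pos = 0
    while True:
        start = text.find("<", pos)
        if start < 0:
            break
        end = text.find(">", start)
        if end < 0:
            return False  # a '<' that is never closed
        body = text[start + 1:end].strip()
        pos = end + 1
        closing = body.startswith("/")
        words = body.lstrip("/").split()
        name = words[0] if words else ""
        if closing:
            # a closing tag must match the most recently opened tag
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything still open means unbalanced


print(tags_balanced('<sometag> test string <a href="http://google.com"> another string</a></sometag>'))
print(tags_balanced("<a><b></a></b>"))
```

`list.append` and `list.pop` give the push and pop operations, so no imports are needed.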


2023-09-19