
How do I get non-letter, non-digit characters appended to the list?


慕容3067478 2023-05-09 15:01:28
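This is about a simple word count: collecting the words that occur in a document and how often each one occurs. I am trying to write a function whose input is a list of text lines. I go through all the lines, split them into words, accumulate the recognized words, and finally return the complete list.

First, I have a while loop that walks through all the characters in a line but skips whitespace. Inside this while loop I also try to recognize what kind of word I have. In this context there are three kinds of words: those that start with a letter; those that start with a digit; and those that consist of a single character that is neither a letter nor a digit.

I have three if statements that check what kind of character I have. Once I know what kind of word I have encountered, I try to extract the word itself. When the word starts with a letter or a digit, I take all consecutive characters of the same kind as part of the word.

In the third if statement, however, where I handle the case that the current character is neither a letter nor a digit, I run into a problem.

When I call

wordfreq.tokenize(['15,    delicious&   Tarts.'])

I want the output to be

['15', ',', 'delicious', '&', 'tarts', '.']

When I test the function in the Python console, it looks like this:

PyDev console: starting.
Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52)
[Clang 6.0 (clang-600.0.57)] on darwin
import wordfreq
wordfreq.tokenize(['15,    delicious&   Tarts.'])
['15', 'delicious', 'tarts']

The function drops the comma, the ampersand, and the period! How do I fix this? See the code below. (The lower() call is there because I want to ignore case, e.g. 'Tarts' and 'tarts' are really the same word.)

# wordfreq.py
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            if line[start].isalpha():
                end = start
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                words.append(word.lower())
                start = end
            elif line[start].isdigit():
                end = start
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                words.append(line[start])
            start = start + 1
    return words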

3 Answers

qq_遁去的一_1

1725 experience points contributed, over 7 likes received

I found the problem. The line

start = start + 1

should be inside the final else branch.

So my code now looks like this, and it gives me the desired output specified above:


def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            end = start
            if line[start].isalpha():
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                word = word.lower()
                words.append(word)
                start = end
            elif line[start].isdigit():
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                word = line[start]
                words.append(word)
                start = start + 1
    return words

However, when I use the test script below to make sure that the function tokenize handles all the corner cases ...


import io
import sys
import importlib.util

def test(fun,x,y):
    global pass_tests, fail_tests
    if type(x) == tuple:
        z = fun(*x)
    else:
        z = fun(x)
    if y == z:
        pass_tests = pass_tests + 1
    else:
        if type(x) == tuple:
            s = repr(x)
        else:
            s = "("+repr(x)+")"
        print("Condition failed:")
        print("   "+fun.__name__+s+" == "+repr(y))
        print(fun.__name__+" returned/printed:")
        print(str(z))
        fail_tests = fail_tests + 1

def run(src_path=None):
    global pass_tests, fail_tests

    if src_path == None:
        import wordfreq
    else:
        spec = importlib.util.spec_from_file_location("wordfreq", src_path+"/wordfreq.py")
        wordfreq = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(wordfreq)

    pass_tests = 0
    fail_tests = 0
    fun_count  = 0

    def printTopMost(freq,n):
        saved = sys.stdout
        sys.stdout = io.StringIO()
        wordfreq.printTopMost(freq,n)
        out = sys.stdout.getvalue()
        sys.stdout = saved
        return out

    if hasattr(wordfreq, "tokenize"):
        fun_count = fun_count + 1
        test(wordfreq.tokenize, [], [])
        test(wordfreq.tokenize, [""], [])
        test(wordfreq.tokenize, ["   "], [])
        test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
        test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
        test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
        test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
        test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
    else:
        print("tokenize is not implemented yet!")

    if hasattr(wordfreq, "countWords"):
        fun_count = fun_count + 1
        test(wordfreq.countWords, ([],[]), {})
        test(wordfreq.countWords, (["clean","water"],[]), {"clean":1,"water":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],[]), {"clean":1,"water":2,"is":1,"drinkable":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],["is"]), {"clean":1,"water":2,"drinkable":1})
    else:
        print("countWords is not implemented yet!")

    if hasattr(wordfreq, "printTopMost"):
        fun_count = fun_count + 1
        test(printTopMost,({},10),"")
        test(printTopMost,({"horror": 5, "happiness": 15},0),"")
        test(printTopMost,({"C": 3, "python": 5, "haskell": 2, "java": 1},3),"python                  5\nC                       3\nhaskell                 2\n")
    else:
        print("printTopMost is not implemented yet!")

    print(str(pass_tests)+" out of "+str(pass_tests+fail_tests)+" passed.")

    return (fun_count == 3 and fail_tests == 0)

if __name__ == "__main__":
    run()

... I get the following output:


/usr/local/bin/python3.7 "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py"
Traceback (most recent call last):
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 81, in <module>
    run()
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 50, in run
    test(wordfreq.tokenize, ["   "], [])
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 10, in test
    z = fun(x)
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/wordfreq.py", line 44, in tokenize
    while line[start].isspace():
IndexError: string index out of range

Why does it say the string index is out of range? How do I fix this?
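The traceback points at the inner whitespace loop: for the input "   " it keeps incrementing start until line[start] reads past the end of the string, because only the outer loop compares start with len(line). The letter and digit loops have the same gap whenever a word ends the line. The following is a minimal sketch of one way to guard all three loops, not the original poster's final solution; the docstring, the break, and the use of += are choices made for this illustration.

```python
def tokenize(lines):
    """Split each line into lowercase words, digit runs, and single symbols."""
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            # Skip whitespace, but never index past the end of the line.
            while start < len(line) and line[start].isspace():
                start += 1
            if start == len(line):
                break  # only whitespace was left on this line
            end = start
            if line[start].isalpha():
                # Extend over consecutive letters, bounded by the line length.
                while end < len(line) and line[end].isalpha():
                    end += 1
                words.append(line[start:end].lower())
            elif line[start].isdigit():
                # Extend over consecutive digits, bounded by the line length.
                while end < len(line) and line[end].isdigit():
                    end += 1
                words.append(line[start:end])
            else:
                # Any other character is a one-character token.
                end = start + 1
                words.append(line[start])
            start = end
    return words


print(tokenize(["   "]))
print(tokenize(["15,    delicious&   Tarts."]))
```

With these bounds checks, the all-whitespace line yields an empty list, and the example line from the question yields ['15', ',', 'delicious', '&', 'tarts', '.'].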


Posted 2023-05-09
回首忆惘然

1847 experience points contributed, over 11 likes received

I'm not sure why you are handling upper and lower case, but here is how you can split it:


input = ['15,    delicious&   Tarts.']
line = input[0]
words = line.split(' ')
words = [word for word in words if word]

Output:

['15,', 'delicious&', 'Tarts.']

Edit: I see you edited the output you want. Just skip this line to get that output:


    words = [word for word in words if word]
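For reference, here is a quick check of what the split-based variants in this answer return for the example line. It is only an illustration of how str.split behaves, written for this page rather than taken from the answer; the last call, split() with no argument, is an addition.

```python
line = "15,    delicious&   Tarts."

# Splitting on a single space leaves an empty string for every extra space.
print(line.split(" "))

# Filtering out the empty strings, as in the list comprehension above.
print([word for word in line.split(" ") if word])

# split() with no argument splits on runs of whitespace and drops empty strings.
print(line.split())
```

The first call prints ['15,', '', '', '', 'delicious&', '', '', 'Tarts.']; the other two print ['15,', 'delicious&', 'Tarts.']. In every case the punctuation stays attached to the neighbouring word, so none of these produces the token list the question asks for.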


Posted 2023-05-09
素胚勾勒不出你

1827 experience points contributed, over 9 likes received

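itertools.groupby can simplify this considerably. Basically, you group the characters in the string by their category or type (letter, digit, or punctuation). In this example I only define these three categories, but you can define as many as you need. Any character that does not match any category (whitespace, in this case) is ignored: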


def get_tokens(string):
    from itertools import groupby
    from string import ascii_lowercase, ascii_uppercase, digits, punctuation as punct
    alpha = ascii_lowercase + ascii_uppercase

    yield from ("".join(group) for key, group in groupby(string, key=lambda char: next((category for category in (alpha, digits, punct) if char in category), "")) if key)

print(list(get_tokens("15,    delicious&   Tarts.")))

Output:


['15', ',', 'delicious', '&', 'Tarts', '.']
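The generator above keeps the original case and would group consecutive punctuation characters into one token. Below is a hedged sketch of how the same groupby idea could be adapted to the behaviour the question describes (lowercased words, one token per symbol). The helper char_category and its labels are hypothetical names introduced for this illustration, and it uses str.isalpha and str.isdigit instead of the ASCII constants, so it also accepts non-ASCII letters.

```python
from itertools import groupby


def char_category(char):
    """Return a label for the character's category, or None for whitespace."""
    if char.isalpha():
        return "alpha"
    if char.isdigit():
        return "digit"
    if char.isspace():
        return None
    return "symbol"


def tokenize(lines):
    """Tokenize a list of lines by grouping consecutive same-category characters."""
    words = []
    for line in lines:
        for category, group in groupby(line, key=char_category):
            if category is None:
                continue  # whitespace separates tokens but is not a token
            if category == "symbol":
                words.extend(group)  # every symbol becomes its own token
            elif category == "alpha":
                words.append("".join(group).lower())
            else:
                words.append("".join(group))
    return words


print(tokenize(["15,    delicious&   Tarts."]))
```

Because groupby never indexes into the string, an empty or all-whitespace line simply produces no tokens.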



Posted 2023-05-09