4 Answers
Contributed 2019 experience points, received over 9 likes
If your strings always follow the format "name from place and name from place", you can do this:
import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

rows = []  # collects one [name, affiliation] pair per person
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split on ' from ' and strip stray whitespace from both parts
    rows.append([part.strip() for part in row.split(' from ')])

# then create the DataFrame
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])
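One thing to watch with this approach: the substrings "and" and "from" can also appear inside a name (for example "Sandra"), so it may be safer to split only where they stand as separate words. Below is a minimal sketch of that idea, assuming entries are always joined by a standalone "and" and each entry contains " from " once. The helper name parse_people and the sample name "Sandra Anderson" are made up for illustration.

```python
import re

def parse_people(s):
    """Split 'name from place and name from place' into (name, place) pairs."""
    pairs = []
    # split only on 'and' surrounded by whitespace, so 'Sandra' stays intact
    for entry in re.split(r"\s+and\s+", s.strip()):
        name, sep, place = entry.partition(" from ")
        if not sep:
            # no ' from ' separator in this entry: skip it rather than guess
            continue
        pairs.append((name.strip(), place.strip()))
    return pairs

# hypothetical input whose name contains the letters 'and'
s = "Dr. Sandra Anderson from UC San Francisco and Usain Bolt from UC San Francisco"
people = parse_people(s)
```

The resulting list of tuples can be passed to pd.DataFrame(people, columns=['Name', 'Affiliation']). An affiliation that itself contains a standalone "and" would still be split incorrectly, so this only narrows the problem.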
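Contributed 1873 experience points, received over 9 likes
You can match with a regular expression and build the df from the result. Here is an example approach for one string: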
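import re
import pandas as pd

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. "
        "Elton John, Public Health Director for Davis County")
text = text.replace(', and', ',')
# [\w\s] does not match '.', so the "Dr." prefix is left out of each name
r = re.findall(r"([\w\s]+),([\w\s]+)", text)
df = pd.DataFrame(r, columns=["Name", "Affiliation"])
# strip the whitespace left over around each captured field
df = df.apply(lambda col: col.str.strip())
print(df)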
Output:
            Name                              Affiliation
0  Sharif Amlani                          UC Davis Health
1      Joe Biden                         UC San Francisco
2     Elton John  Public Health Director for Davis County
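Note that the output above has lost the "Dr." prefix, because "." is not part of the [\w\s] character class. If you want to keep the titles, one alternative (not the method used above) is to split on commas and pair up consecutive fields. This is only a sketch, assuming the fields strictly alternate between name and affiliation and that no field contains a comma of its own:

```python
import re

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")

# split on commas, then drop surrounding whitespace and a leading 'and '
fields = [re.sub(r"^and\s+", "", f.strip()) for f in text.split(",")]

# even-indexed fields are names, odd-indexed fields are affiliations
rows = list(zip(fields[0::2], fields[1::2]))
```

The rows list can then go straight into pd.DataFrame(rows, columns=["Name", "Affiliation"]).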
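Contributed 2051 experience points, received over 10 likes
In scraping, everything comes down to pattern matching. It can be very painful when the strings are not formatted consistently, and unfortunately that seems to be the case here. So I would suggest handling it case by case.
One pattern I can see is that, with one exception, all the names start with "Dr.". You can use that to extract the names with a regular expression.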
import re
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
regex = r'((Dr\.)( [A-Z][a-z]+)+)'  # findall returns a tuple of three groups for each match
names = [match[0] for match in re.findall(regex, text)]  # keep the first (outermost) group, which is the full name
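You can apply this to the other strings, but as mentioned above, the limitation is that it only captures names that start with "Dr.". You can use a similar strategy for the affiliations. Note that "," separates names from affiliations, so we can make use of that.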
import re
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]  # split the text on commas, drop the pieces that contain 'Dr.', and strip leftover whitespace
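Again, you will have to adapt the solution to your particular text, but hopefully this helps you think about the problem. Finally, you can use pandas to combine the results into a DataFrame: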
import pandas as pd
data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])
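One thing to be aware of is that zip() silently stops at the shorter list, so a name without the "Dr." prefix would shift or drop rows without any error. The sketch below puts the two steps together with a length check before pairing. It swaps in non-capturing groups (a variation on the regex above, not the original pattern) so that re.findall returns plain strings:

```python
import re

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")

# non-capturing groups, so findall returns each whole match as a string
names = re.findall(r"Dr\.(?: [A-Z][a-z]+)+", text)

# comma-separated pieces without 'Dr.' are treated as affiliations
affiliations = [t.strip() for t in text.split(",") if "Dr." not in t]

# zip() would silently truncate to the shorter list, so check lengths first
if len(names) != len(affiliations):
    raise ValueError(f"{len(names)} names but {len(affiliations)} affiliations")

rows = list(zip(names, affiliations))
```

If the check fails for a given string, that string probably needs its own handling, which fits the case-by-case approach suggested above.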
Contributed 2016 experience points, received over 9 likes
Here is sample code for this sample text:
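import pandas as pd

# names are assumed to be padded to 16 characters so the fixed-width slices below line up
text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"
lines = text.split('\n')
# slice each line at column 16: name on the left, affiliation on the right
df = pd.concat([pd.DataFrame([[line[0:16].strip(), line[16:].strip()]]) for line in lines], ignore_index=True)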
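If the text really is laid out in fixed-width columns, pandas can also parse it directly with pd.read_fwf instead of slicing each line by hand. A minimal sketch, assuming the name column is 16 characters wide (as in the slicing above) and that no line is longer than 80 characters. The records list and the upper bound of 80 are choices made for this example:

```python
import io
import pandas as pd

records = [
    ("Sharif Amlani", "UC Davis Health"),
    ("Joe Biden", "UC San Francisco"),
    ("Elton John", "Public Health Director for Davis County"),
    ("Winston Bishop", "UC San Francisco"),
    ("Usain Bolt", "UC San Francisco"),
]
# build the fixed-width text: pad every name to exactly 16 characters
text = "\n".join(name.ljust(16) + affiliation for name, affiliation in records)

# colspecs are half-open [start, end) character ranges for each column
df = pd.read_fwf(
    io.StringIO(text),
    colspecs=[(0, 16), (16, 80)],
    header=None,
    names=["Name", "Affiliation"],
)
```

read_fwf strips the padding from each field, so no separate strip() step is needed here.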