4 Answers
Contributed 1827 experience points · received 4+ upvotes
Something similar to @Manakin's answer.
Convert the Fish Count column into lists:
df['Fish Count']=df['Fish Count'].str.split(',')
Explode so that each fish, together with its count, gets its own row:
df2=df.explode('Fish Count')
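Create the dictionary from Fish Count. Here I use a dict comprehension to derive the keys and values after splitting each entry on the whitespace that follows the digits: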
{i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}
Result:
{'38': 'Sand Bass',
' 16': 'Sculpin',
' 10': 'Blacksmith',
'138': 'Sculpin',
' 28': 'Sand Bass',
'150': 'Sculpin Released',
' 102': 'Sculpin',
' 40': 'Sanddab',
'156': 'Sculpin',
' 29': 'Sand Bass',
' 5': 'Black Croaker',
'161': 'Sculpin'}
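Several keys in the result above keep the leading space left behind by splitting on ',' alone (for example ' 16'). Here is a minimal sketch of one way to avoid that. It assumes a two-row sample frame built only for illustration; the names items and fish_by_count are hypothetical. Each item is stripped after exploding and then split once on whitespace.

```python
import pandas as pd

# Small illustrative sample (an assumption, not the asker's full data).
df = pd.DataFrame({'Fish Count': ['38 Sand Bass, 16 Sculpin, 10 Blacksmith',
                                  '138 Sculpin, 28 Sand Bass']})

# One item per row, with the surrounding whitespace removed.
items = df['Fish Count'].str.split(',').explode().str.strip()

# Split each item once, at the first run of whitespace: [count, species].
fish_by_count = {count: species for count, species in items.str.split(n=1)}
print(fish_by_count)
```

The keys are still strings; wrap count in int() inside the comprehension if numeric keys are needed.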
Print it as a DataFrame if needed:
print(pd.DataFrame.from_dict({i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}, orient='index'))
0
38 Sand Bass
16 Sculpin
10 Blacksmith
138 Sculpin
28 Sand Bass
150 Sculpin Released
102 Sculpin
40 Sanddab
156 Sculpin
29 Sand Bass
5 Black Croaker
161 Sculpin
Contributed 1808 experience points · received 4+ upvotes
IIUC, we can use str.split and str.extract with stack:
s = df['Fish Count'].str.split(',',expand=True).stack()
s.str.extract(r'(\d+)(\D+)')
Yields:
0 1
0 0 38 Sand Bass
1 16 Sculpin
2 10 Blacksmith
1 0 138 Sculpin
1 28 Sand Bass
2 0 150 Sculpin Released
1 102 Sculpin
2 40 Sanddab
3 0 156 Sculpin
1 29 Sand Bass
2 5 Black Croaker
3 3 ...
4 0 161 Sculpin
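For readers who want to reproduce the split, stack and extract steps end to end, here is a hedged, self-contained sketch. It assumes the five-row sample string used by another answer in this thread; the names pairs, num and species are introduced here for illustration. The explicit dropna() is a precaution in case stack keeps missing cells, and the extra \s* in the pattern keeps the separating space out of the species group.

```python
import pandas as pd

data = ('38 Sand Bass, 16 Sculpin, 10 Blacksmith\n'
        '138 Sculpin, 28 Sand Bass\n'
        '150 Sculpin Released, 102 Sculpin, 40 Sanddab\n'
        '156 Sculpin, 29 Sand Bass, 5 Black Croaker\n'
        '161 Sculpin')
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

# One column per item, then stack into a Series indexed by (row, position).
s = df['Fish Count'].str.split(',', expand=True).stack().dropna()

# Capture the digits and the text after them as two columns.
pairs = s.str.extract(r'(\d+)\s*(\D+)')
pairs.columns = ['num', 'species']
pairs['num'] = pairs['num'].astype(int)
print(pairs)
```

The first index level is the original row and the second is the item's position within that row.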
Then it's up to you which format you want or need.
For example:
s.str.extract(r'(\d+)(\D+)').groupby(level=[1]).agg(list)
0 1
0 [38, 138, 150, 156, 161] [ Sand Bass, Sculpin, Sculpin Released, Scu...
1 [16, 28, 102, 29] [ Sculpin, Sand Bass, Sculpin, Sand Bass]
2 [10, 40, 5] [ Blacksmith, Sanddab, Black Croaker]
3 [3] [ ...]
Or:
s.str.extract(r'(\d+)(\D+)').unstack(1)
0 1
0 1 2 3 0 1 2 3
0 38 16 10 NaN Sand Bass Sculpin Blacksmith NaN
1 138 28 NaN NaN Sculpin Sand Bass NaN NaN
2 150 102 40 NaN Sculpin Released Sculpin Sanddab NaN
3 156 29 5 3 Sculpin Sand Bass Black Croaker ...
4 161 NaN NaN NaN Sculpin NaN NaN NaN
Or:
s.str.extract(r'(\d+)(\D+)').values
array([['38', ' Sand Bass'],
['16', ' Sculpin'],
['10', ' Blacksmith'],
['138', ' Sculpin'],
['28', ' Sand Bass'],
['150', ' Sculpin Released'],
['102', ' Sculpin'],
['40', ' Sanddab'],
['156', ' Sculpin'],
['29', ' Sand Bass'],
['5', ' Black Croaker'],
['3', ' ...'],
['161', ' Sculpin']], dtype=object)
You can turn it into a dictionary.
# Actually, I'd use fish: num (sorry, I closed my IDE).
# Keep in mind that keys can only be unique in a dict.
{num : fish for num, fish in s.str.extract(r'(\d+)(\D+)').values}
{'38': ' Sand Bass',
'16': ' Sculpin',
'10': ' Blacksmith',
'138': ' Sculpin',
'28': ' Sand Bass',
'150': ' Sculpin Released',
'102': ' Sculpin',
'40': ' Sanddab',
'156': ' Sculpin',
'29': ' Sand Bass',
'5': ' Black Croaker',
'3': ' ...',
'161': ' Sculpin'}
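Because dict keys must be unique, a species-to-count mapping would silently keep only the last count for each repeated species. The sketch below illustrates that and one possible workaround with groupby. The four-row frame and the names pairs, last_only, counts_by_species and totals are assumptions made for this example.

```python
import pandas as pd

# Illustrative sample with repeated species (an assumption for this sketch).
pairs = pd.DataFrame({'num': [38, 16, 138, 28],
                      'species': ['Sand Bass', 'Sculpin', 'Sculpin', 'Sand Bass']})

# A plain comprehension overwrites earlier entries that share a key.
last_only = {species: num for num, species in zip(pairs['num'], pairs['species'])}

# Grouping first keeps every count per species...
counts_by_species = pairs.groupby('species')['num'].agg(list).to_dict()

# ...or reduces them to a single total per species.
totals = pairs.groupby('species')['num'].sum().to_dict()

print(last_only)
print(counts_by_species)
print(totals)
```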
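Contributed 1784 experience points · received 9+ upvotes
First, you need to explode the lists you created; then you can use extract with a regex twice, once to match the number and once to match the text.
With the data:
import pandas as pd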
data = '38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n150 Sculpin Released, 102 Sculpin, 40 Sanddab\n156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin'
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])
Do:
countsdf = df['Fish Count'].str.split(', ')
countsdf = countsdf.explode(ignore_index=True).rename('fish').to_frame()
countsdf['count'] = countsdf.fish.str.extract('([0-9]+)')
countsdf['species'] = countsdf.fish.str.extract('([a-zA-Z]+[ a-zA-Z]*)')
countsdf.drop('fish', axis=1, inplace=True)
Output:
count species
0 38 Sand Bass
1 16 Sculpin
2 10 Blacksmith
3 138 Sculpin
4 28 Sand Bass
5 150 Sculpin Released
6 102 Sculpin
7 40 Sanddab
8 156 Sculpin
9 29 Sand Bass
10 5 Black Croaker
11 161 Sculpin
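As a possible variation on the two extract calls, a single pattern with named groups can produce both columns in one pass, since str.extract uses group names as column names. This is only a sketch: the variable names items and counts are hypothetical, and explode(ignore_index=True) assumes pandas 1.1 or later.

```python
import pandas as pd

data = ('38 Sand Bass, 16 Sculpin, 10 Blacksmith\n'
        '138 Sculpin, 28 Sand Bass\n'
        '150 Sculpin Released, 102 Sculpin, 40 Sanddab\n'
        '156 Sculpin, 29 Sand Bass, 5 Black Croaker\n'
        '161 Sculpin')
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

# One item per row with a fresh 0..n-1 index.
items = df['Fish Count'].str.split(', ').explode(ignore_index=True)

# Named groups become the column names of the extracted frame.
counts = items.str.extract(r'(?P<count>\d+)\s+(?P<species>.+)')
counts['count'] = counts['count'].astype(int)  # numeric counts, not strings
print(counts)
```

Converting count to an integer makes later aggregation such as sums per species straightforward.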
Contributed 1810 experience points · received 4+ upvotes
I used @Manakin's answer to get to this multi-index DataFrame (stored in s):
0 1
0 0 38 Sand Bass
1 16 Sculpin
2 10 Blacksmith
1 0 138 Sculpin
1 28 Sand Bass
2 0 150 Sculpin Released
1 102 Sculpin
2 40 Sanddab
3 0 156 Sculpin
1 29 Sand Bass
2 5 Black Croaker
4 0 161 Sculpin
Then I renamed the columns, stripped the leading and trailing whitespace from species, switched the column order, and set the index names.
s.columns = ['num','species']
s.species = s.species.str.strip()
s = s.reindex(['species','num'],axis=1)
s.index.names = ['a','b']
s.head()
species num
a b
0 0 Sand Bass 38
1 Sculpin 16
2 Blacksmith 10
1 0 Sculpin 138
1 Sand Bass 28
Then I flattened the frame by resetting the index, set a new index on a and species, and dropped the b column.
s_flat = s.reset_index()
s_reindexed = s_flat.set_index(['a','species'])
s_reindexed = s_reindexed.drop(columns='b')
s_reindexed.head()
num
a species
0 Sand Bass 38
Sculpin 16
Blacksmith 10
1 Sculpin 138
Sand Bass 28
Finally, I unstacked and dropped the top level of the column multi-index. I also had a null (NaN) column that I had to drop.
s_reindexed = s_reindexed.unstack(1)
s_reindexed.columns = s_reindexed.columns.droplevel(0)
s_reset = s_reindexed.drop(columns=np.nan)
s_reset.head()
species Albacore Barracuda Barracuda Released Bat Ray Released Black Croaker Black Seabass Released Blacksmith Blue Perch Bluefin Tuna Bocaccio ...
a
0 NaN NaN NaN NaN NaN NaN 10 NaN NaN NaN ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
3 NaN NaN NaN NaN 5 NaN NaN 3 NaN NaN ...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
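A similar wide layout can probably be reached more directly with DataFrame.pivot instead of set_index followed by unstack. This is an alternative technique, not the method used in this answer. The sketch assumes a small long-format frame built for illustration, with hypothetical names long_df and wide.

```python
import pandas as pd

# Illustrative long-format data: original row id, species, and count.
long_df = pd.DataFrame({
    'a': [0, 0, 0, 1, 1],
    'species': ['Sand Bass', 'Sculpin', 'Blacksmith', 'Sculpin', 'Sand Bass'],
    'num': [38, 16, 10, 138, 28],
})

# One row per original record, one column per species, counts as values.
wide = long_df.pivot(index='a', columns='species', values='num')
print(wide)
```

Note that pivot raises an error if the same species appears twice within one row id; pivot_table with an aggregation function such as 'sum' is one option in that case.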