3 Answers
You can split the comma-separated column into lists, explode the list column into one row per word, groupby the name and the exploded words and count, then unstack (or pivot_table) the frame into the desired shape, with a final clean-up of the multi-index columns using droplevel(), reset_index(), etc.
Everything below uses vectorized pandas methods, so it should be reasonably fast. Note: the three columns in the code below are [0, 1, 2] because I read the data from the clipboard with headers=None.
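Before the real data, here is a minimal toy sketch of that explode/groupby/unstack chain (my own illustration on made-up data, assuming pandas >= 0.25, where DataFrame.explode was added):
import pandas as pd

toy = pd.DataFrame({'user': ['a', 'b'],
                    'words': [['x', 'y', 'x'], ['y']]})

counts = (toy.explode('words')              # one row per (user, word)
             .groupby(['user', 'words'])    # group repeated pairs
             .size()                        # count them
             .unstack(fill_value=0))        # words become columns, missing pairs -> 0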
Input:
import pandas as pd

df = pd.DataFrame({0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
                   1: {0: '12-02-1999',
                       1: '12-02-1999',
                       2: '12-03-1999',
                       3: '12-02-2000',
                       4: '12-03-2000'},
                   2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
                       1: 'hey#,john$,hows#, it#, goin#',
                       2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
                       3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
                       4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'}})
Code:
# strip the # / $ markers and split each string into a list of words
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')
df = (df.explode(2)                   # one row per (name, word)
      .groupby([0, 2])[2].count()     # count each word per name
      .rename('Count')
      .reset_index()
      .set_index([0, 2])
      .unstack(1)                     # words become columns
      .fillna(0))
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 ... 0.0
1 mary 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 ... 1.0
2 tom 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
2 the there tom tom up well with ya'll you
0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 2.0
1 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

[3 rows x 31 columns]
You can also use .pivot_table instead of .unstack(), which saves the line df.columns = df.columns.droplevel():
# starting again from the original, unsplit frame
df[2] = df[2].replace([r'\#', r'\$'], '', regex=True).str.split(',')
df = (df.explode(2)
.groupby([0, 2])[2].count()
.rename('Count')
.reset_index()
.pivot_table(index=0, columns=2, values='Count')
.fillna(0)
.astype(int)
.reset_index())
df
Out[45]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0 0 1 1 1 1 0 0 1 ... 0
1 mary 0 0 0 1 0 0 1 1 0 ... 1
2 tom 1 1 0 0 0 0 0 0 0 ... 0
2 the there tom tom up well with ya'll you
0 0 1 0 1 1 1 1 0 2
1 1 0 1 0 0 0 0 1 1
2 0 0 0 0 0 0 0 0 0
[3 rows x 31 columns]
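As a further aside (not in the original answer), pivot_table also takes a fill_value argument, so the .fillna(0) step can be folded into the pivot itself. A sketch of just that last step, assuming counted is the frame produced by the groupby/count/reset_index steps above (columns 0, 2 and 'Count'):
wide = (counted.pivot_table(index=0, columns=2, values='Count', fill_value=0)
               .astype(int)
               .reset_index())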
If you can use scikit-learn, this is easy with CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

# strip the #/$ markers and stray whitespace, then count tokens per row
s = df['words'].str.replace(r"#|\$|\s+", "", regex=True)
model = CountVectorizer(tokenizer=lambda x: x.split(','))
df_final = (pd.DataFrame(model.fit_transform(s).toarray(),
                         columns=model.get_feature_names_out(),
                         index=df.user)
            .groupby(level=0).sum())   # sum the per-row counts for each user
Out[279]:
and are been boys catch catching fancy flipside goin hello \
user
john 1 1 1 0 0 1 0 0 0 1
tom 0 0 0 0 0 0 0 0 1 0
mary 1 0 0 1 1 0 1 1 0 1
here hey how hows it its john mary meetin nice on the there \
user
john 0 0 1 0 0 1 0 1 0 1 0 0 1
tom 0 1 0 1 1 0 1 0 0 0 0 0 0
mary 1 0 0 0 0 0 1 0 1 0 1 1 0
tom up well with ya'll you
user
john 1 1 1 1 0 2
tom 0 0 0 0 0 0
mary 1 0 0 0 1 1
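One small difference from the first answer (my observation, not part of the answer above): the \s+ in the regex strips the stray spaces in the raw strings, so ' tom' and 'tom' collapse into a single column here, while they show up as two columns in the first answer's output. If you want the first answer's pipeline to behave the same way, a sketch of the adjusted first step:
df[2] = df[2].replace([r'\#', r'\$', r'\s+'], '', regex=True).str.split(',')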
I'm not sure how fast this approach is on a large DataFrame, but you can give it a try. First, remove the special characters and split the strings into lists of words, creating another column:
from itertools import chain
from collections import Counter

df['lists'] = df['words'].str.replace(r"#|\$", "", regex=True).str.split(",")
Now group by user, chain the lists together into one sequence per user, and count the occurrences with Counter:
df.groupby('user')['lists'].apply(chain.from_iterable)\
  .apply(Counter)\
  .apply(pd.Series)\
  .fillna(0).astype(int)
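For reference, a self-contained sketch of that whole chain on a tiny stand-in frame (hypothetical sample with the user/words column names this answer assumes, not the asker's real data):
import pandas as pd
from itertools import chain
from collections import Counter

# hypothetical two-row sample with the column names this answer expects
df = pd.DataFrame({'user':  ['john', 'tom'],
                   'words': ['hello#,there#,tom$', 'hey#,john$,hows#']})

df['lists'] = df['words'].str.replace(r"#|\$", "", regex=True).str.split(",")

counts = (df.groupby('user')['lists']
            .apply(chain.from_iterable)   # flatten each user's lists into one stream of words
            .apply(Counter)               # count occurrences per user
            .apply(pd.Series)             # expand each Counter into columns
            .fillna(0)
            .astype(int))
print(counts)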