3 Answers
Contributor with 1869 experience points, 4+ upvotes
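Using pandas
The main difference is that pandas converts all of the data to the correct dtype (e.g. datetime, int, and float), and the code is more concise. The data is also now in a format that is useful for time series analysis and plotting, though I suggest adding column names: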
df.columns = ['datetime', ..., 'price']
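The filtering itself can be done with a one-line vectorized operation.
As the %%timeit test shows, for 1M rows of data, pandas (774 ms) is only slightly slower than reading the file using with open and looking for :00 with str methods (668 ms).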
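Read the file with pandas.read_csv and parse the dates in column 0.
Use header=None, because the test data has no header row.
Use Boolean indexing to select the rows whose seconds are 0.
Use the .dt accessor to get .second.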
import pandas as pd
# read the file which apparently has no header and parse the date column
df = pd.read_csv('test.csv', header=None, parse_dates=[0])
# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]
# save the data
top_of_the_minute.to_csv('clean.csv', header=False, index=False)
# display(top_of_the_minute)
0 1 2 3 4 5 6 7 8
5 2020-08-03 22:17:00 0 0 4803 4800 91 28.05 24.05 58.8917
6 2020-08-03 22:17:00 0 0 4802 4800 91 28.05 24.05 58.8925
7 2020-08-03 22:17:00 0 0 4805 4800 91 28.05 24.05 58.9341
8 2020-08-03 22:17:00 0 0 4802 4800 91 28.05 24.05 58.9683
9 2020-08-03 22:17:00 0 0 4802 4800 91 28.05 23.05 58.9780
# example: rename columns
top_of_the_minute.columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']
# example: plot the data
p = top_of_the_minute.plot('datetime', 'p3')
p.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
p.set_xlim('2020-08', '2020-09')
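As an alternative to renaming the columns after filtering, the names can be supplied when the file is read. The following is a minimal sketch, assuming the same placeholder column names as above (the question does not say what the columns mean) and using an in-memory copy of a few rows from test.csv so that it runs on its own:

```python
import io

import pandas as pd

# Three rows copied from test.csv: one with non-zero seconds, two at :00.
sample = """2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
"""

# Placeholder names (an assumption); replace them with meaningful ones.
cols = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

# names= assigns the column labels at read time; parse_dates can then
# refer to the date column by name instead of by position.
df = pd.read_csv(io.StringIO(sample), header=None, names=cols,
                 parse_dates=['datetime'])

# Same Boolean indexing as before, but with a readable column name.
top_of_the_minute = df[df['datetime'].dt.second == 0]

print(df.dtypes)
print(top_of_the_minute)
```

Printing df.dtypes is a quick way to confirm that the date column was parsed as a datetime and the numeric columns as int or float.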
test.csv
2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,23.05,58.978
%%timeit test
Create the test data
# read test.csv
df = pd.read_csv('test.csv', header=None, parse_dates=[0])
# create a dataframe with 1M rows
df = pd.concat([df] * 100000)
# save the new test data
df.to_csv('test.csv', index=False, header=False)
test_sk
def test_sk(path: str):
    zero_entries = []
    with open(path, "r") as file:
        for line in file:
            semi_index = line.index(',')
            if line[:semi_index].endswith(':00'):
                zero_entries.append(line)
    return zero_entries
%%timeit
result_sk = test_sk('test.csv')
[out]:
668 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
test_tm
def test_tm(path: str):
    df = pd.read_csv(path, header=None, parse_dates=[0])
    return df[df[0].dt.second == 0]
%%timeit
result_tm = test_tm('test.csv')
[out]:
774 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contributor with 1871 experience points, 13+ upvotes
Try this:
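finalmasterlist2 = []
for i in range(len(altmasterlist)):
    # endswith avoids matching ":00" in the minutes field, e.g. "22:00:15"
    if altmasterlist[i][0].endswith(":00"):
        # extend adds the fields of the matching row to one flat list
        finalmasterlist2.extend(altmasterlist[i])
print("finalmasterlist2")
print(finalmasterlist2)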
Input:
2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
Output:
['2020-08-03 22:17:00', '0', '0', '4805', '4800', '91', '28.05', '24.05', '58.9341']
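Two details of this loop are worth checking on your own data. A substring test such as ":00" in timestamp also matches a timestamp whose minutes are 00, while endswith(":00") only matches zero seconds. And extend flattens every matching row into one list, whereas append keeps one list per row. A small sketch with made-up rows (illustrative only, not taken from the question's file):

```python
# Made-up rows for illustration; only the first field matters here.
rows = [
    ['2020-08-03 22:00:15', '0', '58.8917'],  # minutes are 00, seconds are 15
    ['2020-08-03 22:17:00', '0', '58.9341'],  # seconds are 00
    ['2020-08-03 23:05:00', '0', '58.9683'],  # seconds are 00
]

# Substring test: also picks up the 22:00:15 row.
loose = [row for row in rows if ':00' in row[0]]

# Suffix test: only rows whose timestamp ends in :00.
strict = [row for row in rows if row[0].endswith(':00')]

kept_rows = []  # one entry per matching row
flat = []       # all fields of all matching rows in a single list
for row in strict:
    kept_rows.append(row)
    flat.extend(row)

print(len(loose), len(strict))
print(kept_rows)
print(flat)
```

If each match should stay a separate record, append is usually the more convenient choice.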
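Contributor with 1757 experience points, 7+ upvotes
You say your file is large? Then it may be better to filter the data while reading it.
You can do that without any libraries: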
zero_entries = []
with open(path_to_file, "r") as file:
    # iterates over every line
    for line in file:
        # finds the end of the first cell
        timestamp_end = line.index(',')
        # checks if the timestamp ends on zero seconds and adds it to a list.
        if line[:timestamp_end].endswith(':00'):
            zero_entries.append(line)
print(zero_entries)
I am assuming that the timestamp is always the first element of the line.
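If the file is too large to keep every matching line in a list, the same check can be written as a generator that streams matches straight into an output file. This is only a sketch under the same assumption about the timestamp position; the function name iter_zero_second_lines and the temporary file names are hypothetical. It uses str.partition instead of str.index, because index raises ValueError on a line without a comma (for example a blank trailing line).

```python
import os
import tempfile
from typing import Iterator


def iter_zero_second_lines(path: str) -> Iterator[str]:
    """Yield the lines whose first comma-separated field ends with ':00'."""
    with open(path, "r") as file:
        for line in file:
            timestamp, sep, _rest = line.partition(',')
            # sep is empty when the line contains no comma; skip such lines
            if sep and timestamp.endswith(':00'):
                yield line


# Small demonstration with a temporary input and output file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.csv")
    dst = os.path.join(tmp, "out.csv")
    with open(src, "w") as f:
        f.write("2020-08-03 22:17:12,0,0,4803\n")
        f.write("2020-08-03 22:17:00,0,0,4805\n")
        f.write("\n")  # blank line, skipped instead of raising an error
    # writelines consumes the generator one line at a time
    with open(dst, "w") as out:
        out.writelines(iter_zero_second_lines(src))
    with open(dst) as f:
        kept = f.read().splitlines()

print(kept)
```

Only one line is held in memory at a time, so memory use does not grow with the number of matches.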
Depending on the size of your file, this will be much faster than Trenton's solution (I tested it with ~58k rows):
import time
import pandas as pd
path = r"txt.csv"
start = time.time()
zero_entries = []
with open(path, "r") as file:
    for line in file:
        semi_index = line.index(',')
        if line[:semi_index].endswith(':00'):
            zero_entries.append(line)
end = time.time()
print(end-start)
start = time.time()
df = pd.read_csv(path, header=None, parse_dates=[0])
# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]
end = time.time()
print(end-start)
0.04886937141418457 # built-in
0.27971720695495605 # pandas
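A single time.time() measurement can be noisy, especially for runs this short. A sketch of how the built-in version could be timed over several repeats with the standard timeit module follows; the generated file and the function name filter_builtin are assumptions for illustration, and the pandas version could be timed the same way by passing a different callable.

```python
import os
import tempfile
import timeit


def filter_builtin(path):
    """Return the lines whose first field ends with ':00'."""
    zero_entries = []
    with open(path, "r") as file:
        for line in file:
            if line[:line.index(',')].endswith(':00'):
                zero_entries.append(line)
    return zero_entries


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bench.csv")
    # Generated data: 10,000 rows whose seconds cycle through 00..59.
    with open(path, "w") as f:
        for i in range(10_000):
            f.write(f"2020-08-03 22:17:{i % 60:02d},0,0,4803\n")

    result = filter_builtin(path)
    # 5 repeats of 3 calls each; each entry is the total time for 3 calls.
    timings = timeit.repeat(lambda: filter_builtin(path), repeat=5, number=3)

best = min(timings) / 3
print(f"{len(result)} matches, best of 5 repeats: {best:.6f} s per call")
```

Taking the minimum of the repeats is the usual way to reduce the influence of other processes on the measurement.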