
Filter a text file based on the timestamp at the beginning of each line

千巷猫影 2023-06-13 10:44:17
I have a huge text file, and I want to extract the lines, with their associated data, that fall at the top of each minute. Here are a few lines from the file, which is a snippet of more than 36 hours of data. By "associated" I mean the 8 data points that follow the timestamp.

2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
...

I can't find a way to make Python look at the seconds part of the timestamp and then build a new list containing only the data associated with ":00" seconds.

for line in fh:
    line = line.rstrip("\n")
    line = line.split(",")
    masterlist.extend(line)  # this is putting the information into one list
    altmasterlist.append(line)  # this is putting the lines of information into a list
for line in altmasterlist:
    if ":00" in line:
        finalmasterlist.extend(line)  # Nothing is entering this if statement
print(finalmasterlist)

Am I even in the right area with these two for loops?

3 Answers

MMTTMM

Contributed 1869 experience points, received more than 4 likes

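  • Use pandas

    • The main difference is that pandas converts all of the data to the correct dtype (e.g. datetime, int, and float), and the code is more concise.

    • In addition, the data is now in a format that is useful for time-series analysis and plotting, though I suggest adding column names.

    • df.columns = ['datetime', ..., 'price']

    • The filtering itself can be done with a one-line vectorized operation.

    • As the %%timeit test shows, for 1M rows of data, using pandas takes a similar amount of time to reading the file with "with open" and searching each line for :00 with str methods (774 ms versus 668 ms in the runs shown).

  • Read the file with pandas.read_csv and parse the dates in column 0.

    • Use header=None, because the test data has no header row.

  • Use Boolean indexing to select the rows whose timestamp has seconds equal to 0.

    • Use the .dt accessor to get .second.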

import pandas as pd

# read the file which apparently has no header and parse the date column
df = pd.read_csv('test.csv', header=None, parse_dates=[0])

# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]

# save the data
top_of_the_minute.to_csv('clean.csv', header=False, index=False)

# display(top_of_the_minute)

                    0  1  2     3     4   5      6      7        8
5 2020-08-03 22:17:00  0  0  4803  4800  91  28.05  24.05  58.8917
6 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.8925
7 2020-08-03 22:17:00  0  0  4805  4800  91  28.05  24.05  58.9341
8 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.9683
9 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  23.05  58.9780


# example: rename columns
top_of_the_minute.columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

# example: plot the data
p = top_of_the_minute.plot('datetime', 'p3')
p.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
p.set_xlim('2020-08', '2020-09')

[Image: plot of p3 against datetime, //img2.sycdn.imooc.com/6487d88f0001aef004790265.jpg]
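As a possible variation on the renaming step, the column names could be supplied when the file is read, so the filter can refer to the timestamp column by name. This is only a sketch: the names are the placeholder names used in the answer (the question does not say what the columns mean), and the two sample rows are inlined with io.StringIO so the snippet runs on its own.

```python
import io

import pandas as pd

# Two sample rows in the same layout as test.csv (inlined so the sketch is self-contained).
sample = io.StringIO(
    "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917\n"
    "2020-08-03 22:18:00,0,0,4802,4800,91,28.05,24.05,58.8925\n"
)

# Placeholder names; the real meaning of each column is not given in the question.
column_names = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

# names= labels the columns at read time; parse_dates accepts those labels.
df = pd.read_csv(sample, header=None, names=column_names, parse_dates=['datetime'])

# Same Boolean indexing as in the answer, now addressed by column name.
top_of_the_minute = df[df['datetime'].dt.second == 0]
print(top_of_the_minute)
```

Naming the columns up front avoids assigning to .columns on the filtered frame afterwards.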

test.csv

2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:00,0,0,4802,4800,91,28.05,23.05,58.978

%%timeit test

Create test data

# read test.csv
df = pd.read_csv('test.csv', header=None, parse_dates=[0])

# create a dataframe with 1M rows
df = pd.concat([df] * 100000)

# save the new test data
df.to_csv('test.csv', index=False, header=False)

test_sk

def test_sk(path: str):
    zero_entries = []

    with open(path, "r") as file:
        for line in file:
            semi_index = line.index(',')
            if line[:semi_index].endswith(':00'):
                zero_entries.append(line)
    return zero_entries


%%timeit
result_sk = test_sk('test.csv')

[out]:
668 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

test_tm

def test_tm(path: str):
    df = pd.read_csv(path, header=None, parse_dates=[0])
    return df[df[0].dt.second == 0]


%%timeit
result_tm = test_tm('test.csv')

[out]:
774 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
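If the file is too large to load comfortably in one go, the same pandas filter could be applied chunk by chunk using the chunksize parameter of pandas.read_csv. The following is a minimal sketch, not part of the original answer: the function name filter_in_chunks, the chunk size, and the four sample rows are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def filter_in_chunks(path: str, chunksize: int = 100_000) -> pd.DataFrame:
    """Return rows whose timestamp (column 0) has seconds == 0, reading chunk by chunk."""
    kept = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows each.
    for chunk in pd.read_csv(path, header=None, parse_dates=[0], chunksize=chunksize):
        kept.append(chunk[chunk[0].dt.second == 0])
    return pd.concat(kept, ignore_index=True)


# Small demonstration file (hypothetical data in the same layout as test.csv).
rows = [
    "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917",
    "2020-08-03 22:18:00,0,0,4802,4800,91,28.05,24.05,58.8925",
    "2020-08-03 22:18:01,0,0,4805,4800,91,28.05,24.05,58.9341",
    "2020-08-03 22:19:00,0,0,4802,4800,91,28.05,23.05,58.978",
]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")

# A tiny chunksize is used here only to force more than one chunk.
result = filter_in_chunks(tmp.name, chunksize=2)
os.remove(tmp.name)
print(result)
```

Only the matching rows of each chunk are kept, so peak memory depends on the chunk size and the number of matches rather than on the full file.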


Answered 2023-06-13
慕桂英4014372

Contributed 1871 experience points, received more than 13 likes

Try this:


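finalmasterlist2 = []
for row in altmasterlist:
    # row[0] is the timestamp string; endswith checks only the seconds,
    # so a ":00" in the minutes position (e.g. 22:00:15) is not matched
    if row[0].endswith(":00"):
        finalmasterlist2.extend(row)
print("finalmasterlist2")
print(finalmasterlist2)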

Input:

2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978

Output:

['2020-08-03 22:17:00', '0', '0', '4805', '4800', '91', '28.05', '24.05', '58.9341']
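As background on why the second loop in the question never matched, and why looking only at the end of the timestamp is the safer test: after line.split(","), each line is a list, and the in operator on a list compares whole elements rather than searching inside them. The short sketch below illustrates the difference with one sample row; the variable names are made up for the illustration.

```python
row = "2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341".split(",")

# On a list, `in` compares whole elements: no element is exactly ":00".
in_list = ":00" in row

# On a string, `in` is a substring test, so the timestamp element matches.
in_timestamp = ":00" in row[0]

# A substring test also matches ":00" in the minutes position ...
minutes_false_positive = ":00" in "2020-08-03 22:00:15"

# ... whereas endswith only looks at the seconds.
seconds_only = "2020-08-03 22:00:15".endswith(":00")

print(in_list, in_timestamp, minutes_false_positive, seconds_only)
# False True True False
```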


Answered 2023-06-13
长风秋雁

Contributed 1757 experience points, received more than 7 likes

You said your file is huge? Then it is probably better to filter the data while reading it.

You can do that without any libraries:


zero_entries = []

with open(path_to_file, "r") as file:
    # iterates over every line
    for line in file:
        # finds the end of the first cell
        timestamp_end = line.index(',')

        # checks if the timestamp ends on zero seconds and adds the line to a list
        if line[:timestamp_end].endswith(':00'):
            zero_entries.append(line)

print(zero_entries)

I am assuming that the timestamp is always the first element of the line.
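For a very large file, one option is to write the matching lines straight to an output file instead of collecting them in a list. The sketch below is an illustration under that assumption, not the answer's own code: the function name copy_top_of_minute and the file names are hypothetical, and str.partition is swapped in for line.index so that a line without a comma does not raise ValueError.

```python
import os
import tempfile


def copy_top_of_minute(src_path: str, dst_path: str) -> int:
    """Copy lines whose first field ends in ':00' from src to dst; return the count written."""
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # partition never raises: without a comma, the whole line becomes `timestamp`.
            timestamp, _, _ = line.partition(",")
            if timestamp.strip().endswith(":00"):
                dst.write(line)
                written += 1
    return written


# Demonstration with a small temporary file (made-up rows, including one blank line).
with tempfile.TemporaryDirectory() as tmp_dir:
    src_file = os.path.join(tmp_dir, "data.csv")
    dst_file = os.path.join(tmp_dir, "clean.csv")
    with open(src_file, "w") as fh:
        fh.write(
            "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917\n"
            "2020-08-03 22:18:00,0,0,4802,4800,91,28.05,24.05,58.8925\n"
            "\n"
            "2020-08-03 22:19:00,0,0,4802,4800,91,28.05,23.05,58.978\n"
        )
    count = copy_top_of_minute(src_file, dst_file)
    with open(dst_file) as fh:
        kept_lines = fh.read().splitlines()

print(count, kept_lines)
```

Because only one line is held at a time, memory use stays flat regardless of the file size.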


Depending on the size of your file, this can be considerably faster than Trenton's pandas solution (I tested it with ~58k rows):


import time
import pandas as pd

path = r"txt.csv"

start = time.time()
zero_entries = []

with open(path, "r") as file:
    for line in file:
        semi_index = line.index(',')
        if line[:semi_index].endswith(':00'):
            zero_entries.append(line)

end = time.time()
print(end-start)


start = time.time()

df = pd.read_csv(path, header=None, parse_dates=[0])
# using Boolean indexing to select data when seconds = 00
top_of_the_minute = df[df[0].dt.second == 0]

end = time.time()
print(end-start)

0.04886937141418457 # built-in
0.27971720695495605 # pandas
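A single time.time() measurement can be noisy, so a repeated measurement may give a steadier comparison. The sketch below uses the standard-library timeit module; the helper best_time, the wrapper filter_builtin, and the synthetic one-hour file are assumptions made for this illustration and are not the data used for the timings above.

```python
import os
import tempfile
import timeit


def filter_builtin(path: str) -> list:
    """The plain-Python filter from this answer, wrapped in a function for timing."""
    zero_entries = []
    with open(path, "r") as file:
        for line in file:
            if line[:line.index(",")].endswith(":00"):
                zero_entries.append(line)
    return zero_entries


def best_time(func, path: str, repeat: int = 5) -> float:
    """Run func(path) `repeat` times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(lambda: func(path), number=1, repeat=repeat))


# Synthetic file: one line per second for an hour (made-up values, same layout as the question).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    for minute in range(60):
        for second in range(60):
            tmp.write(
                f"2020-08-03 22:{minute:02d}:{second:02d},0,0,4803,4800,91,28.05,24.05,58.8917\n"
            )

matches = filter_builtin(tmp.name)
elapsed = best_time(filter_builtin, tmp.name)
os.remove(tmp.name)
print(f"{len(matches)} matching lines, best of 5 runs: {elapsed:.6f} s")
```

A function wrapping the pandas version could be passed to best_time in the same way to compare both approaches on the same file.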

