
C# dictionary for a large number of items

C#
HUH函数 2021-12-25 16:53:41
I'd like to understand the cost of storing a very large number of items in memory in C#. The data structure I need is a dictionary or something similar. Say the number of items I want is around 100 million, but the application will not reach that number right away; it will take a long time before we approach that limit.

I'm concerned with the amortized cost of operations, but at no point in time can I afford a single extremely expensive operation. Dynamic data structures usually reallocate themselves when they become full, and in the dictionary's case I believe it even re-indexes every item. So suppose the application is maintaining 20 million items and has just hit the dictionary's capacity: when the new dictionary storage is allocated, all 20 million items need to be re-indexed.

That is why I think an array of dictionaries might be a good idea. Say I create 256 dictionaries. That immediately bounds the size of each inner dictionary to fewer than 1 million items, which should be easy to manage, and all the re-indexing happens on the way up to at most 1 million items. The cost appears to be just one extra indexing step per operation to find the right dictionary to look in.

Is this a reasonable approach? Is my analysis correct, or will the C# dictionary behave better for some reason I'm missing? Is there another, better solution? I'm looking for a data structure with the same time complexity as the C# dictionary.

Edit: The dictionary key is a random value, so I can use its first byte to find my index into the array of 256 dictionaries very cheaply. I'm not currently considering a database, because I want all items immediately available at very low cost. I need constant-time lookups with very little overhead. I can live with slower inserts, as long as they stay constant time; the same goes for deletes, which may be a bit slower but must stay constant time. All items should fit in memory. The items are small, about 50 bytes of data each, so the data structure must not add much per-item overhead.
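A minimal sketch of the 256-dictionary sharding idea described above; the ulong key type and byte[] value type are assumptions for illustration, since the question only says the keys are random and the items are ~50 bytes:

using System.Collections.Generic;

// Sketch of the sharding idea: 256 inner dictionaries, with one byte of the random
// key selecting the shard, so each inner dictionary stays well under ~1 million items
// and any re-indexing only ever touches one small shard.
class ShardedDictionary
{
    private readonly Dictionary<ulong, byte[]>[] _shards = new Dictionary<ulong, byte[]>[256];

    public ShardedDictionary()
    {
        for (int i = 0; i < _shards.Length; i++)
            _shards[i] = new Dictionary<ulong, byte[]>();
    }

    // Keys are random, so the low byte spreads items evenly across the 256 shards.
    private Dictionary<ulong, byte[]> ShardFor(ulong key) => _shards[(int)(key & 0xFF)];

    public void Add(ulong key, byte[] value) => ShardFor(key).Add(key, value);

    public bool TryGetValue(ulong key, out byte[] value) => ShardFor(key).TryGetValue(key, out value);

    public bool Remove(ulong key) => ShardFor(key).Remove(key);
}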

3 Answers

慕标5832272


Update: I've edited this since I first posted it to:

  • Store a fixed-size object (a byte[50]) each pass,

  • Pre-allocate all of them before adding to the dictionary (rather than creating the objects inside the loop),

  • Run GC.Collect() after the pre-allocation,

  • Set gcAllowVeryLargeObjects to true,

  • Make sure it's built for x64 (it was before, but then I switched to "Release" to build and run outside of VS... and that reset it, oops),

  • Try it both with and without pre-allocating the dictionary size.

Here is the code as it stands now:

var arrays = new byte[100000000][];

System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();
stopwatch.Start();

// Pre-allocate all 100 million byte[50] payloads before touching the dictionary.
for (var i = 0; i < 100000000; i++)
{
    arrays[i] = new byte[50];
}

stopwatch.Stop();
Console.WriteLine($"initially allocating arrays took {stopwatch.ElapsedMilliseconds} ms");
stopwatch.Restart();

GC.Collect();
Console.WriteLine($"GC after array allocation took {stopwatch.ElapsedMilliseconds} ms");

// Pre-allocated capacity; swap which line is commented out to compare
// against a dictionary that has to grow as items are added.
Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>(100000000);
//Dictionary<int, byte[]> dict = new Dictionary<int, byte[]>();

// 100 passes of 1 million inserts each, timing each pass separately.
for (var c = 0; c < 100; c++)
{
    stopwatch.Restart();
    for (var i = 0; i < 1000000; i++)
    {
        dict.Add((c * 1000000) + i, arrays[(c * 1000000) + i]);
    }
    stopwatch.Stop();
    Console.WriteLine($"pass number {c} took {stopwatch.ElapsedMilliseconds} milliseconds");
}

Console.ReadLine();

Here is the output when I do not pre-allocate the dictionary size:


initially allocating arrays took 14609 ms

GC after array allocation took 3713 ms

pass number 0 took 63 milliseconds

pass number 1 took 51 milliseconds

pass number 2 took 78 milliseconds

pass number 3 took 28 milliseconds

pass number 4 took 32 milliseconds

pass number 5 took 133 milliseconds

pass number 6 took 41 milliseconds

pass number 7 took 31 milliseconds

pass number 8 took 27 milliseconds

pass number 9 took 26 milliseconds

pass number 10 took 45 milliseconds

pass number 11 took 335 milliseconds

pass number 12 took 34 milliseconds

pass number 13 took 35 milliseconds

pass number 14 took 71 milliseconds

pass number 15 took 66 milliseconds

pass number 16 took 64 milliseconds

pass number 17 took 58 milliseconds

pass number 18 took 71 milliseconds

pass number 19 took 65 milliseconds

pass number 20 took 68 milliseconds

pass number 21 took 67 milliseconds

pass number 22 took 83 milliseconds

pass number 23 took 11986 milliseconds

pass number 24 took 7948 milliseconds

pass number 25 took 38 milliseconds

pass number 26 took 36 milliseconds

pass number 27 took 27 milliseconds

pass number 28 took 31 milliseconds

..SNIP lots between 30-40ms...

pass number 44 took 34 milliseconds

pass number 45 took 34 milliseconds

pass number 46 took 33 milliseconds

pass number 47 took 2630 milliseconds

pass number 48 took 12255 milliseconds

pass number 49 took 33 milliseconds

...SNIP a load of lines which are all between 30 to 50ms...

pass number 93 took 39 milliseconds

pass number 94 took 43 milliseconds

pass number 95 took 7056 milliseconds

pass number 96 took 33323 milliseconds

pass number 97 took 228 milliseconds

pass number 98 took 70 milliseconds

pass number 99 took 84 milliseconds

You can clearly see the points where it has to reallocate. I'd guess it simply doubles the size of the backing storage and re-copies the current items, since there is a long stretch at the end where it doesn't do this at all. Some of these reallocations are really expensive (30+ seconds! ouch).
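A hedged aside: on runtimes where Dictionary<TKey, TValue> exposes EnsureCapacity (.NET Core 2.0 and later), space for a known batch can also be reserved on an existing dictionary before inserting, so the resize and rehash happen once at a predictable point rather than partway through a pass. A small sketch:

using System.Collections.Generic;

class EnsureCapacityExample
{
    // Reserve room for the next batch up front so the resize (and the rehash of every
    // existing entry) happens once, here, instead of in the middle of the inserts.
    static void AddBatch(Dictionary<int, byte[]> dict, int startKey, int batchSize)
    {
        dict.EnsureCapacity(dict.Count + batchSize); // available on .NET Core 2.0+
        for (int i = 0; i < batchSize; i++)
        {
            dict.Add(startKey + i, new byte[50]);
        }
    }
}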


Here is the output when I do pre-allocate the dictionary size:


initially allocating arrays took 15494 ms

GC after array allocation took 2622 ms

pass number 0 took 9585 milliseconds

pass number 1 took 107 milliseconds

pass number 2 took 91 milliseconds

pass number 3 took 145 milliseconds

pass number 4 took 83 milliseconds

pass number 5 took 118 milliseconds

pass number 6 took 133 milliseconds

pass number 7 took 126 milliseconds

pass number 8 took 65 milliseconds

pass number 9 took 52 milliseconds

pass number 10 took 42 milliseconds

pass number 11 took 34 milliseconds

pass number 12 took 45 milliseconds

pass number 13 took 48 milliseconds

pass number 14 took 46 milliseconds

..SNIP lots between 30-80ms...

pass number 45 took 80 milliseconds

pass number 46 took 65 milliseconds

pass number 47 took 64 milliseconds

pass number 48 took 65 milliseconds

pass number 49 took 122 milliseconds

pass number 50 took 103 milliseconds

pass number 51 took 45 milliseconds

pass number 52 took 77 milliseconds

pass number 53 took 64 milliseconds

pass number 54 took 96 milliseconds

..SNIP lots between 30-80ms...

pass number 77 took 44 milliseconds

pass number 78 took 85 milliseconds

pass number 79 took 142 milliseconds

pass number 80 took 138 milliseconds

pass number 81 took 47 milliseconds

pass number 82 took 44 milliseconds

..SNIP lots between 30-80ms...

pass number 93 took 52 milliseconds

pass number 94 took 50 milliseconds

pass number 95 took 63 milliseconds

pass number 96 took 111 milliseconds

pass number 97 took 175 milliseconds

pass number 98 took 96 milliseconds

pass number 99 took 67 milliseconds

Memory usage climbs above 9GB while the arrays are initially being created, drops back to roughly 6.5GB after GC.Collect, creeps back over 9GB while adding to the dictionary, and then, once it's finished (and sitting idle at Console.ReadLine()), falls back to ~3.7GB and stays there.


It is clearly far faster to work with when the dictionary is pre-allocated, though.


For reference, the original answer is below.


I just threw together this little test. I don't know what you're storing, so I created a meaningless class with not much information in it and used an int as the key, but my two takeaways from it are:


It doesn't appear to get gradually slower at adding to the dictionary until it grows to around 40 million items. Running a "Release" build targeting x64, each million inserts takes about 500 ms, and then from 41 to 46 million each takes roughly 700-850 ms (there is a noticeable jump at that point).


It got to just over 46,000,000 items, consuming about 4GB of RAM, and then fell over with an out-of-memory exception.


Use a database, or the dictionary-abuse team will come for you.


The code:


using System;
using System.Collections.Generic;

class AThing
{
    public string Name { get; set; }
    public int id { get; set; }
}

class Program
{
    static void Main(string[] args)
    {
        Dictionary<int, AThing> dict = new Dictionary<int, AThing>();

        // 100 passes of 1 million inserts each, timing each pass.
        for (var c = 0; c < 100; c++)
        {
            DateTime nowTime = DateTime.Now;
            for (var i = 0; i < 1000000; i++)
            {
                var thing = new AThing { id = (c * 1000000) + i, Name = $"Item {(c * 1000000) + i}" };
                dict.Add(thing.id, thing);
            }
            var timeTaken = DateTime.Now - nowTime;
            // TotalMilliseconds is the full elapsed time; Milliseconds would only give
            // the millisecond component (0-999) of the TimeSpan.
            Console.WriteLine($"pass number {c} took {timeTaken.TotalMilliseconds} milliseconds");
        }
    }
}


慕容3067478


I know it has been three years since you originally asked this question. But I ran into the same problem myself and managed to solve it by implementing a FixedSizeDictionary<TKey, TValue>, where I pass the maximum size as an int; while items keep being added to it, it also removes the oldest item once the item count passes the provided fixed value.
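The answer doesn't show its implementation. A minimal sketch of what such a FixedSizeDictionary<TKey, TValue> might look like; the class name and behaviour (pass a maximum size, evict the oldest entry once the count passes it) come from the answer, while the detail of tracking insertion order with a queue is an assumption:

using System.Collections.Generic;

public class FixedSizeDictionary<TKey, TValue>
{
    private readonly int _maxSize;
    private readonly Dictionary<TKey, TValue> _items;
    private readonly Queue<TKey> _insertionOrder = new Queue<TKey>();

    public FixedSizeDictionary(int maxSize)
    {
        _maxSize = maxSize;
        // Pre-allocate to the fixed size so the inner dictionary never has to grow.
        _items = new Dictionary<TKey, TValue>(maxSize);
    }

    public int Count => _items.Count;

    public bool TryGetValue(TKey key, out TValue value) => _items.TryGetValue(key, out value);

    public void Add(TKey key, TValue value)
    {
        // Once the fixed size is reached, drop the oldest entry before adding the new one.
        if (_items.Count >= _maxSize)
        {
            _items.Remove(_insertionOrder.Dequeue());
        }
        _items.Add(key, value);
        _insertionOrder.Enqueue(key);
    }
}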


人到中年有点甜


If you expect the program to work once the dictionary is at its maximum size, why not allocate it at that maximum size from the start and avoid the re-indexing and so on entirely? The amount of memory used only differs from the other solutions temporarily, but the time saved is not temporary; also, while the dictionary is still mostly empty, the chance of collisions is very low.
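A minimal sketch of that suggestion, reusing the ~100 million figure from the question; the long key and byte[] value types are assumptions:

using System.Collections.Generic;

class PreallocatedStore
{
    // Allocate the dictionary at its maximum expected size once, up front, so it
    // never grows and never re-indexes existing items as entries trickle in over time.
    private readonly Dictionary<long, byte[]> _items = new Dictionary<long, byte[]>(100_000_000);

    public void Add(long key, byte[] value) => _items.Add(key, value);

    public bool TryGetValue(long key, out byte[] value) => _items.TryGetValue(key, out value);
}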

