JMS Serializer performance issue with more than 10,000 entries

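I am currently building a PHP command that updates my ElasticSearch indexes. One major thing I noticed is that serializing the entities takes far too long once my array holds more than 10,000 of them. I expected the cost to be linear, but 6k or 9k entities both take about a minute (there is not much difference between 6k and 9k), whereas past 10k it slows down to the point of taking up to 10 minutes.

    ...
    // we iterate on the documents previously requested to the sql database
    foreach ($entities as $index_name => $entity_array) {
        $underscoreClassName = $this->toUnderscore($index_name); // elasticsearch understands underscored names
        $camelcaseClassName = $this->toCamelCase($index_name); // sql understands camelcase names
        // we get the serialization groups for each index from the config file
        $groups = $indexesInfos[$underscoreClassName]['types'][$underscoreClassName]['serializer']['groups'];
        foreach ($entity_array as $entity) {
            // each entity is serialized as a json array
            $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
            // each serialized entity as json is converted as an Elastica document
            $documents[$index_name][] = new \Elastica\Document($entityToFind[$index_name][$entity->getId()], $data);
        }
    }
    ...

There is a whole class around this, but this is the part that takes most of the time. I can understand that serialization is a heavy operation and takes time, but why is there almost no difference between 6, 7, 8 or 9k entities, while it takes so much longer once there are more than 10k?

PS: For reference, I opened an issue on GitHub.

Edit: To explain more precisely what I am trying to do: we have a SQL database on a Symfony project, with Doctrine linking the two, and we use ElasticSearch (with the FOSElastica bundle and Elastica) to index our data into ElasticSearch. The problem is that while FOSElastica takes care of updating data that changed in the SQL database, it does not update every index that contains this data. (For example, if you have an author and the two books he wrote, in ES you will have the two books with the author embedded in them, plus the author. FOSElastica only updates the author, not the author information inside the two books.) To fix this, I am writing a script that listens to every update done through Doctrine, fetches every ElasticSearch document related to the update, and updates it. This works, but it takes too long in my stress tests, where more than 10,000 large documents need updating.

Edit: To add more information about what I have tried: I have the same problem when using FOSElastica's "populate" command. At 9k everything is fine and smooth; at 10k it takes a really long time. Right now I am running tests in which I reduce the size of the array in my script and reset it, with no luck so far.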


2 Answers

守着星空守着你

Contributed 1799 experience points, received more than 8 upvotes

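I changed the way my algorithm works: it first gets all the ids that need to be updated, then fetches the entities from the database in batches of 500 to 1000 (I am still running tests).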

    /*
     * to avoid creating arrays with too many objects, we loop over the ids and split them by DEFAULT_BATCH_SIZE
     * this way we fetch them in packs of DEFAULT_BATCH_SIZE and update them by the same amount
     */
    $total = count($idsToRequest);
    $currentSetOfIds = [];
    $documents = [];
    for ($i = 0; $i < $total; $i++) {
        $currentSetOfIds[] = $idsToRequest[$i];
        // every time we have DEFAULT_BATCH_SIZE ids, or at the end of the loop, we update the documents
        // ($i + 1 is used so that the first batch is full instead of holding a single id)
        if (($i + 1) % self::DEFAULT_BATCH_SIZE === 0 || $i === $total - 1) {
            // retrieves a batch of entities from the database
            $entities = $thatRepo->findBy(array('id' => $currentSetOfIds));
            // serialize the entities fetched above and create documents from them
            foreach ($entities as $entity) {
                $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
                $documents[] = new \Elastica\Document($entityToFind[$indexName][$entity->getId()], $data);
            }
            // update all the serialized documents
            $elasticaType->updateDocuments($documents);
            // reset the arrays
            $currentSetOfIds = [];
            $documents = [];
        }
    }

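I now update the documents in batches of that same size, but it still has not improved the performance of the serialize call. I really do not understand what difference it makes to the serializer whether I have 9k or 10k entities, since it never knows about the total...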
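The batching described in this answer can also be written with `array_chunk`, which avoids the modulo bookkeeping. This is only a minimal sketch: `processInBatches`, `$fetchBatch` and `$flushBatch` are hypothetical names that stand in for the repository lookup and the Elastica bulk update, and they are not part of JMS Serializer, Doctrine or Elastica.

```php
<?php
declare(strict_types=1);

/**
 * Processes $ids in fixed-size batches and returns the number of batches.
 *
 * $fetchBatch and $flushBatch are hypothetical callables standing in for
 * the database lookup and the serialize-and-update step.
 */
function processInBatches(array $ids, int $batchSize, callable $fetchBatch, callable $flushBatch): int
{
    $batches = 0;
    foreach (array_chunk($ids, $batchSize) as $chunk) {
        $entities = $fetchBatch($chunk); // e.g. a findBy on the ids in $chunk
        $flushBatch($entities);          // e.g. serialize, then one bulk update
        unset($entities);                // drop references before the next batch
        $batches++;
    }

    return $batches;
}
```

Each batch is fetched, flushed and released before the next one starts, so at most `$batchSize` entities are referenced by this loop at any time.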

2021-10-15

阿波罗的战车

Contributed 1862 experience points, received more than 6 upvotes

In my opinion, you should check memory consumption: you are building one big array that holds a lot of objects.
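One way to check this is to sample memory and elapsed time at regular intervals while the loop runs. The helper below is a rough sketch rather than a profiler; `profileLoop` and its parameters are names made up for this illustration.

```php
<?php
declare(strict_types=1);

/**
 * Runs $work once per item and records a sample every $every items.
 * Each sample is [itemsProcessed, bytesAllocated, secondsElapsed].
 */
function profileLoop(iterable $items, callable $work, int $every = 1000): array
{
    $samples = [];
    $start = microtime(true);
    $count = 0;
    foreach ($items as $item) {
        $work($item);
        $count++;
        if ($count % $every === 0) {
            $samples[] = [$count, memory_get_usage(true), microtime(true) - $start];
        }
    }

    return $samples;
}
```

If the allocated bytes keep growing from sample to sample, or the time between samples jumps after a certain count, that points at the accumulated objects rather than at the cost of serializing a single entity.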

You have two options: use a generator to avoid building that array, or push your documents every "x" iterations and reset your array.
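Both options can be combined. The sketch below uses one generator to produce documents lazily and a second one to group them into batches, so only one batch exists in memory at a time. `inBatches` and `toDocuments` are hypothetical helpers, and `json_encode` stands in for the real serializer call.

```php
<?php
declare(strict_types=1);

/**
 * Lazily groups any iterable into lists of at most $size items.
 */
function inBatches(iterable $source, int $size): Generator
{
    $batch = [];
    foreach ($source as $item) {
        $batch[] = $item;
        if (count($batch) === $size) {
            yield $batch;
            $batch = []; // start a fresh batch once the consumer resumes
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly shorter batch
    }
}

/**
 * Hypothetical stand-in for "serialize one entity into one document".
 */
function toDocuments(iterable $entityIds): Generator
{
    foreach ($entityIds as $id) {
        yield ['id' => $id, 'body' => json_encode(['id' => $id])];
    }
}
```

A consumer would then loop with `foreach (inBatches(toDocuments($ids), 500) as $batch)` and send each `$batch` in one bulk update, without ever holding the full document list.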

I hope this gives you an idea of how to handle this kind of migration.

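By the way, I almost forgot to tell you to avoid using ORM/ODM repositories to retrieve data (in a migration script). The problem is that they work with objects and hydrate them, and honestly, in a huge migration script you will just wait forever. If possible, just use the Database object, which is probably enough for your needs.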
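As an illustration of reading rows without hydration, the sketch below streams plain associative arrays from a PDO connection. It assumes a plain PDO handle is an acceptable "Database object" for the script; `streamRows` is a hypothetical helper, and the SQL passed to it is up to the caller.

```php
<?php
declare(strict_types=1);

/**
 * Streams result rows one at a time as associative arrays.
 * No entity objects are created, so nothing is hydrated or tracked by an ORM.
 */
function streamRows(PDO $pdo, string $sql, array $params = []): Generator
{
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    while (($row = $stmt->fetch(PDO::FETCH_ASSOC)) !== false) {
        yield $row;
    }
}
```

The rows can be turned into JSON with `json_encode` or fed to a batching loop. The trade-off is that serializer groups and other entity-level metadata no longer apply, so the selected columns have to match the target document structure by hand.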

2021-10-15
