I'm writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.
The algorithm runs in a process with user interaction, where only a few seconds of response time are acceptable. So I need to improve my strategy for handling the XML files.
Now I'm thinking about a multithreaded solution (which scales better on 16 Core+ hardware). I thought about the following strategies:
I want to improve both the overall performance and the "per file" performance.
Do you have experience with such problems? What is the best way to go?
This one is obvious: just create several parsers and run them in parallel in multiple threads.
Take a look at Woodstox Performance (down at the moment, try the Google cache).
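As a rough illustration of running several parsers in parallel, here is a minimal sketch that gives every file its own StAX reader and runs the readers on a fixed thread pool. It uses only the standard `javax.xml.stream` API, which Woodstox implements. The class name `ParallelFileParser`, the method names, and the "count the start tags" workload are assumptions made for the example, not something taken from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ParallelFileParser {

    // Hypothetical per-file work: parse one file and count its start tags.
    public static int countStartElements(Path file) {
        // One factory per task, so the sketch does not rely on factory thread safety.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse " + file, e);
        }
    }

    // Submits one parsing task per file and collects the results in input order.
    public static Map<Path, Integer> parseAll(List<Path> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<Path, Future<Integer>> pending = new LinkedHashMap<>();
            for (Path file : files) {
                Callable<Integer> task = () -> countStartElements(file);
                pending.put(file, pool.submit(task));
            }
            Map<Path, Integer> results = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<Integer>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParallelFileParser.parseAll(files, Runtime.getRuntime().availableProcessors())` would size the pool to the machine. This improves overall throughput across many files, but it does nothing for the time spent on any single file.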
This can be done if the structure of your XML is predictable, that is, if it has a lot of identical top-level elements. For instance:
<element>
<more>more elements</more>
</element>
<element>
<other>other elements</other>
</element>
In this case you could create a simple splitter that searches for <element> and feeds each such part to its own parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start and stop points (<element>) and then create a custom FileInputStream that operates on just one part of the file.
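The simplified splitter idea could look roughly like the sketch below. It works on an in-memory string rather than on RandomAccessFile, and it assumes that `<element>` blocks are never nested, carry no attributes, and that the literal text `<element>` never appears inside comments or CDATA. The class `ElementSplitter` and its methods are hypothetical names chosen for the illustration, and counting start tags stands in for the real per-chunk work.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ElementSplitter {

    private static final String OPEN = "<element>";
    private static final String CLOSE = "</element>";

    // Cuts the input into one string per <element>...</element> block.
    public static List<String> split(String xml) {
        List<String> chunks = new ArrayList<>();
        int start = xml.indexOf(OPEN);
        while (start >= 0) {
            int end = xml.indexOf(CLOSE, start);
            if (end < 0) {
                throw new IllegalArgumentException("Unclosed <element> at offset " + start);
            }
            end += CLOSE.length();
            chunks.add(xml.substring(start, end));
            start = xml.indexOf(OPEN, end);
        }
        return chunks;
    }

    // Parses a single chunk as a small document of its own and counts its start tags.
    public static int countStartElements(String chunk) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(chunk));
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to parse chunk", e);
        }
    }

    // Runs one parser per chunk on a thread pool; results keep the chunk order.
    public static List<Integer> parseInParallel(String xml, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String chunk : split(xml)) {
                tasks.add(() -> countStartElements(chunk));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For large files, the same split points could be stored as byte offsets and each parser given a bounded stream over its own region of the file, which is closer to the RandomAccessFile variant described above. Whether the split pays off depends on how much work each chunk carries compared with the cost of scanning for the boundaries.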
Take a look at Aalto. It comes from the same people who created Woodstox. They are experts in this area, so don't reinvent the wheel.