Extract Text from MS Word Documents in C#

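Microsoft Word documents are a staple for creating and sharing textual content. If you are developing C# applications that interact with these documents, you may find yourself needing to extract text from them, for example to run text analysis or to pull specific parts of a document into a new one.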

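C# Text Extraction Library

Aspose.Words for .NET offers a range of features, including text extraction, document creation, manipulation, and conversion. With Aspose.Words for .NET, developers can manage many aspects of Word files from code, which makes it a valuable tool for document-processing work.

To get started, download the library, or install it directly from NuGet by running the following command in the Package Manager Console: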

PM> Install-Package Aspose.Words

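Understanding Text Extraction in Word Documents

An MS Word document contains various elements, such as paragraphs, tables, and images. Text extraction requirements therefore vary with the use case: you may need to extract the text between paragraphs, bookmarks, comments, and so on.

Aspose.Words represents each element of a document as a node, so processing a document effectively means working with these nodes. Let's explore how to extract text from a document in different scenarios.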
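To get a feel for the node tree before extracting anything, it can help to list which node types a document contains. The following is a minimal sketch, not part of the original article: the class name NodeInspector and the method name CountNodeTypes are hypothetical, and it only relies on Document.GetChildNodes and Node.NodeType.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;

// Hypothetical helper (not part of Aspose.Words): summarizes the node tree of a document.
public static class NodeInspector
{
    // Counts how many nodes of each type the document contains.
    public static Dictionary<NodeType, int> CountNodeTypes(Document doc)
    {
        var counts = new Dictionary<NodeType, int>();

        // NodeType.Any with isDeep = true selects every descendant node of the document.
        foreach (Node node in doc.GetChildNodes(NodeType.Any, true))
        {
            counts.TryGetValue(node.NodeType, out int current);
            counts[node.NodeType] = current + 1;
        }

        return counts;
    }
}
```

A possible usage is to load a file with new Document("document.docx"), call NodeInspector.CountNodeTypes(doc), and print each key/value pair to see how many paragraphs, runs, tables, and bookmarks are available as extraction markers.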

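Step-by-Step Guide to Extracting Text from a Word Document

In this section, we will implement a C# text extractor for Word documents. The text extraction workflow consists of the following steps:

  • Define the nodes to include in the extraction.
  • Extract the content between the specified nodes (including or excluding the start and end nodes).
  • Use clones of the extracted nodes to create a new Word document that contains the extracted content.

Let's create a method named ExtractContent that accepts nodes and other parameters to perform the text extraction. It takes the following parameters: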

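  • StartNode and EndNode: These define the start and end points of the extraction. They can be block-level nodes (e.g., Paragraph, Table) or inline-level nodes (e.g., Run, FieldStart, BookmarkStart).
      • For fields, pass the corresponding FieldStart object.
      • For bookmarks, use the BookmarkStart and BookmarkEnd nodes.
      • For comments, use the CommentRangeStart and CommentRangeEnd nodes.

  • IsInclusive: This parameter determines whether the markers themselves are included in the extraction. If it is set to false and the same node or consecutive nodes are passed, an empty list is returned.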
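As an illustration of how such marker nodes might be obtained, here is a minimal sketch. The class MarkerNodes and its methods are hypothetical names introduced for this example; the sketch assumes the document actually contains the bookmark, comment range, or field being looked up and returns null otherwise.

```csharp
using Aspose.Words;

// Hypothetical helper (not part of Aspose.Words): looks up marker nodes for ExtractContent.
public static class MarkerNodes
{
    // Returns the start and end nodes of a named bookmark, or (null, null) if it does not exist.
    public static (Node Start, Node End) ForBookmark(Document doc, string bookmarkName)
    {
        Bookmark bookmark = doc.Range.Bookmarks[bookmarkName];
        if (bookmark == null)
            return (null, null);

        return (bookmark.BookmarkStart, bookmark.BookmarkEnd);
    }

    // Returns the first CommentRangeStart and the first CommentRangeEnd in the document.
    // Assumption: for simplicity this takes the first node of each type and does not
    // check that both belong to the same comment.
    public static (Node Start, Node End) ForFirstCommentRange(Document doc)
    {
        Node start = doc.GetChild(NodeType.CommentRangeStart, 0, true);
        Node end = doc.GetChild(NodeType.CommentRangeEnd, 0, true);
        return (start, end);
    }

    // Returns the FieldStart node of the first field in the document, or null if there is none.
    public static Node FirstFieldStart(Document doc)
    {
        return doc.GetChild(NodeType.FieldStart, 0, true);
    }
}
```

The returned pair can then be passed as the start and end nodes of the extraction, after checking that neither value is null.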

Here is the complete implementation of the ExtractContent method, which extracts the content between the specified nodes:

// Namespaces used by the snippets in this article. Place the methods inside a class of your own.
using System;
using System.Collections;
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Tables;

public static ArrayList ExtractContent(Node startNode, Node endNode, bool isInclusive)
{
    // First check that the nodes passed to this method are valid for use.
    VerifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    ArrayList nodes = new ArrayList();

    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // We will split the content of the first and last nodes depending on whether the marker nodes are inline.
    while (startNode.ParentNode.NodeType != NodeType.Body)
        startNode = startNode.ParentNode;

    while (endNode.ParentNode.NodeType != NodeType.Body)
        endNode = endNode.ParentNode;

    bool isExtracting = true;
    bool isStartingNode = true;
    bool isEndingNode = false;

    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block-level nodes and split the first and last nodes
    // when needed so that paragraph formatting is retained.
    // The method is a little more complex than a regular extractor because it also has to handle
    // inline nodes, fields, bookmarks, and so on.
    while (isExtracting)
    {
        // Clone the current node and its children to obtain a copy.
        Node cloneNode = currNode.Clone(true);
        isEndingNode = currNode.Equals(endNode);

        if ((isStartingNode || isEndingNode) && cloneNode.IsComposite)
        {
            // We need to process each marker separately, so pass it off to a separate method.
            if (isStartingNode)
            {
                ProcessMarker((CompositeNode)cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }

            // This conditional needs to be separate because the block-level start and end markers may be the same node.
            if (isEndingNode)
            {
                ProcessMarker((CompositeNode)cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        }
        else
        {
            // The node is not a start or end marker, so simply add the copy to the list.
            nodes.Add(cloneNode);
        }

        // Move to the next node and extract it. If the next node is null, the rest of the content is in a different section.
        if (currNode.NextSibling == null && isExtracting)
        {
            // Move to the next section.
            Section nextSection = (Section)currNode.GetAncestor(NodeType.Section).NextSibling;
            currNode = nextSection.Body.FirstChild;
        }
        else
        {
            // Move to the next node in the body.
            currNode = currNode.NextSibling;
        }
    }

    // Return the nodes between the node markers.
    return nodes;
}

In addition, the ExtractContent method relies on a few helper methods to carry out the extraction:

public static List<Paragraph> ParagraphsByStyleName(Document doc, string styleName)
{
    // Create a list to collect paragraphs of the specified style.
    List<Paragraph> paragraphsWithStyle = new List<Paragraph>();
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

    // Look through all paragraphs to find those with the specified style.
    foreach (Paragraph paragraph in paragraphs)
    {
        if (paragraph.ParagraphFormat.Style.Name == styleName)
            paragraphsWithStyle.Add(paragraph);
    }

    return paragraphsWithStyle;
}

private static void VerifyParameterNodes(Node startNode, Node endNode)
{
    // The order in which these checks are done is important.
    if (startNode == null)
        throw new ArgumentException("Start node cannot be null");
    if (endNode == null)
        throw new ArgumentException("End node cannot be null");

    if (!startNode.Document.Equals(endNode.Document))
        throw new ArgumentException("Start node and end node must belong to the same document");

    if (startNode.GetAncestor(NodeType.Body) == null || endNode.GetAncestor(NodeType.Body) == null)
        throw new ArgumentException("Start node and end node must be a child or descendant of a body");

    // Check that the end node is after the start node in the DOM tree.
    // First check whether they are in different sections; if not, compare their positions in the body of the shared section.
    Section startSection = (Section)startNode.GetAncestor(NodeType.Section);
    Section endSection = (Section)endNode.GetAncestor(NodeType.Section);

    int startIndex = startSection.ParentNode.IndexOf(startSection);
    int endIndex = endSection.ParentNode.IndexOf(endSection);

    if (startIndex == endIndex)
    {
        if (startSection.Body.IndexOf(startNode) > endSection.Body.IndexOf(endNode))
            throw new ArgumentException("The end node must be after the start node in the body");
    }
    else if (startIndex > endIndex)
        throw new ArgumentException("The section of end node must be after the section start node");
}

private static bool IsInline(Node node)
{
    // Test whether the node is a descendant of a Paragraph or Table node and is not itself a paragraph or a table.
    // (A paragraph inside a Comment, which is itself a descendant of a paragraph, is possible.)
    return ((node.GetAncestor(NodeType.Paragraph) != null || node.GetAncestor(NodeType.Table) != null) &&
            !(node.NodeType == NodeType.Paragraph || node.NodeType == NodeType.Table));
}

private static void ProcessMarker(CompositeNode cloneNode, ArrayList nodes, Node node, bool isInclusive, bool isStartMarker, bool isEndMarker)
{
    // If we are dealing with a block-level node, just see whether it should be included and add it to the list.
    if (!IsInline(node))
    {
        // Don't add the node twice if the markers are the same node.
        if (!(isStartMarker && isEndMarker))
        {
            if (isInclusive)
                nodes.Add(cloneNode);
        }
        return;
    }

    // If a marker is a FieldStart node, check whether it is to be included or not.
    // We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
    if (node.NodeType == NodeType.FieldStart)
    {
        // If the marker is a start node and is not to be included, skip to the end of the field.
        // If the marker is an end node and is to be included, move to the end of the field so the field is not removed.
        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive))
        {
            while (node.NextSibling != null && node.NodeType != NodeType.FieldEnd)
                node = node.NextSibling;
        }
    }

    // If either marker is part of a comment, then to include the comment itself we need to move the pointer
    // forward to the Comment node found after the CommentRangeEnd node.
    if (node.NodeType == NodeType.CommentRangeEnd)
    {
        while (node.NextSibling != null && node.NodeType != NodeType.Comment)
            node = node.NextSibling;
    }

    // Find the corresponding node in our cloned node by index.
    // If the start and end nodes are the same, some child nodes might already have been removed.
    // Subtract the difference to get the right index.
    int indexDiff = node.ParentNode.ChildNodes.Count - cloneNode.ChildNodes.Count;

    // Child node count identical.
    if (indexDiff == 0)
        node = cloneNode.ChildNodes[node.ParentNode.IndexOf(node)];
    else
        node = cloneNode.ChildNodes[node.ParentNode.IndexOf(node) - indexDiff];

    // Remove the nodes up to/from the marker.
    bool isSkip = false;
    bool isProcessing = true;
    bool isRemoving = isStartMarker;
    Node nextNode = cloneNode.FirstChild;

    while (isProcessing && nextNode != null)
    {
        Node currentNode = nextNode;
        isSkip = false;

        if (currentNode.Equals(node))
        {
            if (isStartMarker)
            {
                isProcessing = false;
                if (isInclusive)
                    isRemoving = false;
            }
            else
            {
                isRemoving = true;
                if (isInclusive)
                    isSkip = true;
            }
        }

        nextNode = nextNode.NextSibling;
        if (isRemoving && !isSkip)
            currentNode.Remove();
    }

    // After processing, the composite node may have become empty. If it has, don't include it.
    if (!(isStartMarker && isEndMarker))
    {
        if (cloneNode.HasChildNodes)
            nodes.Add(cloneNode);
    }
}

public static Document GenerateDocument(Document srcDoc, ArrayList nodes)
{
    // Create a blank document.
    Document dstDoc = new Document();

    // Remove the first paragraph from the empty document.
    dstDoc.FirstSection.Body.RemoveAllChildren();

    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

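Now that the methods are in place, we can go ahead and extract text from a Word document.

Extract Text Between Paragraphs of a Word Document

To extract the content between two paragraphs of a Word DOCX document, follow these steps:

  • Load the Word document using the Document class.
  • Get references to the start and end paragraphs using the Document.FirstSection.Body.GetChild(NodeType.Paragraph, int, bool) method.
  • Call ExtractContent(startPara, endPara, true) to extract the nodes into an ArrayList.
  • Call the GenerateDocument(Document, extractedNodes) helper method to create a document that contains the extracted content.
  • Save the new document using the Document.Save(string) method.

The following code sample shows how to extract the text between the 7th and 11th paragraphs of a Word document: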

// Load Word document
Document doc = new Document("document.docx");
// Gather the nodes (the GetChild method uses 0-based index)
Paragraph startPara = (Paragraph)doc.FirstSection.Body.GetChild(NodeType.Paragraph, 6, true);
Paragraph endPara = (Paragraph)doc.FirstSection.Body.GetChild(NodeType.Paragraph, 10, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = ExtractContent(startPara, endPara, true);
// Insert the content into a new document and save it to disk.
Document dstDoc = GenerateDocument(doc, extractedNodes);
dstDoc.Save("output.docx");
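If the goal is the text itself rather than a new Word file, the extracted nodes can also be converted to a string. The sketch below is an illustration under that assumption: the class ExtractedText and the method ToPlainText are hypothetical names, and the conversion relies on Node.ToString(SaveFormat) with SaveFormat.Text.

```csharp
using System.Collections;
using System.Text;
using Aspose.Words;

// Hypothetical helper (not part of Aspose.Words): turns extracted nodes into plain text.
public static class ExtractedText
{
    // Concatenates the plain-text representation of every node in the list.
    public static string ToPlainText(ArrayList nodes)
    {
        var text = new StringBuilder();

        foreach (Node node in nodes)
        {
            // SaveFormat.Text exports the node's content as plain text.
            text.Append(node.ToString(SaveFormat.Text));
        }

        return text.ToString();
    }
}
```

With the sample above, calling ExtractedText.ToPlainText(extractedNodes) would return the text of the extracted paragraphs as a single string, which can then be passed to whatever text analysis the application performs.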

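Extract Text Between Different Types of Nodes

You can also extract content between nodes of different types. For example, let's extract the content between a paragraph and a table and store it in a new Word document.

  • Load the Word document using the Document class.
  • Get references to the start and end nodes using the GetChild(NodeType, int, bool) method of a section (the sample below uses Document.LastSection).
  • Call ExtractContent(startPara, endTable, true) to extract the nodes into an ArrayList.
  • Call the GenerateDocument(Document, extractedNodes) helper method to create a document that contains the extracted content.
  • Save the new document using Document.Save(string).

The following code sample extracts the text between a paragraph and a table in C#: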

// Load Word document
Document doc = new Document("document.docx");
Paragraph startPara = (Paragraph)doc.LastSection.GetChild(NodeType.Paragraph, 2, true);
Table endTable = (Table)doc.LastSection.GetChild(NodeType.Table, 0, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = ExtractContent(startPara, endTable, true);
// Insert the content into a new document and save it to disk.
Document dstDoc = GenerateDocument(doc, extractedNodes);
dstDoc.Save("output.docx");

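Extract Text Based on Styles

To extract the content between paragraphs based on their styles, follow the steps below. For this demonstration, we will extract the content between the first "Heading 1" and the first "Heading 3" in the Word document:

  • Load the Word document using the Document class.
  • Collect the paragraphs into a list using the ParagraphsByStyleName(doc, "Heading 1") helper method.
  • Collect the paragraphs into another list using the ParagraphsByStyleName(doc, "Heading 3") helper method.
  • Call ExtractContent(startPara, endPara, false) with the first element of each of the two paragraph lists.
  • Call the GenerateDocument(Document, extractedNodes) helper method to create a document that contains the extracted content.
  • Save the new document using Document.Save(string).

The following code sample extracts the content between paragraphs based on styles: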

// Load Word document
Document doc = new Document("document.docx");
// Gather a list of the paragraphs using the respective heading styles.
List<Paragraph> parasStyleHeading1 = ParagraphsByStyleName(doc, "Heading 1");
List<Paragraph> parasStyleHeading3 = ParagraphsByStyleName(doc, "Heading 3");
// Use the first instance of the paragraphs with those styles.
Node startPara1 = (Node)parasStyleHeading1[0];
Node endPara1 = (Node)parasStyleHeading3[0];
// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = ExtractContent(startPara1, endPara1, false);
// Insert the content into a new document and save it to disk.
Document dstDoc = GenerateDocument(doc, extractedNodes);
dstDoc.Save("output.docx");
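As a possible variation, built-in styles can also be matched through the locale-independent ParagraphFormat.StyleIdentifier property instead of comparing style names as strings. The following is a minimal sketch of that idea; the class StyleLookup and the method ParagraphsByStyleIdentifier are hypothetical names that mirror the ParagraphsByStyleName helper.

```csharp
using System.Collections.Generic;
using Aspose.Words;

// Hypothetical helper (not part of Aspose.Words): finds paragraphs by built-in style identifier.
public static class StyleLookup
{
    // Collects all paragraphs whose style has the given built-in identifier.
    public static List<Paragraph> ParagraphsByStyleIdentifier(Document doc, StyleIdentifier styleId)
    {
        var paragraphsWithStyle = new List<Paragraph>();

        // Walk every paragraph in the document, including those nested in tables.
        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            if (paragraph.ParagraphFormat.StyleIdentifier == styleId)
                paragraphsWithStyle.Add(paragraph);
        }

        return paragraphsWithStyle;
    }
}
```

For example, StyleLookup.ParagraphsByStyleIdentifier(doc, StyleIdentifier.Heading1) could stand in for ParagraphsByStyleName(doc, "Heading 1"). In either variant, it is worth checking that the returned list is not empty before indexing its first element.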

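Read More About Text Extraction

Explore additional scenarios for extracting text from Word documents in this documentation article.

Get a Free Word Text Extraction Library

You can get a free temporary license to extract text without evaluation limitations.

Conclusion

Aspose.Words for .NET is a versatile library that simplifies extracting text from Word documents. With its broad feature set and approachable API, you can work with Word documents efficiently and automate a variety of text extraction scenarios. Whether you are building an application that needs full Word document processing or one that simply extracts text, Aspose.Words for .NET is a useful tool for developers.

To learn more about the features of Aspose.Words for .NET, see what people are saying about it. If you have any questions, feel free to reach out through our forum.

See Also

Tip: You may want to check out the Aspose PowerPoint to Word converter, which demonstrates the popular process of converting presentations into Word documents.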