Extract Text from MS Word Documents in C#

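Microsoft Word documents are a staple for creating and sharing textual content. If you are developing C# applications that interact with these documents, you may find yourself needing to extract text from them. The goal might be text analysis, or pulling specific sections out of a document to compile them into a new one.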

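Table of Contents

C# Text Extraction Library
Understanding Text Extraction in Word Documents
Step-by-Step Guide to Extract Text from a Word Document
Extract Text between Paragraphs of a Word Document
Extract Text between Different Types of Nodes
Extract Text between Paragraphs Based on Styles
Get a Free Word Text Extractor Library
Conclusion
See Also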

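C# Text Extraction Library

Aspose.Words for .NET is a library for working with Word documents. It offers a range of features, including text extraction, document creation, manipulation, and conversion. With Aspose.Words for .NET, developers can manage many aspects of Word files efficiently, which makes it an invaluable tool for document-processing work.

To get started, download the library or install it directly from NuGet by running the following command in the Package Manager Console: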

PM> Install-Package Aspose.Words

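Understanding Text Extraction in Word Documents

An MS Word document is made up of various elements, such as paragraphs, tables, and images. Text extraction requirements therefore vary with the use case: you may need to extract the text between paragraphs, bookmarks, comments, and so on.

Each of these elements is represented as a node in the document tree, so processing a document effectively means working with these nodes. Let's explore how to extract text from a document in different scenarios.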
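
As a rough illustration of the node model, the sketch below walks every paragraph node in a document and prints its plain text. The file name document.docx is a placeholder, and the snippet assumes the Aspose.Words package is referenced.

```csharp
using System;
using Aspose.Words;

// Placeholder input file; replace with a real path.
Document doc = new Document("document.docx");

// With isDeep = true, GetChildNodes returns every Paragraph in the document tree,
// including paragraphs nested inside tables.
NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

foreach (Paragraph para in paragraphs)
{
    // ToString(SaveFormat.Text) exports the node as plain text.
    string text = para.ToString(SaveFormat.Text).Trim();
    if (text.Length > 0)
        Console.WriteLine(text);
}
```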

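Step-by-Step Guide to Extract Text from a Word Document

In this section, we will implement a C# text extractor for Word documents. The text extraction workflow consists of the following steps:

Define the nodes to include in the extraction process.
Extract the content between the specified nodes (including or excluding the start and end nodes).
Use clones of the extracted nodes to create a new Word document containing the extracted content.

Let's create a method named ExtractContent that accepts the nodes and a few other parameters to perform the text extraction. It takes the following parameters: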

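StartNode and EndNode: These define the start and end points of the extraction. They can be block-level nodes (e.g., Paragraph, Table) or inline-level nodes (e.g., Run, FieldStart, BookmarkStart).
For fields, pass the corresponding FieldStart object.
For bookmarks, pass the BookmarkStart and BookmarkEnd nodes.
For comments, pass the CommentRangeStart and CommentRangeEnd nodes.
IsInclusive: This parameter determines whether the markers themselves are included in the extraction. If it is set to false and the same node or two consecutive nodes are passed, an empty list is returned.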
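
The marker nodes described above might be obtained roughly as follows. This is a sketch under assumptions: document.docx and the bookmark name "Bookmark1" are placeholders, and the document is assumed to contain at least one comment range and one field. Each lookup returns null when nothing is found, so check the results before passing them to ExtractContent.

```csharp
using Aspose.Words;
using Aspose.Words.Fields;

// Placeholder input file.
Document doc = new Document("document.docx");

// Bookmark markers. "Bookmark1" is a hypothetical bookmark name;
// the indexer returns null if no bookmark with that name exists.
Bookmark bookmark = doc.Range.Bookmarks["Bookmark1"];
Node bookmarkStart = bookmark?.BookmarkStart;
Node bookmarkEnd = bookmark?.BookmarkEnd;

// Comment markers: the first comment range in the document (0-based index).
Node commentStart = doc.GetChild(NodeType.CommentRangeStart, 0, true);
Node commentEnd = doc.GetChild(NodeType.CommentRangeEnd, 0, true);

// Field marker: the first FieldStart node in the document.
FieldStart fieldStart = (FieldStart)doc.GetChild(NodeType.FieldStart, 0, true);
```

A pair such as bookmarkStart and bookmarkEnd can then serve as the start and end nodes of the extraction.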

Below is the complete implementation of the ExtractContent method, which extracts the content between the specified nodes:

	// Requires: using System.Collections; using Aspose.Words;
	public static ArrayList ExtractContent(Node startNode, Node endNode, bool isInclusive)
	{
	    // First check that the nodes passed to this method are valid for use.
	    VerifyParameterNodes(startNode, endNode);

	    // Create a list to store the extracted nodes.
	    ArrayList nodes = new ArrayList();

	    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
	    Node originalStartNode = startNode;
	    Node originalEndNode = endNode;

	    // Extract content based on block-level nodes (paragraphs and tables). Traverse up through the parent nodes to find them.
	    // The content of the first and last nodes is split later if the marker nodes are inline.
	    while (startNode.ParentNode.NodeType != NodeType.Body)
	        startNode = startNode.ParentNode;

	    while (endNode.ParentNode.NodeType != NodeType.Body)
	        endNode = endNode.ParentNode;

	    bool isExtracting = true;
	    bool isStartingNode = true;
	    bool isEndingNode = false;
	    // The current node we are extracting from the document.
	    Node currNode = startNode;

	    // Begin extracting content. Process all block-level nodes and split the first and last nodes when needed so paragraph formatting is retained.
	    // This method is a little more complex than a regular extractor because it also has to handle inline nodes, fields, bookmarks, etc.
	    while (isExtracting)
	    {
	        // Clone the current node and its children to obtain a copy.
	        Node cloneNode = currNode.Clone(true);
	        isEndingNode = currNode.Equals(endNode);

	        if ((isStartingNode || isEndingNode) && cloneNode.IsComposite)
	        {
	            // We need to process each marker separately, so pass it off to a separate method.
	            if (isStartingNode)
	            {
	                ProcessMarker((CompositeNode)cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
	                isStartingNode = false;
	            }

	            // This conditional must be separate because the block-level start and end markers may be the same node.
	            if (isEndingNode)
	            {
	                ProcessMarker((CompositeNode)cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
	                isExtracting = false;
	            }
	        }
	        else
	        {
	            // Node is not a start or end marker, so simply add the copy to the list.
	            nodes.Add(cloneNode);
	        }

	        // Move to the next node and extract it. If the next node is null, the rest of the content is in a different section.
	        if (currNode.NextSibling == null && isExtracting)
	        {
	            // Move to the next section.
	            Section nextSection = (Section)currNode.GetAncestor(NodeType.Section).NextSibling;
	            currNode = nextSection.Body.FirstChild;
	        }
	        else
	        {
	            // Move to the next node in the body.
	            currNode = currNode.NextSibling;
	        }
	    }

	    // Return the nodes between the node markers.
	    return nodes;
	}


In addition, the ExtractContent method relies on a few helper methods to carry out the text extraction:

	// Requires: using System.Collections.Generic; using Aspose.Words;
	public static List<Paragraph> ParagraphsByStyleName(Document doc, string styleName)
	{
	    // Create a list to collect paragraphs of the specified style.
	    List<Paragraph> paragraphsWithStyle = new List<Paragraph>();

	    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

	    // Look through all paragraphs to find those with the specified style.
	    foreach (Paragraph paragraph in paragraphs)
	    {
	        if (paragraph.ParagraphFormat.Style.Name == styleName)
	            paragraphsWithStyle.Add(paragraph);
	    }

	    return paragraphsWithStyle;
	}

	private static void VerifyParameterNodes(Node startNode, Node endNode)
	{
	    // The order in which these checks are done is important.
	    if (startNode == null)
	        throw new ArgumentException("Start node cannot be null");
	    if (endNode == null)
	        throw new ArgumentException("End node cannot be null");

	    if (!startNode.Document.Equals(endNode.Document))
	        throw new ArgumentException("Start node and end node must belong to the same document");

	    if (startNode.GetAncestor(NodeType.Body) == null || endNode.GetAncestor(NodeType.Body) == null)
	        throw new ArgumentException("Start node and end node must be a child or descendant of a body");

	    // Check that the end node is after the start node in the DOM tree.
	    // First check whether they are in different sections; if not, compare their positions in the body of the section they share.
	    Section startSection = (Section)startNode.GetAncestor(NodeType.Section);
	    Section endSection = (Section)endNode.GetAncestor(NodeType.Section);

	    int startIndex = startSection.ParentNode.IndexOf(startSection);
	    int endIndex = endSection.ParentNode.IndexOf(endSection);

	    if (startIndex == endIndex)
	    {
	        if (startSection.Body.IndexOf(startNode) > endSection.Body.IndexOf(endNode))
	            throw new ArgumentException("The end node must be after the start node in the body");
	    }
	    else if (startIndex > endIndex)
	        throw new ArgumentException("The section of end node must be after the section start node");
	}

	private static bool IsInline(Node node)
	{
	    // Test whether the node is a descendant of a Paragraph or Table node and is not itself a paragraph or a table.
	    // (A paragraph inside a comment, which is itself a descendant of a paragraph, is possible.)
	    return ((node.GetAncestor(NodeType.Paragraph) != null || node.GetAncestor(NodeType.Table) != null) && !(node.NodeType == NodeType.Paragraph || node.NodeType == NodeType.Table));
	}

	private static void ProcessMarker(CompositeNode cloneNode, ArrayList nodes, Node node, bool isInclusive, bool isStartMarker, bool isEndMarker)
	{
	    // If we are dealing with a block-level node, just see whether it should be included and add it to the list.
	    if (!IsInline(node))
	    {
	        // Don't add the node twice if the markers are the same node.
	        if (!(isStartMarker && isEndMarker))
	        {
	            if (isInclusive)
	                nodes.Add(cloneNode);
	        }
	        return;
	    }

	    // If a marker is a FieldStart node, check whether it is to be included or not.
	    // We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
	    if (node.NodeType == NodeType.FieldStart)
	    {
	        // If the marker is a start node and is not to be included, skip to the end of the field.
	        // If the marker is an end node and is to be included, move to the end of the field so the field is not removed.
	        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive))
	        {
	            while (node.NextSibling != null && node.NodeType != NodeType.FieldEnd)
	                node = node.NextSibling;
	        }
	    }

	    // If either marker is part of a comment, then to include the comment itself we need to move the pointer
	    // forward to the Comment node found after the CommentRangeEnd node.
	    if (node.NodeType == NodeType.CommentRangeEnd)
	    {
	        while (node.NextSibling != null && node.NodeType != NodeType.Comment)
	            node = node.NextSibling;
	    }

	    // Find the corresponding node in our cloned node by index.
	    // If the start and end node are the same, some child nodes might already have been removed.
	    // Subtract the difference to get the right index.
	    int indexDiff = node.ParentNode.ChildNodes.Count - cloneNode.ChildNodes.Count;

	    // Child node count identical.
	    if (indexDiff == 0)
	        node = cloneNode.ChildNodes[node.ParentNode.IndexOf(node)];
	    else
	        node = cloneNode.ChildNodes[node.ParentNode.IndexOf(node) - indexDiff];

	    // Remove the nodes up to/from the marker.
	    bool isSkip = false;
	    bool isProcessing = true;
	    bool isRemoving = isStartMarker;
	    Node nextNode = cloneNode.FirstChild;

	    while (isProcessing && nextNode != null)
	    {
	        Node currentNode = nextNode;
	        isSkip = false;

	        if (currentNode.Equals(node))
	        {
	            if (isStartMarker)
	            {
	                isProcessing = false;
	                if (isInclusive)
	                    isRemoving = false;
	            }
	            else
	            {
	                isRemoving = true;
	                if (isInclusive)
	                    isSkip = true;
	            }
	        }

	        nextNode = nextNode.NextSibling;
	        if (isRemoving && !isSkip)
	            currentNode.Remove();
	    }

	    // After processing, the composite node may have become empty. If it has, don't include it.
	    if (!(isStartMarker && isEndMarker))
	    {
	        if (cloneNode.HasChildNodes)
	            nodes.Add(cloneNode);
	    }
	}

	public static Document GenerateDocument(Document srcDoc, ArrayList nodes)
	{
	    // Create a blank document.
	    Document dstDoc = new Document();
	    // Remove the first paragraph from the empty document.
	    dstDoc.FirstSection.Body.RemoveAllChildren();

	    // Import each node from the list into the new document. Keep the original formatting of the node.
	    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);

	    foreach (Node node in nodes)
	    {
	        Node importNode = importer.ImportNode(node, true);
	        dstDoc.FirstSection.Body.AppendChild(importNode);
	    }

	    // Return the generated document.
	    return dstDoc;
	}


Now that our methods are ready, we can go ahead and extract text from a Word document.

Extract Text between Paragraphs of a Word Document

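To extract the content between two paragraphs in a Word DOCX document, follow these steps:

Load the Word document using the Document class.
Get references to the start and end paragraphs using the Document.FirstSection.Body.GetChild(NodeType.Paragraph, int, bool) method.
Call ExtractContent(startPara, endPara, true) to extract the nodes into a list.
Call the GenerateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
Save the new document using the Document.Save(string) method.

The following code sample shows how to extract the text between the 7th and 11th paragraphs of a Word document: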

	// Load Word document
	Document doc = new Document("document.docx");

	// Gather the nodes (the GetChild method uses 0-based index)
	Paragraph startPara = (Paragraph)doc.FirstSection.Body.GetChild(NodeType.Paragraph, 6, true);
	Paragraph endPara = (Paragraph)doc.FirstSection.Body.GetChild(NodeType.Paragraph, 10, true);

	// Extract the content between these nodes in the document. Include these markers in the extraction.
	ArrayList extractedNodes = ExtractContent(startPara, endPara, true);

	// Insert the content into a new document and save it to disk.
	Document dstDoc = GenerateDocument(doc, extractedNodes);
	dstDoc.Save("output.docx");
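
If only the plain text is needed rather than a new Word document, the extracted nodes could be flattened to a string instead. The sketch below is one possible way to do that. The class and method names are hypothetical (they are not part of Aspose.Words), and it assumes the cloned nodes in the list can be exported to text on their own.

```csharp
using System.Collections;
using System.Text;
using Aspose.Words;

public static class ExtractedText
{
    // Hypothetical helper: concatenates the plain text of each extracted node.
    public static string ToPlainText(ArrayList nodes)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Node node in nodes)
        {
            // ToString(SaveFormat.Text) converts a node and its children to plain text.
            sb.Append(node.ToString(SaveFormat.Text));
        }
        return sb.ToString();
    }
}
```

With the list returned by ExtractContent, a call such as ExtractedText.ToPlainText(extractedNodes) would give the text of the extracted range as a single string.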


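Extract Text between Different Types of Nodes

You can also extract content between different types of nodes. For example, let's extract the content between a paragraph and a table and save it in a new Word document.

Load the Word document using the Document class.
Get references to the start and end nodes using the GetChild(NodeType, int, bool) method (the sample below calls it on Document.LastSection).
Call ExtractContent(startPara, endTable, true) to extract the nodes into a list.
Call the GenerateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
Save the new document using Document.Save(string).

The following code sample shows how to extract the text between a paragraph and a table in C#: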

	// Load Word document
	Document doc = new Document("document.docx");

	Paragraph startPara = (Paragraph)doc.LastSection.GetChild(NodeType.Paragraph, 2, true);
	Table endTable = (Table)doc.LastSection.GetChild(NodeType.Table, 0, true);

	// Extract the content between these nodes in the document. Include these markers in the extraction.
	ArrayList extractedNodes = ExtractContent(startPara, endTable, true);

	// Insert the content into a new document and save it to disk.
	Document dstDoc = GenerateDocument(doc, extractedNodes);
	dstDoc.Save("output.docx");
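
GetChild returns null when no node of the requested type exists at the given index, which would otherwise only show up later as an exception inside ExtractContent. A small wrapper along these lines can make that failure easier to diagnose. The class and method names are hypothetical and not part of Aspose.Words.

```csharp
using System;
using Aspose.Words;

public static class NodeLookup
{
    // Hypothetical helper: wraps GetChild (deep search, 0-based index) and
    // fails with a descriptive message when no matching node is found.
    public static T GetRequiredChild<T>(CompositeNode parent, NodeType nodeType, int index)
        where T : Node
    {
        Node node = parent.GetChild(nodeType, index, true);
        if (node == null)
            throw new InvalidOperationException($"No {nodeType} node found at index {index}.");
        return (T)node;
    }
}
```

For example, NodeLookup.GetRequiredChild<Table>(doc.LastSection, NodeType.Table, 0) would replace the cast in the sample above (Table lives in the Aspose.Words.Tables namespace).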


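Extract Text between Paragraphs Based on Styles

To extract the content between paragraphs based on their styles, follow the steps below. For this demonstration, we will extract the content between the first "Heading 1" and the first "Heading 3" paragraph in the Word document:

Load the Word document using the Document class.
Collect the paragraphs into a list using the ParagraphsByStyleName(Document, "Heading 1") helper method.
Collect the paragraphs into another list using the ParagraphsByStyleName(Document, "Heading 3") helper method.
Call ExtractContent(startPara, endPara, false) with the first element of each of the two paragraph lists.
Call the GenerateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
Save the new document using Document.Save(string).

The following code sample extracts the content between paragraphs based on their styles: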

	// Load Word document
	Document doc = new Document("document.docx");

	// Gather a list of the paragraphs using the respective heading styles.
	List<Paragraph> parasStyleHeading1 = ParagraphsByStyleName(doc, "Heading 1");
	List<Paragraph> parasStyleHeading3 = ParagraphsByStyleName(doc, "Heading 3");

	// Use the first instance of the paragraphs with those styles.
	Node startPara1 = (Node)parasStyleHeading1[0];
	Node endPara1 = (Node)parasStyleHeading3[0];

	// Extract the content between these nodes in the document. Don't include these markers in the extraction.
	ArrayList extractedNodes = ExtractContent(startPara1, endPara1, false);

	// Insert the content into a new document and save it to disk.
	Document dstDoc = GenerateDocument(doc, extractedNodes);
	dstDoc.Save("output.docx");
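
Matching on the style name works when the name is exactly as expected. As an alternative sketch, built-in styles can also be matched through the locale-independent StyleIdentifier value. The class and method names below are hypothetical variations on ParagraphsByStyleName, not part of the original helpers.

```csharp
using System.Collections.Generic;
using Aspose.Words;

public static class StyleLookup
{
    // Hypothetical variant of ParagraphsByStyleName that matches built-in styles
    // by StyleIdentifier instead of by display name.
    public static List<Paragraph> ParagraphsByStyleIdentifier(Document doc, StyleIdentifier styleId)
    {
        List<Paragraph> result = new List<Paragraph>();

        // Deep search over every paragraph in the document.
        foreach (Paragraph paragraph in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            if (paragraph.ParagraphFormat.StyleIdentifier == styleId)
                result.Add(paragraph);
        }

        return result;
    }
}
```

A call such as StyleLookup.ParagraphsByStyleIdentifier(doc, StyleIdentifier.Heading1) would then stand in for ParagraphsByStyleName(doc, "Heading 1"). Either way, it is worth checking that the returned list is not empty before indexing element 0, since a document without that style would otherwise throw.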


Get a Free Word Text Extractor Library

You can get a free temporary license to extract text without evaluation limitations.
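
Once a license file has been obtained, it is typically applied once at application startup, before any documents are processed. A minimal sketch, assuming the file is saved as Aspose.Words.lic next to the executable (the file name is a placeholder):

```csharp
using Aspose.Words;

// Placeholder path to the license file; adjust to where the file is actually stored.
License license = new License();
license.SetLicense("Aspose.Words.lic");
```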

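Conclusion

Aspose.Words for .NET is a versatile library that simplifies extracting text from Word documents. With its broad feature set and approachable API, you can work with Word documents efficiently and automate a variety of text extraction scenarios. Whether you are building an application that needs full Word document processing or one that simply extracts text, Aspose.Words for .NET is a valuable tool for developers.

To learn more about the features of Aspose.Words for .NET, see the documentation. If you have any questions, feel free to reach out on our forum.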

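See Also

Tip: You may want to check out the Aspose PowerPoint to Word converter, which demonstrates the popular process of converting presentations to Word documents.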
