Extract Text from MS Word Documents in C#

Microsoft Word documents are a staple for creating and sharing textual content. If you’re developing C# applications that interact with these documents, you might find yourself needing to extract text from them. This could be for purposes such as text analysis or extracting specific sections of a document to compile into a new one. In this blog post, we will dive into the methods for extracting text from Word documents in C#.

C# Library for Text Extraction

Aspose.Words for .NET is a library for working with Word documents. It supports text extraction as well as document creation, manipulation, and conversion. The examples in this post use its document object model to locate, copy, and save content.

To get started, download the library or install it directly from NuGet using the following command in the package manager console:

PM> Install-Package Aspose.Words

Understanding Text Extraction in Word Documents

An MS Word document comprises various elements such as paragraphs, tables, and images. Consequently, the requirements for text extraction can differ based on the specific use case. You may need to extract text between paragraphs, bookmarks, comments, and more.

Each element in a Word document is represented as a node. Therefore, to effectively process a document, you will need to work with these nodes. Let’s explore how to extract text from Word documents in different scenarios.
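To make the node model concrete, here is a minimal sketch that builds a small document in memory and lists the direct children of its body. The class and method names (NodeTour, BuildSample, DescribeBody) and the sample text are made up for illustration; the sketch assumes the Aspose.Words package is referenced.

```csharp
using System.Collections.Generic;
using Aspose.Words;

public static class NodeTour
{
    // Builds a document with a paragraph, a one-row table, and another paragraph (illustrative content).
    public static Document BuildSample()
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.Writeln("First paragraph");
        builder.StartTable();
        builder.InsertCell();
        builder.Write("Cell A");
        builder.InsertCell();
        builder.Write("Cell B");
        builder.EndRow();
        builder.EndTable();
        builder.Writeln("Second paragraph");
        return doc;
    }

    // Returns one "NodeType: text" line per direct child of the first section's body.
    public static List<string> DescribeBody(Document doc)
    {
        List<string> lines = new List<string>();
        foreach (Node node in doc.FirstSection.Body.GetChildNodes(NodeType.Any, false))
        {
            string text = node.ToString(SaveFormat.Text).Trim();
            lines.Add(node.NodeType + ": " + text);
        }
        return lines;
    }
}
```

Calling NodeTour.DescribeBody(NodeTour.BuildSample()) should list the block-level Paragraph and Table nodes in document order. Inline nodes such as Run or BookmarkStart sit inside those block-level nodes, which is why the extractor in the next section distinguishes between the two levels.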

Step-by-Step Guide to Extract Text from a Word Document

In this section, we will implement a C# text extractor for Word documents. The workflow for text extraction will involve the following steps:

  1. Define the nodes to include in the extraction process.
  2. Extract the content between the specified nodes (including or excluding the starting and ending nodes).
  3. Use the cloned extracted nodes to create a new Word document containing the extracted content.

Let’s create a method named ExtractContent that will accept nodes and other parameters to perform the text extraction. This method will parse the document and clone the nodes based on the following parameters:

  • startNode and endNode: These define the starting and ending points for content extraction. They can be block-level nodes (e.g., Paragraph, Table) or inline-level nodes (e.g., Run, FieldStart, BookmarkStart).
    • For fields, pass the corresponding FieldStart object.
    • For bookmarks, use BookmarkStart and BookmarkEnd nodes.
    • For comments, employ CommentRangeStart and CommentRangeEnd nodes.
  • isInclusive: This parameter determines whether the start and end marker nodes themselves are included in the extracted content. If it is set to false and the same node or two consecutive nodes are provided, an empty list will be returned.
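As an illustration of the inline marker nodes mentioned above, the following sketch looks up a bookmark and returns its BookmarkStart and BookmarkEnd nodes. The class name, method names, and the bookmark name "Notes" are assumptions made for this example; only the lookup through Range.Bookmarks is taken from the Aspose.Words API.

```csharp
using Aspose.Words;

public static class BookmarkMarkers
{
    // Builds a document containing one bookmark named "Notes" (name chosen for this example).
    public static Document BuildSample()
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.Writeln("Before the bookmark");
        builder.StartBookmark("Notes");
        builder.Writeln("Inside the bookmark");
        builder.EndBookmark("Notes");
        builder.Writeln("After the bookmark");
        return doc;
    }

    // Returns { BookmarkStart, BookmarkEnd } for the named bookmark,
    // or an empty array when the document has no bookmark with that name.
    public static Node[] FindMarkers(Document doc, string bookmarkName)
    {
        Bookmark bookmark = doc.Range.Bookmarks[bookmarkName];
        if (bookmark == null)
            return new Node[0];
        return new Node[] { bookmark.BookmarkStart, bookmark.BookmarkEnd };
    }
}
```

The two returned nodes are the kind of inline markers the ExtractContent method expects, so a call along the lines of ExtractContent(markers[0], markers[1], true) would extract the bookmarked content together with the bookmark itself.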

Here is the complete implementation of the ExtractContent method to extract content between the specified nodes:

public static ArrayList ExtractContent(Node startNode, Node endNode, bool isInclusive)
{
    // First check that the nodes passed to this method are valid for use.
    VerifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    ArrayList nodes = new ArrayList();

    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // The content of the first and last nodes is split if the marker nodes are inline.
    while (startNode.ParentNode.NodeType != NodeType.Body)
        startNode = startNode.ParentNode;
    while (endNode.ParentNode.NodeType != NodeType.Body)
        endNode = endNode.ParentNode;

    bool isExtracting = true;
    bool isStartingNode = true;
    bool isEndingNode = false;

    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block-level nodes and split the first and last nodes when needed so paragraph formatting is retained.
    // This method is a little more complex than a regular extractor because it also has to handle inline nodes, fields, bookmarks, and so on.
    while (isExtracting)
    {
        // Clone the current node and its children to obtain a copy.
        Node cloneNode = currNode.Clone(true);
        isEndingNode = currNode.Equals(endNode);

        if ((isStartingNode || isEndingNode) && cloneNode.IsComposite)
        {
            // We need to process each marker separately, so pass it off to a separate method instead.
            if (isStartingNode)
            {
                ProcessMarker((CompositeNode)cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }
            // This conditional needs to be separate because the block-level start and end markers may be the same node.
            if (isEndingNode)
            {
                ProcessMarker((CompositeNode)cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        }
        else
        {
            // Node is not a start or end marker, so simply add the copy to the list.
            nodes.Add(cloneNode);
        }

        // Move to the next node and extract it. If the next node is null, the rest of the content is in a different section.
        if (currNode.NextSibling == null && isExtracting)
        {
            // Move to the next section.
            Section nextSection = (Section)currNode.GetAncestor(NodeType.Section).NextSibling;
            currNode = nextSection.Body.FirstChild;
        }
        else
        {
            // Move to the next node in the body.
            currNode = currNode.NextSibling;
        }
    }

    // Return the nodes between the node markers.
    return nodes;
}

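ExtractContent relies on three private helper methods: VerifyParameterNodes, IsInline, and ProcessMarker. Two further public helpers, ParagraphsByStyleName and GenerateDocument, are used by the examples later in this post. All of the code assumes using directives for the System, System.Collections, System.Collections.Generic, Aspose.Words, and Aspose.Words.Tables namespaces: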

public static List<Paragraph> ParagraphsByStyleName(Document doc, string styleName)
{
    // Create a list to collect paragraphs of the specified style.
    List<Paragraph> paragraphsWithStyle = new List<Paragraph>();
    NodeCollection paragraphs = doc.GetChildNodes(NodeType.Paragraph, true);

    // Look through all paragraphs to find those with the specified style.
    foreach (Paragraph paragraph in paragraphs)
    {
        if (paragraph.ParagraphFormat.Style.Name == styleName)
            paragraphsWithStyle.Add(paragraph);
    }
    return paragraphsWithStyle;
}

private static void VerifyParameterNodes(Node startNode, Node endNode)
{
    // The order in which these checks are done is important.
    if (startNode == null)
        throw new ArgumentException("Start node cannot be null");
    if (endNode == null)
        throw new ArgumentException("End node cannot be null");
    if (!startNode.Document.Equals(endNode.Document))
        throw new ArgumentException("Start node and end node must belong to the same document");
    if (startNode.GetAncestor(NodeType.Body) == null || endNode.GetAncestor(NodeType.Body) == null)
        throw new ArgumentException("Start node and end node must be a child or descendant of a body");

    // Check that the end node is after the start node in the DOM tree.
    // First check whether they are in different sections; if they are not, check their position in the body of the section they share.
    Section startSection = (Section)startNode.GetAncestor(NodeType.Section);
    Section endSection = (Section)endNode.GetAncestor(NodeType.Section);
    int startIndex = startSection.ParentNode.IndexOf(startSection);
    int endIndex = endSection.ParentNode.IndexOf(endSection);
    if (startIndex == endIndex)
    {
        if (startSection.Body.IndexOf(startNode) > endSection.Body.IndexOf(endNode))
            throw new ArgumentException("The end node must be after the start node in the body");
    }
    else if (startIndex > endIndex)
        throw new ArgumentException("The section of end node must be after the section start node");
}

private static bool IsInline(Node node)
{
    // Test whether the node is a descendant of a Paragraph or Table node and is not itself a Paragraph or Table.
    // The second check is needed because a paragraph inside a comment, which is a descendant of a paragraph, is possible.
    return ((node.GetAncestor(NodeType.Paragraph) != null || node.GetAncestor(NodeType.Table) != null) && !(node.NodeType == NodeType.Paragraph || node.NodeType == NodeType.Table));
}

private static void ProcessMarker(CompositeNode cloneNode, ArrayList nodes, Node node, bool isInclusive, bool isStartMarker, bool isEndMarker)
{
    // If we are dealing with a block-level node, just see if it should be included and add it to the list.
    if (!IsInline(node))
    {
        // Don't add the node twice if the markers are the same node.
        if (!(isStartMarker && isEndMarker))
        {
            if (isInclusive)
                nodes.Add(cloneNode);
        }
        return;
    }

    // If a marker is a FieldStart node, check whether it is to be included or not.
    // We assume for simplicity that the FieldStart and FieldEnd appear in the same paragraph.
    if (node.NodeType == NodeType.FieldStart)
    {
        // If the marker is a start node and is not to be included, skip to the end of the field.
        // If the marker is an end node and is to be included, move to the end of the field so the field will not be removed.
        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive))
        {
            while (node.NextSibling != null && node.NodeType != NodeType.FieldEnd)
                node = node.NextSibling;
        }
    }

    // If either marker is part of a comment, then to include the comment itself we need to move the pointer
    // forward to the Comment node found after the CommentRangeEnd node.
    if (node.NodeType == NodeType.CommentRangeEnd)
    {
        while (node.NextSibling != null && node.NodeType != NodeType.Comment)
            node = node.NextSibling;
    }

    // Find the corresponding node in our cloned node by index.
    // If the start and end node are the same, some child nodes might already have been removed.
    // Subtract the difference to get the right index.
    int indexDiff = node.ParentNode.ChildNodes.Count - cloneNode.ChildNodes.Count;

    // Child node count identical.
    if (indexDiff == 0)
        node = cloneNode.ChildNodes[node.ParentNode.IndexOf(node)];
    else
        node = cloneNode.ChildNodes[node.ParentNode.IndexOf(node) - indexDiff];

    // Remove the nodes up to/from the marker.
    bool isSkip = false;
    bool isProcessing = true;
    bool isRemoving = isStartMarker;
    Node nextNode = cloneNode.FirstChild;
    while (isProcessing && nextNode != null)
    {
        Node currentNode = nextNode;
        isSkip = false;
        if (currentNode.Equals(node))
        {
            if (isStartMarker)
            {
                isProcessing = false;
                if (isInclusive)
                    isRemoving = false;
            }
            else
            {
                isRemoving = true;
                if (isInclusive)
                    isSkip = true;
            }
        }
        nextNode = nextNode.NextSibling;
        if (isRemoving && !isSkip)
            currentNode.Remove();
    }

    // After processing, the composite node may have become empty. If it has, don't include it.
    if (!(isStartMarker && isEndMarker))
    {
        if (cloneNode.HasChildNodes)
            nodes.Add(cloneNode);
    }
}

public static Document GenerateDocument(Document srcDoc, ArrayList nodes)
{
    // Create a blank document.
    Document dstDoc = new Document();

    // Remove the first paragraph from the empty document.
    dstDoc.FirstSection.Body.RemoveAllChildren();

    // Import each node from the list into the new document. Keep the original formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
    foreach (Node node in nodes)
    {
        Node importNode = importer.ImportNode(node, true);
        dstDoc.FirstSection.Body.AppendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

Now that we have our methods ready, we can proceed to extract text from a Word document.

Extracting Text Between Paragraphs of a Word Document

To extract content between two paragraphs in a Word DOCX document, follow these steps:

  1. Load the Word document using the Document class.
  2. Get references to the starting and ending paragraphs using the Document.FirstSection.Body.GetChild(NodeType.Paragraph, int, bool) method.
  3. Call the ExtractContent(startPara, endPara, true) method to extract the nodes into an ArrayList.
  4. Use the GenerateDocument(Document, extractedNodes) helper method to create a document with the extracted content.
  5. Save the new document using the Document.Save(string) method.

Here’s a code sample demonstrating how to extract the 7th through 11th paragraphs of a Word document, including both boundary paragraphs:

// Load Word document
Document doc = new Document("document.docx");
// Gather the nodes (the GetChild method uses 0-based index)
Paragraph startPara = (Paragraph)doc.FirstSection.Body.GetChild(NodeType.Paragraph, 6, true);
Paragraph endPara = (Paragraph)doc.FirstSection.Body.GetChild(NodeType.Paragraph, 10, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = ExtractContent(startPara, endPara, true);
// Insert the content into a new document and save it to disk.
Document dstDoc = GenerateDocument(doc, extractedNodes);
dstDoc.Save("output.docx");
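If the goal is a plain string rather than a new DOCX file, the extracted nodes can also be rendered as text. The following is a minimal sketch, assuming Node.ToString(SaveFormat.Text) is an acceptable text rendering for your use case; the PlainText class and FromNodes method are hypothetical names introduced here and are not part of the library.

```csharp
using System.Collections;
using System.Text;
using Aspose.Words;

public static class PlainText
{
    // Concatenates the plain-text rendering of every node in the list, in order.
    public static string FromNodes(IEnumerable nodes)
    {
        StringBuilder sb = new StringBuilder();
        foreach (Node node in nodes)
            sb.Append(node.ToString(SaveFormat.Text));
        return sb.ToString();
    }
}
```

With the sample above, this could be used as string text = PlainText.FromNodes(extractedNodes). Calling ToString(SaveFormat.Text) on the generated dstDoc is an alternative that goes through the new document instead of the cloned nodes; it is worth checking which of the two gives the line breaks and table layout you expect.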

Extracting Text Between Different Types of Nodes

You can also extract content between different types of nodes. For example, let’s extract content between a paragraph and a table and save it into a new Word document. The steps are as follows:

  1. Load the Word document using the Document class.
  2. Get references to the starting paragraph and the ending table using the Document.LastSection.GetChild(NodeType, int, bool) method.
  3. Call ExtractContent(startPara, endTable, true) to extract the nodes into an ArrayList.
  4. Use the GenerateDocument(Document, extractedNodes) helper method to create a document with the extracted content.
  5. Save the new document using Document.Save(string).

Here’s the code sample for extracting text between a paragraph and a table in C#:

// Load Word document
Document doc = new Document("document.docx");
Paragraph startPara = (Paragraph)doc.LastSection.GetChild(NodeType.Paragraph, 2, true);
Table endTable = (Table)doc.LastSection.GetChild(NodeType.Table, 0, true);
// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = ExtractContent(startPara, endTable, true);
// Insert the content into a new document and save it to disk.
Document dstDoc = GenerateDocument(doc, extractedNodes);
dstDoc.Save("output.docx");

Extracting Text Based on Styles

To extract content between paragraphs based on styles, follow these steps. For this demonstration, we will extract content between the first “Heading 1” and the first “Heading 3” in the Word document:

  1. Load the Word document using the Document class.
  2. Collect the “Heading 1” paragraphs into a list using the ParagraphsByStyleName(doc, "Heading 1") helper method.
  3. Collect the “Heading 3” paragraphs into another list using ParagraphsByStyleName(doc, "Heading 3").
  4. Call ExtractContent(startPara, endPara, false) with the first element of each list. Passing false leaves the two heading paragraphs themselves out of the result.
  5. Use the GenerateDocument(Document, extractedNodes) helper method to create a document with the extracted content.
  6. Save the new document using Document.Save(string).

Here’s a code sample to extract content between paragraphs based on styles:

// Load Word document
Document doc = new Document("document.docx");
// Gather a list of the paragraphs using the respective heading styles.
List<Paragraph> parasStyleHeading1 = ParagraphsByStyleName(doc, "Heading 1");
List<Paragraph> parasStyleHeading3 = ParagraphsByStyleName(doc, "Heading 3");
// Use the first instance of the paragraphs with those styles.
Node startPara1 = (Node)parasStyleHeading1[0];
Node endPara1 = (Node)parasStyleHeading3[0];
// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = ExtractContent(startPara1, endPara1, false);
// Insert the content into a new document and save it to disk.
Document dstDoc = GenerateDocument(doc, extractedNodes);
dstDoc.Save("output.docx");
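The sample above indexes both lists with [0], which throws if the document has no paragraph with one of the styles. It also matches on the style's display name, which can differ in localized documents. The sketch below shows one possible way to handle both points by matching on the built-in StyleIdentifier and returning null when nothing is found. The StyleLookup class and its method names are hypothetical and only meant as an illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using Aspose.Words;

public static class StyleLookup
{
    // Finds paragraphs by built-in style identifier rather than by style display name.
    public static List<Paragraph> ParagraphsByStyleIdentifier(Document doc, StyleIdentifier id)
    {
        return doc.GetChildNodes(NodeType.Paragraph, true)
                  .OfType<Paragraph>()
                  .Where(p => p.ParagraphFormat.StyleIdentifier == id)
                  .ToList();
    }

    // Returns the first paragraph with the given style, or null when there is none.
    public static Paragraph FirstOrNull(Document doc, StyleIdentifier id)
    {
        return ParagraphsByStyleIdentifier(doc, id).FirstOrDefault();
    }

    // Builds a small document with a Heading 1, body text, and a Heading 3 (illustrative content).
    public static Document BuildSample()
    {
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Heading1;
        builder.Writeln("Chapter");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Normal;
        builder.Writeln("Body text");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Heading3;
        builder.Writeln("Detail");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Normal;
        return doc;
    }
}
```

A caller would check both results for null before passing them to ExtractContent, for example by skipping the extraction or reporting that the expected heading is missing.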

Read More About Text Extraction

Explore additional scenarios for extracting text from Word documents through this documentation article.

Get a Free Word Text Extractor Library

You can obtain a free temporary license to extract text without evaluation limitations.

Conclusion

Aspose.Words for .NET is a versatile library that streamlines the process of extracting text from Word documents in C#. With its extensive features and user-friendly API, you can efficiently work with Word documents and automate various text extraction scenarios. Whether you’re developing applications that require Word document processing or simply extracting text, Aspose.Words for .NET is an essential tool for developers.

To explore more features of Aspose.Words for .NET, check out the documentation. If you have any questions, feel free to reach out via our forum.

See Also

Tip: You may want to check out the Aspose PowerPoint to Word Converter, which demonstrates the popular process of converting presentations to Word documents.
