Extract Text from MS Word Documents in C#

Microsoft Word documents are widely used for creating and sharing textual content. If you work with Word documents in your C# or ASP.NET applications, you may need to extract text from them while preserving its formatting. For example, you might want to analyze the text, extract particular sections of a document, or combine sections into a single document. In this post, we will explore how to extract text from Word documents in C# using Aspose.Words for .NET.

C# Library to Extract Text from Word Documents

Aspose.Words for .NET is a feature-rich and easy-to-use library for working with Word documents. It offers a wide range of capabilities, including text extraction, document creation, manipulation, and conversion. With Aspose.Words for .NET, you can handle various aspects of Word documents, which makes it a useful tool for developers who need to extract text from Word files in C#.

You can download the DLL or install the library directly from NuGet using the package manager console.

PM> Install-Package Aspose.Words

Extracting Text from Word Documents

An MS Word document consists of various elements, including paragraphs, tables, and images, so the requirements of text extraction vary from one scenario to another. For example, you may need all the text of a document, or only the content between two particular elements. Each type of element in a Word document is represented as a node, so to process a document, you interact with its nodes. Let’s see how to extract text from Word documents in different scenarios while preserving the formatting of the extracted content.
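To make the node model concrete, here is a minimal sketch that reads text from a document in two ways: as one string for the whole document, and paragraph by paragraph. The sample content and the class name NodeTextDemo are assumptions for illustration. The document is built in memory so that the sketch runs without an input file; in practice you would load an existing file with new Document(path).

```csharp
using System;
using Aspose.Words;

class NodeTextDemo
{
    static void Main()
    {
        // Build a small document in memory so the sketch needs no input file.
        // In a real application, load one instead: new Document("input.docx").
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.Writeln("First paragraph.");
        builder.Writeln("Second paragraph.");

        // Text of the whole document, converted to plain text.
        string allText = doc.ToString(SaveFormat.Text);
        Console.WriteLine(allText);

        // Every paragraph is a node; walk them and read each one's text.
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            string text = para.ToString(SaveFormat.Text).Trim();
            if (text.Length > 0)
                Console.WriteLine($"[{para.NodeType}] {text}");
        }
    }
}
```

Note that the library inserts an evaluation notice into output when it runs without a license, so the printed text may contain more than the sample content.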

Extract Text from a Word DOC in C#

In this section, we will implement a text extractor for Word documents in C#. The workflow of text extraction is as follows:

  • First, we will define the nodes that we want to include in the text extraction process.
  • Then, we will extract the content between the specified nodes (including or excluding the starting and ending nodes).
  • Finally, we will use a clone of the extracted nodes, e.g., to create a new Word document consisting of the extracted content.

Let’s now write a method named ExtractContent to which we will pass the nodes and some other parameters to perform the text extraction. This method will parse the document and clone the nodes. The following are the parameters that we will pass to this method:

  1. StartNode and EndNode as the starting and ending points for the extraction of the content, respectively. These can be either block-level nodes (Paragraph, Table) or inline-level nodes (e.g., Run, FieldStart, BookmarkStart, etc.).
    1. To pass a field, you should pass the corresponding FieldStart object.
    2. To pass bookmarks, the BookmarkStart and BookmarkEnd nodes should be passed.
    3. For comments, the CommentRangeStart and CommentRangeEnd nodes should be used.
  2. IsInclusive defines if the markers are included in the extraction or not. If this option is set to false and the same node or consecutive nodes are passed, then an empty list will be returned.

The following is the complete implementation of the ExtractContent method, which extracts the content between the nodes that are passed.
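The full method has to handle inline nodes, fields, bookmarks, and comments, which takes considerably more code. As a rough illustration of the core idea only, the sketch below is a simplified stand-in, not the implementation referred to above. It assumes both markers are block-level nodes with the same parent, such as two paragraphs in the same Body. It walks the siblings from the start node to the end node and clones each one, honoring the isInclusive flag. The class name ExtractSketch and the sample content are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;

static class ExtractSketch
{
    // Simplified stand-in: both markers must be block-level siblings
    // (for example, direct children of the same Body).
    public static List<Node> ExtractContent(Node startNode, Node endNode, bool isInclusive)
    {
        if (startNode == null) throw new ArgumentNullException(nameof(startNode));
        if (endNode == null) throw new ArgumentNullException(nameof(endNode));

        CompositeNode parent = startNode.ParentNode;
        if (parent == null || parent != endNode.ParentNode)
            throw new ArgumentException("Start and end nodes must share the same parent.");
        if (parent.IndexOf(startNode) > parent.IndexOf(endNode))
            throw new ArgumentException("The end node must not come before the start node.");

        var nodes = new List<Node>();
        for (Node current = startNode; current != null; current = current.NextSibling)
        {
            bool isMarker = current == startNode || current == endNode;
            // Clone each node so the source document is left untouched.
            if (isInclusive || !isMarker)
                nodes.Add(current.Clone(true));
            if (current == endNode)
                break;
        }
        return nodes;
    }

    static void Main()
    {
        // Sample document built in memory (assumed content).
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        for (int i = 1; i <= 5; i++)
            builder.Writeln($"Paragraph {i}");

        Body body = doc.FirstSection.Body;
        Node start = body.Paragraphs[1]; // second paragraph
        Node end = body.Paragraphs[3];   // fourth paragraph

        // Inclusive keeps the markers; exclusive keeps only what lies between them.
        Console.WriteLine(ExtractContent(start, end, true).Count);
        Console.WriteLine(ExtractContent(start, end, false).Count);
    }
}
```

With isInclusive set to false and the same node or two consecutive nodes passed, the loop adds nothing and the method returns an empty list, which matches the behavior described for the IsInclusive parameter.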

Some helper methods are also required by the ExtractContent method to accomplish the text extraction operation, which are given below.
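As a hedged sketch of what such helpers can look like, the two methods below use the names that appear later in this post, GenerateDocument and ParagraphsByStyleName. Their bodies are an assumption built from standard Aspose.Words calls (GetChildNodes, NodeImporter), not necessarily the exact code referred to above. The class name ExtractionHelpers and the sample content are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;

static class ExtractionHelpers
{
    // Collects every paragraph whose style name matches styleName.
    public static List<Paragraph> ParagraphsByStyleName(Document doc, string styleName)
    {
        var matches = new List<Paragraph>();
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
        {
            if (para.ParagraphFormat.Style.Name == styleName)
                matches.Add(para);
        }
        return matches;
    }

    // Imports the given nodes into a new document, keeping source formatting.
    public static Document GenerateDocument(Document srcDoc, List<Node> nodes)
    {
        Document dstDoc = new Document();
        // A blank document starts with one empty paragraph; remove it.
        dstDoc.FirstSection.Body.RemoveAllChildren();

        // Nodes cannot be moved between documents directly; they must be imported.
        var importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
        foreach (Node node in nodes)
            dstDoc.FirstSection.Body.AppendChild(importer.ImportNode(node, true));
        return dstDoc;
    }

    static void Main()
    {
        // Sample document built in memory (assumed content).
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Heading1;
        builder.Writeln("Title");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Normal;
        builder.Writeln("Body text.");

        // Copy all "Heading 1" paragraphs into a new document.
        var nodes = new List<Node>();
        foreach (Paragraph heading in ParagraphsByStyleName(doc, "Heading 1"))
            nodes.Add(heading.Clone(true));

        Document result = GenerateDocument(doc, nodes);
        Console.WriteLine(result.ToString(SaveFormat.Text));
    }
}
```

ImportFormatMode.KeepSourceFormatting is chosen here so that the copied content keeps the look it had in the source document; UseDestinationStyles is the alternative when the new document should apply its own styles.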

Now we are ready to use these methods and extract text from a Word document in C#.

Extract Text between Paragraphs of a Word Document

Let’s see how to extract content between two paragraphs in a Word DOCX document. The following steps perform this operation in C#.

  • First, load the Word document using the Document class.
  • Get references to the starting and ending paragraphs into two objects using the Document.FirstSection.Body.GetChild(NodeType.Paragraph, int, bool) method.
  • Call ExtractContent(startPara, endPara, true) method to extract the nodes into an object.
  • Call GenerateDocument(Document, extractedNodes) helper method to create a document consisting of the extracted content.
  • Finally, save the returned document using Document.Save(string) method.

The following code sample shows how to extract text between the 7th and 11th paragraphs of a Word document in C#.
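The steps above can be sketched as follows. To keep the example self-contained, it builds a twelve-paragraph document in memory instead of loading a file, and it includes compact stand-ins for ExtractContent and GenerateDocument. Those stand-ins are simplifying assumptions: they only handle block-level nodes that share a parent, with the end node following the start node. The output file name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;

class ExtractBetweenParagraphs
{
    static void Main()
    {
        // Sample document built in memory; replace with new Document("input.docx").
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        for (int i = 1; i <= 12; i++)
            builder.Writeln($"Paragraph {i}");

        // Indexes are zero-based, so 6 and 10 select the 7th and 11th paragraphs.
        Body body = doc.FirstSection.Body;
        Paragraph startPara = (Paragraph)body.GetChild(NodeType.Paragraph, 6, true);
        Paragraph endPara = (Paragraph)body.GetChild(NodeType.Paragraph, 10, true);

        List<Node> extractedNodes = ExtractContent(startPara, endPara, true);
        Document dstDoc = GenerateDocument(doc, extractedNodes);
        dstDoc.Save("extracted-paragraphs.docx");
    }

    // Simplified stand-in: sibling block-level nodes only, end after start.
    static List<Node> ExtractContent(Node startNode, Node endNode, bool isInclusive)
    {
        if (startNode.ParentNode != endNode.ParentNode)
            throw new ArgumentException("Nodes must share the same parent.");
        var nodes = new List<Node>();
        for (Node n = startNode; n != null; n = n.NextSibling)
        {
            if (isInclusive || (n != startNode && n != endNode))
                nodes.Add(n.Clone(true));
            if (n == endNode) break;
        }
        return nodes;
    }

    // Sketch of GenerateDocument: import the nodes into a fresh document.
    static Document GenerateDocument(Document srcDoc, List<Node> nodes)
    {
        Document dstDoc = new Document();
        dstDoc.FirstSection.Body.RemoveAllChildren();
        var importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
        foreach (Node node in nodes)
            dstDoc.FirstSection.Body.AppendChild(importer.ImportNode(node, true));
        return dstDoc;
    }
}
```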

Extract Text between Different Types of Nodes

You can also extract content between different types of nodes. For demonstration, let’s extract content between a paragraph and a table and save it into a new Word document. The following steps perform this operation.

  • Load the Word document using the Document class.
  • Get references to the starting and ending nodes into two objects using the Document.FirstSection.Body.GetChild(NodeType, int, bool) method.
  • Call ExtractContent(startPara, endTable, true) method to extract the nodes into an object.
  • Call GenerateDocument(Document, extractedNodes) helper method to create a document consisting of the extracted content.
  • Save the returned document using Document.Save(string) method.

The following code sample shows how to extract text between a paragraph and a table in C#.
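A minimal sketch of these steps is shown below. It assumes a small sample document built in memory (two paragraphs followed by a one-row table) rather than a loaded file, and it uses compact stand-ins for ExtractContent and GenerateDocument that only work when the paragraph and the table are direct children of the same Body. The output file name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;
using Aspose.Words.Tables;

class ExtractParagraphToTable
{
    static void Main()
    {
        // Sample document built in memory; replace with new Document("input.docx").
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.Writeln("Introduction.");
        builder.Writeln("Text right before the table.");
        builder.StartTable();
        builder.InsertCell();
        builder.Write("Cell A1");
        builder.InsertCell();
        builder.Write("Cell B1");
        builder.EndRow();
        builder.EndTable();

        // Start at the second paragraph and end at the first table.
        Body body = doc.FirstSection.Body;
        Paragraph startPara = (Paragraph)body.GetChild(NodeType.Paragraph, 1, true);
        Table endTable = (Table)body.GetChild(NodeType.Table, 0, true);

        List<Node> extractedNodes = ExtractContent(startPara, endTable, true);
        Document dstDoc = GenerateDocument(doc, extractedNodes);
        dstDoc.Save("extracted-paragraph-to-table.docx");
    }

    // Simplified stand-in: sibling block-level nodes only, end after start.
    static List<Node> ExtractContent(Node startNode, Node endNode, bool isInclusive)
    {
        if (startNode.ParentNode != endNode.ParentNode)
            throw new ArgumentException("Nodes must share the same parent.");
        var nodes = new List<Node>();
        for (Node n = startNode; n != null; n = n.NextSibling)
        {
            if (isInclusive || (n != startNode && n != endNode))
                nodes.Add(n.Clone(true));
            if (n == endNode) break;
        }
        return nodes;
    }

    // Sketch of GenerateDocument: import the nodes into a fresh document.
    static Document GenerateDocument(Document srcDoc, List<Node> nodes)
    {
        Document dstDoc = new Document();
        dstDoc.FirstSection.Body.RemoveAllChildren();
        var importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
        foreach (Node node in nodes)
            dstDoc.FirstSection.Body.AppendChild(importer.ImportNode(node, true));
        return dstDoc;
    }
}
```

Because GetChild is called with isDeep set to true, the paragraph index also counts paragraphs inside table cells, so in a real document it is worth checking that the chosen nodes are the ones you expect.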

Fetch Text between Paragraphs based on Styles

Let’s now check out how to extract content between paragraphs based on styles. For demonstration, we are going to extract content between the first “Heading 1” and the first “Heading 3” in the Word document. The following steps demonstrate how to achieve this in C#.

  • First, load the Word document using the Document class.
  • Then, extract paragraphs into an object using ParagraphsByStyleName(Document, "Heading 1") helper method.
  • Extract paragraphs into another object using ParagraphsByStyleName(Document, "Heading 3") helper method.
  • Call ExtractContent(startPara, endPara, true) method and pass the first elements in both paragraph arrays as first and second parameters.
  • Call GenerateDocument(Document, extractedNodes) helper method to create a document consisting of the extracted content.
  • Finally, save the returned document using Document.Save(string) method.

The following code sample shows how to extract content between paragraphs based on styles.
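The style-based steps can be sketched as follows. The sample document, built in memory with one "Heading 1" and one "Heading 3" paragraph, is an assumption for illustration, as are the compact stand-ins for ParagraphsByStyleName, ExtractContent, and GenerateDocument. The stand-in for ExtractContent only supports block-level nodes with the same parent. The output file name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Aspose.Words;

class ExtractBetweenStyles
{
    static void Main()
    {
        // Sample document built in memory; replace with new Document("input.docx").
        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Heading1;
        builder.Writeln("Chapter title");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Normal;
        builder.Writeln("Text under the chapter title.");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Heading3;
        builder.Writeln("Subsection title");
        builder.ParagraphFormat.StyleIdentifier = StyleIdentifier.Normal;
        builder.Writeln("Text under the subsection title.");

        List<Paragraph> parasStyleHeading1 = ParagraphsByStyleName(doc, "Heading 1");
        List<Paragraph> parasStyleHeading3 = ParagraphsByStyleName(doc, "Heading 3");
        if (parasStyleHeading1.Count == 0 || parasStyleHeading3.Count == 0)
            throw new InvalidOperationException("Required heading styles were not found.");

        // Use the first paragraph of each style as the start and end markers.
        List<Node> extractedNodes =
            ExtractContent(parasStyleHeading1[0], parasStyleHeading3[0], true);
        Document dstDoc = GenerateDocument(doc, extractedNodes);
        dstDoc.Save("extracted-by-style.docx");
    }

    // Sketch of ParagraphsByStyleName: match on the paragraph's style name.
    static List<Paragraph> ParagraphsByStyleName(Document doc, string styleName)
    {
        var matches = new List<Paragraph>();
        foreach (Paragraph para in doc.GetChildNodes(NodeType.Paragraph, true))
            if (para.ParagraphFormat.Style.Name == styleName)
                matches.Add(para);
        return matches;
    }

    // Simplified stand-in: sibling block-level nodes only, end after start.
    static List<Node> ExtractContent(Node startNode, Node endNode, bool isInclusive)
    {
        if (startNode.ParentNode != endNode.ParentNode)
            throw new ArgumentException("Nodes must share the same parent.");
        var nodes = new List<Node>();
        for (Node n = startNode; n != null; n = n.NextSibling)
        {
            if (isInclusive || (n != startNode && n != endNode))
                nodes.Add(n.Clone(true));
            if (n == endNode) break;
        }
        return nodes;
    }

    // Sketch of GenerateDocument: import the nodes into a fresh document.
    static Document GenerateDocument(Document srcDoc, List<Node> nodes)
    {
        Document dstDoc = new Document();
        dstDoc.FirstSection.Body.RemoveAllChildren();
        var importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KeepSourceFormatting);
        foreach (Node node in nodes)
            dstDoc.FirstSection.Body.AppendChild(importer.ImportNode(node, true));
        return dstDoc;
    }
}
```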

Read More about Text Extraction

You can explore other text extraction scenarios for Word documents in .NET using this documentation article.

Get Free Word Text Extractor Library

You can get a free temporary license to extract text without evaluation limitations.

Conclusion

Aspose.Words for .NET is a versatile library that simplifies extracting text from Word documents in C# while preserving formatting. With its extensive features and easy-to-use API, you can efficiently work with Word documents and automate a range of text extraction scenarios. Whether you’re building applications that need to process Word documents or simply extracting text, Aspose.Words for .NET is a valuable tool for developers.

You can explore other features of Aspose.Words for .NET using the documentation. In case you have any questions, feel free to let us know via our forum.
