Extracting Text from OneNote Files with Python

If you need to read text from Microsoft OneNote .one files in a Python script, without installing Microsoft Office or running Windows, Aspose.Note FOSS for Python can do it. It is a free, open-source library that parses the OneNote binary format directly and exposes a clean Python API.

Installation

pip install aspose-note

No API key required. No license file required. No Microsoft Office required.

The simplest approach: GetChildNodes(RichText)

OneNote text is stored in RichText nodes scattered across pages, outlines, and outline elements. GetChildNodes(RichText) performs a recursive search over the entire document tree and returns every text node as a flat list:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

This is the fastest way to get all text content out of a .one file.

Saving Text to a File

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Saved {len(lines)} text blocks to extracted.txt")

Extracting Text by Page

When you need to know which page each text block comes from:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    page_texts = [rt.Text for rt in page.GetChildNodes(RichText) if rt.Text]
    print(f"\n=== {title} ===")
    for text in page_texts:
        print(text)
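
If you would rather collect the per-page text into a data structure than print it, the same loop can be wrapped in a small helper. The sketch below is only an illustration: page_title and texts_by_page are hypothetical names, and FakePage and FakeRichText are stand-ins that mimic the attributes used above (Title, TitleText, Text, GetChildNodes) so the example runs without a .one file.

```python
from types import SimpleNamespace


def page_title(page):
    """Return the page's title text, or "(untitled)" when any part is missing."""
    title = getattr(page, "Title", None)
    title_text = getattr(title, "TitleText", None)
    return getattr(title_text, "Text", None) or "(untitled)"


def texts_by_page(pages, rich_text_type):
    """Return [(title, [text, ...]), ...], skipping empty text blocks."""
    return [
        (page_title(page),
         [rt.Text for rt in page.GetChildNodes(rich_text_type) if rt.Text])
        for page in pages
    ]


# Hypothetical stand-ins so the sketch runs without a .one file.
class FakeRichText:
    def __init__(self, text):
        self.Text = text


class FakePage:
    def __init__(self, title, texts):
        self.Title = title
        self._nodes = [FakeRichText(t) for t in texts]

    def GetChildNodes(self, node_type):
        return [n for n in self._nodes if isinstance(n, node_type)]


titled = SimpleNamespace(TitleText=SimpleNamespace(Text="Plan"))
demo_pages = [FakePage(titled, ["a", "", "b"]), FakePage(None, ["c"])]
for title, blocks in texts_by_page(demo_pages, FakeRichText):
    print(title, blocks)
```

With a real document, the call would presumably look like texts_by_page(doc.GetChildNodes(Page), RichText), using the same classes imported in the example above.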

Extracting Hyperlinks

Hyperlinks are stored on individual TextRun objects inside RichText nodes. Check run.Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"{run.Text!r}  ->  {run.Style.HyperlinkAddress}")
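
A notebook often links to the same address more than once, so it can help to collect each address only the first time it appears. This is a minimal sketch, not part of the library: collect_links is a hypothetical helper, and the SimpleNamespace objects are stand-ins that carry only the attributes used above (TextRuns, Text, Style.IsHyperlink, Style.HyperlinkAddress).

```python
from types import SimpleNamespace


def collect_links(rich_texts):
    """Return (link text, address) pairs, keeping the first use of each address."""
    seen = set()
    links = []
    for rt in rich_texts:
        for run in rt.TextRuns:
            style = run.Style
            address = style.HyperlinkAddress if style.IsHyperlink else None
            if address and address not in seen:
                seen.add(address)
                links.append((run.Text, address))
    return links


def fake_run(text, address=None):
    """Hypothetical stand-in for a TextRun with just the fields read above."""
    style = SimpleNamespace(IsHyperlink=address is not None, HyperlinkAddress=address)
    return SimpleNamespace(Text=text, Style=style)


demo = [
    SimpleNamespace(TextRuns=[fake_run("See "), fake_run("docs", "https://example.com/docs")]),
    SimpleNamespace(TextRuns=[fake_run("docs again", "https://example.com/docs")]),
]
print(collect_links(demo))
```

With a real document, the argument would be doc.GetChildNodes(RichText).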

Detecting Formatting: Bold, Italic, Underline

Each TextRun carries character-level formatting via TextStyle:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.TextRuns:
        s = run.Style
        if any([s.IsBold, s.IsItalic, s.IsUnderline]):
            flags = ", ".join(f for f, v in [
                ("bold", s.IsBold), ("italic", s.IsItalic), ("underline", s.IsUnderline)
            ] if v)
            print(f"[{flags}] {run.Text.strip()!r}")
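
One use for these flags is to keep emphasis when exporting to another format. The sketch below maps each run to Markdown-style markers. It is an assumption-laden illustration: run_to_markdown is a hypothetical helper, underline is rendered with an inline <u> tag because Markdown has no underline syntax, and the SimpleNamespace objects stand in for real TextRun instances.

```python
from types import SimpleNamespace


def run_to_markdown(run):
    """Wrap a run's text in Markdown markers based on its style flags."""
    text = run.Text or ""
    core = text.strip()
    if not core:
        return text  # whitespace-only runs are passed through untouched
    # Keep surrounding whitespace outside the markers so "**word** " stays valid.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    style = run.Style
    marker = ("**" if style.IsBold else "") + ("*" if style.IsItalic else "")
    if marker:
        core = f"{marker}{core}{marker}"
    if style.IsUnderline:
        core = f"<u>{core}</u>"
    return lead + core + trail


def fake_run(text, bold=False, italic=False, underline=False):
    """Hypothetical stand-in for a TextRun with the three flags used above."""
    style = SimpleNamespace(IsBold=bold, IsItalic=italic, IsUnderline=underline)
    return SimpleNamespace(Text=text, Style=style)


runs = [fake_run("Hello ", bold=True), fake_run("world", italic=True)]
print("".join(run_to_markdown(r) for r in runs))
```

For a real RichText node, the same join would run over rt.TextRuns.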

Reading from a Stream

This works with cloud storage, HTTP response bodies, or in-memory buffers:

import io
from aspose.note import Document, RichText

# Example: load from bytes already in memory
with open("MyNotes.one", "rb") as f:
    one_bytes = f.read()
doc = Document(io.BytesIO(one_bytes))
texts = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]
print(f"Extracted {len(texts)} text block(s)")
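
For the HTTP case, one possible approach is to read the whole response body into memory and wrap it in io.BytesIO, the same stream type the example above passes to Document. This is a sketch under assumptions: fetch_as_stream is a hypothetical helper, the URL is a placeholder, and the opener parameter exists only so the function can be exercised without network access.

```python
import io
import urllib.request


def fetch_as_stream(url, opener=urllib.request.urlopen):
    """Download url and return its body as a seekable in-memory stream."""
    with opener(url) as response:
        return io.BytesIO(response.read())


# Offline demonstration: a fake opener that returns fixed bytes.
stream = fetch_as_stream(
    "https://example.com/MyNotes.one",  # placeholder URL, never contacted here
    opener=lambda _url: io.BytesIO(b"fake .one bytes"),
)
print(len(stream.getvalue()))

# With a real URL the stream would then be handed to the library:
#   doc = Document(fetch_as_stream(url))
```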

Fixing Windows Encoding

On Windows terminals, sys.stdout may use a legacy encoding that raises an error when it meets a Unicode character it cannot represent. Add this at the top of your script:

import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
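
To see why this matters, the failure can be reproduced without a Windows terminal by encoding a string by hand. The sketch below uses cp1252 as an example of a legacy code page (the actual console encoding varies by system) and shows how errors="replace" behaves compared with strict encoding and with UTF-8.

```python
text = "Done ✓"  # U+2713 has no mapping in cp1252

# Strict encoding to a legacy code page fails on the check mark.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" substitutes "?" instead of raising.
replaced = text.encode("cp1252", errors="replace")

# UTF-8 can represent the character, so nothing is lost.
utf8 = text.encode("utf-8")

print(strict_ok, replaced, utf8.decode("utf-8") == text)
```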

What the Library Supports

Feature	Supported
Read `.one` files (path or stream)	Yes
Extract `RichText.Text` (plain text)	Yes
Inspect `TextRun.Style` (bold, italic, hyperlink, font)	Yes
Extract text from table cells	Yes
Read page titles	Yes
Write back to `.one`	No
Encrypted documents	No