Extracting Text from OneNote Files with Python

If you need to read text from Microsoft OneNote .one files in a Python script, without installing Microsoft Office or running Windows, Aspose.Note FOSS for Python does exactly that. It is a 100% free, open-source library that parses the OneNote binary format directly and exposes a clean Python API.

Installation

pip install aspose-note

No API key. No license file. No Microsoft Office.

The simplest approach: GetChildNodes(RichText)

OneNote stores text in RichText nodes spread across pages, outlines, and outline elements. GetChildNodes(RichText) searches the entire document tree recursively and returns every text node as a flat list:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    if rt.Text:
        print(rt.Text)

This is the fastest way to extract all text content from a .one file.

Saving the text to a file

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
lines = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]

with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(f"Saved {len(lines)} text blocks to extracted.txt")

Text per page

When you need to know which page each text block came from:

from aspose.note import Document, Page, RichText

doc = Document("MyNotes.one")
for page in doc.GetChildNodes(Page):
    title = (
        page.Title.TitleText.Text
        if page.Title and page.Title.TitleText
        else "(untitled)"
    )
    page_texts = [rt.Text for rt in page.GetChildNodes(RichText) if rt.Text]
    print(f"\n=== {title} ===")
    for text in page_texts:
        print(text)
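
If the per-page text needs further processing rather than printing, it can be collected into a dictionary first. The following is a minimal sketch, not part of the library: `page_title` and `group_text_by_page` are hypothetical helper names, and they rely only on the attributes used in the snippet above (`Title`, `TitleText`, `Text`, `GetChildNodes`). Keys are prefixed with the page position so that two pages with the same title do not overwrite each other.

```python
def page_title(page, default="(untitled)"):
    """Return the page's title text, or `default` when any part is missing."""
    title = getattr(page, "Title", None)
    title_text = getattr(title, "TitleText", None) if title else None
    text = getattr(title_text, "Text", None) if title_text else None
    return text.strip() if text and text.strip() else default


def group_text_by_page(pages, rich_text_type):
    """Map "NNN title" to the list of non-empty text blocks on that page.

    `pages` is any iterable of page-like objects; `rich_text_type` is the
    node type handed to GetChildNodes (RichText in the article's snippets).
    """
    grouped = {}
    for index, page in enumerate(pages, start=1):
        key = f"{index:03d} {page_title(page)}"
        grouped[key] = [
            rt.Text for rt in page.GetChildNodes(rich_text_type) if rt.Text
        ]
    return grouped


# Assumed usage with the objects from the snippet above:
#   grouped = group_text_by_page(doc.GetChildNodes(Page), RichText)
```

The resulting dictionary keeps pages in document order and can be passed straight to `json.dump` for export.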

Extracting hyperlinks

Hyperlinks are stored on TextRun objects inside RichText nodes. Check run.Style.IsHyperlink:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.Runs:
        if run.Style.IsHyperlink and run.Style.HyperlinkAddress:
            print(f"{run.Text!r}  ->  {run.Style.HyperlinkAddress}")
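
A notebook often links to the same address more than once. As a sketch under the same assumptions as the loop above (runs expose `Text`, `Style.IsHyperlink`, and `Style.HyperlinkAddress`), a hypothetical helper `collect_links` can return each URL once, in document order:

```python
def collect_links(rich_texts):
    """Return (link text, URL) pairs in document order, one per unique URL."""
    seen = set()
    links = []
    for rt in rich_texts:
        for run in rt.Runs:
            style = run.Style
            # Only runs flagged as hyperlinks carry a usable address.
            url = style.HyperlinkAddress if style.IsHyperlink else None
            if url and url not in seen:
                seen.add(url)
                links.append((run.Text.strip(), url))
    return links


# Assumed usage with the objects from the snippet above:
#   for text, url in collect_links(doc.GetChildNodes(RichText)):
#       print(f"{text!r}  ->  {url}")
```

When a URL repeats, the link text of its first occurrence is the one that is kept.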

Detecting formatting: bold, italic, underline

Each TextRun carries per-character formatting in its TextStyle, exposed as run.Style:

from aspose.note import Document, RichText

doc = Document("MyNotes.one")
for rt in doc.GetChildNodes(RichText):
    for run in rt.Runs:
        s = run.Style
        if any([s.Bold, s.Italic, s.Underline]):
            flags = ", ".join(f for f, v in [
                ("bold", s.Bold), ("italic", s.Italic), ("underline", s.Underline)
            ] if v)
            print(f"[{flags}] {run.Text.strip()!r}")
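
The same flags can drive a simple conversion to Markdown. This is a minimal sketch rather than a library feature: `run_to_markdown` and `rich_text_to_markdown` are hypothetical names, only `Text`, `Style.Bold`, `Style.Italic`, and `Style.Underline` are read, and Markdown special characters in the text are not escaped. Underline has no Markdown syntax, so the sketch falls back to an inline `<u>` tag.

```python
def run_to_markdown(run):
    """Wrap one run's text in Markdown markers according to its style flags."""
    text = run.Text
    core = text.strip()
    if not core:
        return text  # whitespace-only runs pass through unchanged
    s = run.Style
    if s.Bold and s.Italic:
        core = f"***{core}***"
    elif s.Bold:
        core = f"**{core}**"
    elif s.Italic:
        core = f"*{core}*"
    if s.Underline:
        core = f"<u>{core}</u>"
    # Keep surrounding whitespace outside the markers so emphasis still renders.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return lead + core + trail


def rich_text_to_markdown(rt):
    """Concatenate the Markdown form of every run in a RichText-like node."""
    return "".join(run_to_markdown(run) for run in rt.Runs)
```

Moving leading and trailing spaces outside the markers matters because `** bold **` is not rendered as bold by most Markdown parsers.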

Reading from a stream

Works with cloud storage, HTTP responses, or data already in memory:

import io
from aspose.note import Document, RichText

# Example: load from bytes already in memory
with open("MyNotes.one", "rb") as f:
    one_bytes = f.read()

# Wrap the bytes in a file-like object before passing them to Document
doc = Document(io.BytesIO(one_bytes))
texts = [rt.Text for rt in doc.GetChildNodes(RichText) if rt.Text]
print(f"Extracted {len(texts)} text block(s)")
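
For the HTTP case, the bytes have to be downloaded before they can be wrapped in a stream. The sketch below uses only the standard library; `fetch_one_stream` is a hypothetical helper name and the URL in the usage comment is a placeholder, not a real endpoint. It reads the whole response into memory, which is reasonable for typical notebook sections but not for very large files.

```python
import io
import urllib.request


def fetch_one_stream(url, timeout=30):
    """Download a .one file and return it as a seekable in-memory stream."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return io.BytesIO(response.read())


# Assumed usage with the Document class from the snippets above:
#   stream = fetch_one_stream("https://example.com/MyNotes.one")  # placeholder URL
#   doc = Document(stream)
```

Returning a `BytesIO` rather than the raw response gives the parser a seekable object, which HTTP response objects are not.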

Windows encoding fix

On Windows terminals, sys.stdout may use a legacy encoding that fails on Unicode characters. Add this at the start of your script:

import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

What the library supports

Feature	Supported
Reading `.one` files (path or stream)	Yes
Extracting `RichText.Text` (plain text)	Yes
Inspecting `TextRun.Style` (bold, italic, hyperlink, font)	Yes
Extracting text from table cells	Yes
Reading page titles	Yes
Writing back to `.one`	No
Encrypted documents	No