Docx Loader

WordLoader

Bases: UnstructuredLoader

Reads data from Word documents.

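Note that specifying min_chunk_size does not guarantee that every output element strictly satisfies min_chunk_size. When elements are adjusted, each element is compared with its neighbor, and the two are merged only if they are the same type of element.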

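The design aims to preserve as much of the text's metadata as possible. Element sizes are fine-tuned only where metadata stays intact or its loss is minimal. This keeps overly short elements from letting a single phrase dominate retrieval, and keeps overly long elements from scoring too low on similarity (embedding applies its own optimization for the latter case). Stricter size limits need to be enforced at retrieval time.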
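
The merge rule described above can be illustrated with a minimal sketch. It is not the loader's actual `_adjust_chunk_size` implementation: the `Element` class and the `merge_short_elements` function are hypothetical stand-ins, and the sketch only covers the lower bound (`min_chunk_size`).

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""

    kind: str  # element type, e.g. "Title" or "NarrativeText"
    text: str


def merge_short_elements(elements: list[Element], min_chunk_size: int) -> list[Element]:
    """Merge an element into its predecessor only when the predecessor is
    shorter than min_chunk_size AND both elements share the same kind."""
    merged: list[Element] = []
    for el in elements:
        prev = merged[-1] if merged else None
        if prev is not None and len(prev.text) < min_chunk_size and prev.kind == el.kind:
            merged[-1] = Element(prev.kind, prev.text + "\n" + el.text)
        else:
            merged.append(el)
    return merged


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "Short."),
    Element("NarrativeText", "Another short paragraph."),
    Element("Title", "End"),
]
result = merge_short_elements(elements, min_chunk_size=20)
```

In this sketch the two narrative paragraphs are merged, while the title "Intro" stays shorter than `min_chunk_size` because its neighbor is a different kind of element. That is the situation the note above warns about.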

Attributes:

Name	Type	Description
`mime_types`	`set[Literal[DOC, DOCX]]`	Supported MIME types. Defaults to {DOC, DOCX}.

load

load(path_or_uri: str, *, file: bytes | IO[bytes] | None = None, content_type: Optional[str] = None, **kwargs: Any) -> Document

Load a document from the given URI or file object and return a Document.
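
As a rough illustration of the call shape, the sketch below passes in-memory bytes together with an explicit content type. Only the `load` signature is taken from this page. `SupportsLoad`, `load_docx_bytes`, and `RecordingLoader` are hypothetical names introduced so the sketch runs without TFRobot installed.

```python
from typing import IO, Any, Optional, Protocol, Union

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


class SupportsLoad(Protocol):
    """Hypothetical structural type mirroring the documented load signature."""

    def load(
        self,
        path_or_uri: str,
        *,
        file: Union[bytes, IO[bytes], None] = None,
        content_type: Optional[str] = None,
        **kwargs: Any,
    ) -> Any: ...


def load_docx_bytes(loader: SupportsLoad, path_or_uri: str, data: bytes) -> Any:
    """Load DOCX content that is already held in memory.

    path_or_uri is still passed because it is a required argument.
    """
    return loader.load(path_or_uri, file=data, content_type=DOCX_MIME)


class RecordingLoader:
    """Stub that records the call; it only makes this sketch runnable."""

    def load(self, path_or_uri, *, file=None, content_type=None, **kwargs):
        return {"path_or_uri": path_or_uri, "file": file, "content_type": content_type}


call = load_docx_bytes(RecordingLoader(), "report.docx", b"fake docx bytes")
```

With the real loader, the first argument would be a `WordLoader` instance and the return value would be a `Document`.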

Parameters:

Name	Type	Description	Default
`path_or_uri`	`str`	File path or URI of the document (required).	required
`file`	`bytes \| IO[bytes] \| None`	File content as bytes or an IO[bytes] object. When provided, it is parsed instead of the content read from path_or_uri; path_or_uri is still required and is used to derive the file name and URI.	`None`
`content_type`	`Optional[str]`	The content type (MIME type) of the document; auto-detected if not provided.	`None`
`**kwargs`	`Any`	Additional keyword arguments. The implementation reads `filename`, `language` (defaults to `["zho"]`), and `unique_element_ids` (must stay `True`), and discards `url`.	`{}`
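
The way the `file` argument is turned into a readable stream can be sketched as follows. This is a simplified illustration, not the loader's code: `to_file_obj` is a hypothetical helper, and `open_fallback` stands in for whatever the loader uses to open `path_or_uri` when `file` is omitted.

```python
from io import BytesIO
from typing import IO, Callable, Union


def to_file_obj(
    file: Union[bytes, IO[bytes], None],
    open_fallback: Callable[[], IO[bytes]],
) -> IO[bytes]:
    """Return a binary stream for parsing.

    bytes are wrapped in BytesIO, existing streams are passed through
    unchanged, and None falls back to opening the document by its URI.
    """
    if isinstance(file, bytes):
        return BytesIO(file)
    if file is not None:
        return file
    return open_fallback()


def fallback() -> IO[bytes]:
    """Stand-in for reading the document from path_or_uri."""
    return BytesIO(b"from uri")


from_bytes = to_file_obj(b"raw bytes", fallback).read()
stream = BytesIO(b"already a stream")
same_stream = to_file_obj(stream, fallback) is stream
from_uri = to_file_obj(None, fallback).read()
```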

Returns:

Name	Type	Description
`Document`	`Document`	The loaded document, as a Document object containing the document's content.

Raises:

Type	Description
`ValueError`	If `unique_element_ids` is disabled, if the file type cannot be detected, or if the detected file type is not supported.
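
When `content_type` is omitted, the file type has to be detected from the content. The sketch below is a simplified stand-in for that step and is not how `unstructured` detects file types. It relies on two well-known signatures: the OLE2 compound-file header used by legacy `.doc` files, and the ZIP container used by `.docx` files. `sniff_word_mime` and `require_word_mime` are hypothetical names.

```python
import zipfile
from typing import IO, Optional

DOC_MIME = "application/msword"
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file header


def sniff_word_mime(stream: IO[bytes]) -> Optional[str]:
    """Guess a Word MIME type from the stream content, restoring its position."""
    start = stream.tell()
    head = stream.read(8)
    stream.seek(start)
    if head == OLE2_MAGIC:
        # Other legacy Office formats share this header, so this is only a guess.
        return DOC_MIME
    if head.startswith(b"PK\x03\x04"):
        try:
            with zipfile.ZipFile(stream) as zf:
                # A DOCX package stores its main part at word/document.xml.
                is_docx = "word/document.xml" in zf.namelist()
        except zipfile.BadZipFile:
            is_docx = False
        finally:
            stream.seek(start)
        return DOCX_MIME if is_docx else None
    return None


def require_word_mime(stream: IO[bytes]) -> str:
    """Raise ValueError when the content does not look like a Word document."""
    mime = sniff_word_mime(stream)
    if mime is None:
        raise ValueError("unsupported or unrecognized file type")
    return mime
```

Passing `content_type` explicitly avoids the detection step and selects the matching parser directly.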

Source code in tfrobot/utils/document_loaders/docx.py

def load(
    self,
    path_or_uri: str,
    *,
    file: bytes | IO[bytes] | None = None,
    content_type: Optional[str] = None,
    **kwargs: Any,
) -> Document:
    """
    Load a document from the given URI or file object and return a Document.

    Args:
        path_or_uri: File path or URI of the document (required).
        file: File content as bytes or an IO[bytes] object. When provided, it is parsed
            instead of the content read from path_or_uri.
        content_type: The content type (MIME type) of the document; auto-detected if not provided.
        **kwargs: Additional keyword arguments (filename, language, unique_element_ids; url is discarded).

    Returns:
        Document: The loaded document.

    Raises:
        ValueError: If unique_element_ids is disabled, or if the file type cannot be
            detected or is not supported.
    """
    uri = convert_to_file_url(path_or_uri)
    # Handle the file object
    file_obj = BytesIO(file) if isinstance(file, bytes) else file if file is not None else self.get_file_obj(uri)
    unique_element_ids = kwargs.pop("unique_element_ids", True)  # whether to use unique element IDs
    filename = kwargs.pop(
        "filename", get_filename_from_uri(uri)
    )  # TFRobot's partition wrappers always parse from an IO stream; passing filename as well makes UnstructuredIO raise an error.
    kwargs.pop("url", None)  # TFRobot's partition wrappers always parse from an IO stream; passing url as well makes UnstructuredIO raise an error.
    language = kwargs.pop("language", ["zho"])  # language parameter, defaults to Chinese
    if not unique_element_ids:
        # Message: "TFRobot must enable unique element IDs to ensure documents are processed correctly."
        raise ValueError("TFRobot必须启用唯一元素ID以确保正确处理文档。")
    if content_type == "application/msword":
        from unstructured.partition.doc import partition_doc

        els = partition_doc(file=file_obj, language=language, unique_element_ids=unique_element_ids)
        f_type = TFFileType.from_mime_type(content_type)
    elif content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        from unstructured.partition.docx import partition_docx

        els = partition_docx(file=file_obj, language=language, unique_element_ids=unique_element_ids)
        f_type = TFFileType.from_mime_type(content_type)
    else:
        from unstructured.file_utils.filetype import detect_filetype

        if not (ft := detect_filetype(file=file_obj, metadata_file_path=filename)):
            # Message: "Unable to detect the file type"
            raise ValueError("无法识别文件类型")
        else:
            f_type = TFFileType.from_mime_type(ft.mime_type)
        if f_type not in self.mime_types:
            # Message: "Unsupported file type: {f_type}"
            raise ValueError(f"不支持的文件类型：{f_type}")
        from unstructured.partition.auto import partition

        els = partition(file=file_obj, language=language, unique_element_ids=unique_element_ids)
    if f_type is None:
        # Message: "The file type was not identified correctly"
        raise ValueError("未正确识别文件类型")
    # Drop empty elements
    all_els = [
        create_element_by_unstructured_element(e, filename=filename) for e in els if e and e.id and str(e).strip()
    ]

    # If a chunk size is configured, adjust the chunks accordingly
    if self.min_chunk_size or self.max_chunk_size:
        all_els = self._adjust_chunk_size(all_els)
    return Document.from_elements(all_els, file_uri=AnyUrl(uri), file_type=f_type)
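
The keyword-argument handling at the top of the listing can be isolated in a small sketch. The defaults mirror the `kwargs.pop` calls above. The function name `extract_loader_options` is hypothetical, and the English error message is a paraphrase rather than the library's actual text.

```python
from typing import Any


def extract_loader_options(kwargs: dict[str, Any], default_filename: str) -> dict[str, Any]:
    """Pop the options the loader handles itself out of kwargs (mutating it)."""
    unique_element_ids = kwargs.pop("unique_element_ids", True)
    filename = kwargs.pop("filename", default_filename)
    kwargs.pop("url", None)  # discarded: parsing always works on an IO stream
    language = kwargs.pop("language", ["zho"])  # defaults to Chinese
    if not unique_element_ids:
        raise ValueError("unique element IDs must stay enabled")
    return {
        "filename": filename,
        "language": language,
        "unique_element_ids": unique_element_ids,
    }


caller_kwargs = {"language": ["eng"], "url": "https://example.com/a.docx", "extra": 1}
options = extract_loader_options(caller_kwargs, default_filename="a.docx")
```

After the call, `caller_kwargs` holds only the keys the function did not consume, which shows that `pop` removes each handled option from the dictionary.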