Unstructured Loader

UnstructuredLoader

Bases: BaseLoader

Reads documents of any type. Subclasses must explicitly define cls.mime_types.

Notes

After a document has been read and partitioned with the Unstructured library, do not use ele.id as a database primary key, because the IDs can be duplicated.
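
One way to follow this note is to give every stored row its own surrogate key and keep the element ID as an ordinary, non-unique column. The sketch below uses only the standard-library `sqlite3` and `uuid` modules. The table layout, the `store_elements` helper, and the `FakeElement` class are hypothetical stand-ins for illustration, not part of TFRobot or Unstructured.

```python
import sqlite3
import uuid
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a partitioned element (not the real class)."""

    id: str
    text: str


def store_elements(conn: sqlite3.Connection, elements: list) -> int:
    """Insert elements, using a generated surrogate key as the primary key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS elements ("
        "row_id TEXT PRIMARY KEY, element_id TEXT NOT NULL, text TEXT NOT NULL)"
    )
    # The primary key comes from uuid4, never from the element id.
    rows = [(uuid.uuid4().hex, e.id, e.text) for e in elements]
    conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
# Two elements share the same id; both rows are stored without a key conflict.
inserted = store_elements(conn, [FakeElement("abc", "first"), FakeElement("abc", "second")])
stored = conn.execute("SELECT COUNT(*) FROM elements WHERE element_id = 'abc'").fetchone()[0]
```

The element ID stays queryable as a regular column, while uniqueness is carried by `row_id` alone.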

partition_kwargs `property`

partition_kwargs: dict

Special configuration parameters applied when the user invokes partitioning.

Returns:

Type	Description
`dict`	Keyword arguments applied to the partition call.
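
In the `load` source listing, call-time keyword arguments are merged over this property's value with `dict.update`, so a key passed at call time overrides the loader's own setting. A minimal sketch of that precedence with plain dictionaries follows. The helper name `merge_partition_kwargs` and the keys `strategy` and `languages` are illustrative assumptions, not documented defaults of this loader.

```python
from typing import Any, Dict


def merge_partition_kwargs(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which call-time overrides win over loader defaults."""
    merged = dict(defaults)  # copy, so the defaults themselves are not mutated
    merged.update(overrides)
    return merged


defaults = {"strategy": "fast", "languages": ["eng"]}  # hypothetical defaults
merged = merge_partition_kwargs(defaults, {"strategy": "hi_res"})
```

Copying before the update is a choice made in this sketch. The source listing updates the dictionary returned by the property directly, so whether repeated calls share state depends on how a subclass builds that dictionary.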

load

load(path_or_uri: str, *, file: bytes | IO[bytes] | None = None, content_type: Optional[str] = None, **kwargs: Any) -> Document

Load a document from the given URI or file object and return a Document.

Parameters:

Name	Type	Description	Default
`path_or_uri`	`str`	File path or URI of the document.	required
`file`	`bytes \| IO[bytes] \| None`	File content as bytes or an IO[bytes] object. When given, it is used instead of fetching the content from path_or_uri.	`None`
`content_type`	`Optional[str]`	The content type (MIME type) of the document; auto-detected if not provided.	`None`
`**kwargs`	`Any`	Additional keyword arguments passed to the underlying partition function.	`{}`
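
The way the source listing picks its byte source from `file` and `path_or_uri` can be sketched in isolation. In the sketch below, `to_file_obj` is a hypothetical helper, and `open_uri` is a stand-in callable for whatever the loader uses to open the URI; neither is a TFRobot API.

```python
from io import BytesIO
from typing import IO, Callable, Union


def to_file_obj(file: Union[bytes, IO[bytes], None], open_uri: Callable[[], IO[bytes]]) -> IO[bytes]:
    """Normalize the `file` argument to a binary file object."""
    if isinstance(file, bytes):
        return BytesIO(file)  # raw bytes are wrapped in an in-memory stream
    if file is not None:
        return file  # an existing stream is used as-is
    return open_uri()  # nothing supplied: fall back to opening the URI


from_bytes = to_file_obj(b"abc", lambda: BytesIO(b"unused"))
existing = BytesIO(b"stream")
from_stream = to_file_obj(existing, lambda: BytesIO(b"unused"))
from_uri = to_file_obj(None, lambda: BytesIO(b"from-uri"))
```

Passing the URI opener as a callable keeps the fallback lazy, so no I/O happens when the caller already supplied the content.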

Returns:

Name	Type	Description
`Document`	`Document`	The loaded document.

Raises:

Type	Description
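`ValueError`	If neither or both of path_or_uri and file are provided. The source listing also raises it when unique_element_ids is disabled, when the file type cannot be detected, or when the detected type is not in mime_types.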
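
The file-type checks in the source listing raise `ValueError` in two situations: the type cannot be recognized, or it is not among the types the loader supports. The sketch below mirrors that flow with the standard-library `mimetypes` module, which guesses from the file name only. That is a deliberate simplification; the loader itself detects the type from the file content through Unstructured. The function name `resolve_mime_type` and the `allowed` parameter are hypothetical.

```python
import mimetypes
from typing import Optional, Set


def resolve_mime_type(filename: str, content_type: Optional[str], allowed: Set[str]) -> str:
    """Return the MIME type to use, or raise ValueError if unknown or unsupported."""
    # An explicit content type wins; otherwise guess from the file name.
    mime = content_type or mimetypes.guess_type(filename)[0]
    if mime is None:
        raise ValueError("file type could not be recognized")
    if mime not in allowed:
        raise ValueError(f"unsupported file type: {mime}")
    return mime


explicit = resolve_mime_type("upload", "application/pdf", {"application/pdf"})
```

A caller that already knows the type can pass it explicitly and skip detection, which matches the role of `content_type` in `load`.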

Source code in tfrobot/utils/document_loaders/unstructured.py

def load(
    self,
    path_or_uri: str,
    *,
    file: bytes | IO[bytes] | None = None,
    content_type: Optional[str] = None,
    **kwargs: Any,
) -> Document:
    """
    Load a document from the given URI or file object and return a Document.

    Args:
        path_or_uri: File path or URI of the document (required).
        file: File content as bytes or IO[bytes] (mutually exclusive with path_or_uri).
        content_type: The content type (MIME type), auto-detected if not provided.
        **kwargs: Additional keyword arguments passed to the partition function.

    Returns:
        Document: The loaded document.

    Raises:
        ValueError: If neither or both path_or_uri and file are provided.
    """
    from unstructured.file_utils.filetype import detect_filetype
    from unstructured.partition.auto import partition

    uri = convert_to_file_url(path_or_uri)
    filename = kwargs.pop("filename", get_filename_from_uri(uri))
    # Handle file object
    file_obj = BytesIO(file) if isinstance(file, bytes) else file if file is not None else self.get_file_obj(uri)

    unique_element_ids = kwargs.pop("unique_element_ids", True)  # whether to use unique element IDs
    kwargs.pop("url", None)  # TFRobot's partition wrappers always parse from an IO stream; passing url as well makes Unstructured raise an error.
    partition_kwargs = self.partition_kwargs
    partition_kwargs.update(kwargs)
    if not unique_element_ids:
        # Message: "TFRobot must enable unique element IDs to ensure documents are processed correctly."
        raise ValueError("TFRobot必须启用唯一元素ID以确保正确处理文档。")

    # Detect file type
    if not content_type:
        # Detect using file object
        file_obj.seek(0)
        if not (ft := detect_filetype(file=file_obj, metadata_file_path=filename)):
            # Message: "Unable to identify the file type"
            raise ValueError("无法识别文件类型")
        else:
            f_type = TFFileType.from_mime_type(ft.mime_type)
    else:
        f_type = TFFileType.from_mime_type(content_type)

    if f_type is None:
        # Message: "File type was not correctly identified"
        raise ValueError("未正确识别文件类型")
    elif hasattr(self, "mime_types") and f_type not in self.mime_types:
        # Message: "Unsupported file type: {f_type}"
        raise ValueError(f"不支持的文件类型：{f_type}")

    # Perform partitioning
    # Use file object
    file_obj.seek(0)
    els_raw = partition(file=file_obj, unique_element_ids=True, content_type=content_type, **partition_kwargs)

    if self.element_filter:
        els_raw = self.element_filter(els_raw)
    # Deduplicate by element id here and drop empty elements.
    # Note: the database must NOT use this ID as a primary key!
    els = []
    seen_ids = set()  # ids that have already been encountered

    for e in els_raw:
        if e and e.id and str(e).strip():  # keep only valid, non-empty elements
            if e.id not in seen_ids:  # the id has not been seen before
                els.append(e)  # add the element to the result list
                seen_ids.add(e.id)  # record the id as seen

    all_els = [create_element_by_unstructured_element(e, filename=filename) for e in els]
    # If a chunk size is configured, adjust the chunks accordingly
    if self.min_chunk_size or self.max_chunk_size:
        all_els = self._adjust_chunk_size(all_els)
    return Document.from_elements(all_els, file_uri=AnyUrl(uri), file_type=f_type)
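
The filtering loop in the listing above keeps the first element for each ID and drops elements with an empty ID or blank text. A self-contained sketch of the same rule follows. `FakeElement` and `dedupe_elements` are hypothetical stand-ins for illustration; the real elements come from Unstructured's `partition`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeElement:
    """Hypothetical stand-in for an Unstructured element."""

    id: str
    text: str

    def __str__(self) -> str:
        return self.text


def dedupe_elements(elements: List[FakeElement]) -> List[FakeElement]:
    """Keep the first element per id; skip empty ids and blank text."""
    kept: List[FakeElement] = []
    seen_ids = set()
    for e in elements:
        if e is None or not e.id or not str(e).strip():
            continue  # invalid or empty element
        if e.id in seen_ids:
            continue  # a later duplicate of an id already kept
        seen_ids.add(e.id)
        kept.append(e)
    return kept


raw = [
    FakeElement("a", "one"),
    FakeElement("a", "duplicate id"),
    FakeElement("b", "   "),
    FakeElement("", "no id"),
    FakeElement("c", "three"),
]
kept_texts = [str(e) for e in dedupe_elements(raw)]
```

Because only the first occurrence of an ID survives, input order decides which of two colliding elements is kept.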
