Email Loader

EmailLoader

Bases: UnstructuredLoader

Class for loading email files as documents.

Inherits from UnstructuredLoader.

Attributes:

Name Type Description
mime_types ClassVar[set[Literal[EML]]]

The MIME types this loader handles, stored as a class variable: a set containing only FileType.EML. The attribute is frozen and cannot be modified.

load

load(path_or_uri: str, *, file: bytes | IO[bytes] | None = None, content_type: Optional[str] = None, **kwargs: Any) -> Document

Load a document from the given URI or file object and return a Document.

Parameters:

Name Type Description Default
path_or_uri str

File path or URI of the document (required).

required
file bytes | IO[bytes] | None

File content as bytes or an IO[bytes] object (mutually exclusive with path_or_uri).

None
content_type Optional[str]

The content type (MIME type) of the document; auto-detected if not provided.

None
**kwargs Any

Additional keyword arguments passed to the underlying partition function.

{}

Returns:

Name Type Description
Document Document

A Document object containing the content of the loaded document.

Raises:

Type Description
ValueError

If neither or both of path_or_uri and file are provided. The source below also raises it when unique_element_ids is passed as False.
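The file parameter accepts either raw bytes or a binary stream. As a minimal sketch of preparing such input, the snippet below builds a small .eml payload with Python's standard email package. The sample addresses and subject are made up, and the commented-out call at the end is hypothetical: the import path is inferred from the source location shown on this page, and a no-argument constructor is an assumption.

```python
from email.message import EmailMessage
from io import BytesIO


def build_sample_eml() -> bytes:
    """Build a small email message and return it as raw .eml bytes."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Quarterly report"
    msg.set_content("Hi Bob,\n\nThe report is ready.\n")
    return msg.as_bytes()


raw = build_sample_eml()   # bytes, suitable for the `file` parameter
stream = BytesIO(raw)      # an IO[bytes] object is accepted as well

# Hypothetical usage (import path and constructor are assumptions):
# from tfrobot.utils.document_loaders.email import EmailLoader
# doc = EmailLoader().load("report.eml", file=raw)
```

Passing in-memory bytes this way can be convenient in tests, since no file has to exist on disk.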

Source code in tfrobot/utils/document_loaders/email.py
def load(
    self,
    path_or_uri: str,
    *,
    file: bytes | IO[bytes] | None = None,
    content_type: Optional[str] = None,
    **kwargs: Any,
) -> Document:
    """
    Load a document from the given URI or file object and return a Document.

    Args:
        path_or_uri: File path or URI of the document (required).
        file: File content as bytes or an IO[bytes] object (mutually exclusive with path_or_uri).
        content_type: The content type (MIME type) of the document; auto-detected if not provided.
        **kwargs: Additional keyword arguments passed to the underlying partition function.

    Returns:
        Document: A Document object containing the content of the loaded document.

    Raises:
        ValueError: If neither or both of path_or_uri and file are provided.
    """
    from unstructured.partition.email import partition_email

    uri = convert_to_file_url(path_or_uri)
    # Handle the file object: wrap bytes, use a given stream as-is, otherwise open the URI
    file_obj = BytesIO(file) if isinstance(file, bytes) else file if file is not None else self.get_file_obj(uri)
    unique_element_ids = kwargs.pop("unique_element_ids", True)  # whether to use unique element IDs
    # TFRobot's partition wrappers always parse from an IO stream, so filename and url
    # are popped from kwargs; passing them on as well would make Unstructured raise an error.
    filename = kwargs.pop("filename", get_filename_from_uri(uri))
    kwargs.pop("url", None)
    if not unique_element_ids:
        # Message: "TFRobot must enable unique element IDs to ensure documents are processed correctly."
        raise ValueError("TFRobot必须启用唯一元素ID以确保正确处理文档。")

    els = partition_email(file=file_obj, unique_element_ids=unique_element_ids)  # noqa
    return Document.from_elements(
        [create_element_by_unstructured_element(e, filename=filename) for e in els],
        file_uri=AnyUrl(uri),
        file_type=TFFileType.EML,
    )
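The one-line conditional that selects the file object in the source above can be hard to read. The following standalone sketch spells out the same precedence: bytes are wrapped in a BytesIO, an existing stream is used as-is, and only when no file is given is the URI opened. The function name resolve_file_obj and the open_uri callback are hypothetical stand-ins (open_uri plays the role of self.get_file_obj(uri)); they are not part of TFRobot.

```python
from io import BytesIO
from typing import IO, Callable, Union


def resolve_file_obj(
    file: Union[bytes, IO[bytes], None],
    open_uri: Callable[[], IO[bytes]],
) -> IO[bytes]:
    """Pick the stream to parse: bytes first, then a given stream, then the URI."""
    if isinstance(file, bytes):
        # Raw bytes are wrapped so the parser always receives a stream
        return BytesIO(file)
    if file is not None:
        # An already-open binary stream is passed through unchanged
        return file
    # No file supplied: fall back to opening the resource behind the URI
    return open_uri()


def fake_open_uri() -> IO[bytes]:
    """Stand-in for the loader's URI opener, used only for illustration."""
    return BytesIO(b"content fetched from the URI")


from_bytes = resolve_file_obj(b"raw email bytes", fake_open_uri)
existing = BytesIO(b"already a stream")
from_stream = resolve_file_obj(existing, fake_open_uri)
from_uri = resolve_file_obj(None, fake_open_uri)
```

Written this way, it is easier to see that a supplied file takes precedence and the URI is only opened as a fallback.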