Schema.org Dataset Vocabulary: Technical Terms Explained

Introduction

　　This document provides technical background on the concepts of "data" and "dataset" within the Schema.org ecosystem. As a project and a vocabulary, Schema.org is fundamentally about data: its definition, characterization, description, and encoding. The core vocabulary includes types like Event, NewsArticle, Review, and Person, along with properties that describe and link instances of these types (e.g., the alumni property, which links an EducationalOrganization to a Person, with alumniOf as its inverse). However, complexities arise when the subject of description is itself a bundle of data, which calls for dedicated vocabulary for describing datasets and statistical aggregates.

Core Concepts: Describing Data with Schema.org

　　Beyond its primary role in describing real-world entities, Schema.org also provides dedicated terms for applications that publish, discover, or integrate various forms of data. This capability complements Schema.org's foundational nature as a collection of structured data schemas and coexists with numerous other data-centric standards and formats.

The `Dataset` Type and Related Vocabulary

　　When describing packaged collections of data—such as those found in scientific, scholarly, or governmental open-data repositories—Schema.org offers specific types:

Dataset: Describes a collection of data.
DataCatalog: Describes a larger collection or repository that contains multiple datasets.
DataDownload: Represents a dataset, in whole or in part, in a specific downloadable form.

　　Unlike typical Schema.org markup that describes webpage content, datasets described with this vocabulary can be in arbitrary formats (e.g., CSV files, digital images, or specialized scientific formats). This diversity reflects real-world complexity but also creates integration challenges for unified knowledge graphs like Wikidata and DataCommons.org. The Dataset vocabulary in Schema.org was originally based on DCAT (Data Catalog Vocabulary), a W3C standard for describing data catalogs and datasets, which itself utilized terms from Dublin Core and FOAF.
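
As a rough illustration of how the three types fit together, the sketch below builds a JSON-LD description of a dataset in Python. The property names includedInDataCatalog, distribution, encodingFormat, and contentUrl come from the Schema.org vocabulary rather than from the text above, and every name and URL is a hypothetical placeholder.

```python
import json

# All names and URLs below are hypothetical placeholders, not real resources.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example volcano records",
    "description": "Illustrative table of volcano names, locations, and elevations.",
    # The catalog or repository the dataset belongs to.
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example open-data catalog",
        "url": "https://data.example.org/",
    },
    # One entry per downloadable form; the payload itself can be any format.
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://data.example.org/volcanoes.csv",
        }
    ],
}

# Serialize to JSON-LD text, e.g. for embedding in a landing page.
dataset_jsonld = json.dumps(dataset, indent=2)
print(dataset_jsonld)
```

The markup describes the dataset; the CSV file it points to is not itself Schema.org data, which is the distinction the paragraph above draws.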

Statistical Data: `StatisticalPopulation` and `Observation`

　　For aggregating and integrating statistical observations about collections ("populations") of entities, Schema.org provides:

StatisticalPopulation: Represents a set of instances of a certain type that share common characteristics.
Observation: Represents a specific, measured data point about a StatisticalPopulation.

　　This approach, detailed in its proposal and overview document, emphasizes using Schema.org vocabulary to integrate information from multiple independent statistical datasets. It explains the content of statistical data using a shared vocabulary, as demonstrated on a large scale by DataCommons.org.
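
A minimal sketch of how a population and an observation about it might be written as JSON-LD follows. The property names populationType, observedNode, measuredProperty, measuredValue, and observationDate are assumptions drawn from the Schema.org term pages rather than from the text above, and should be checked against the current release. The identifier, date, and value are made up.

```python
import json

# Hypothetical identifier for the population of all volcanoes.
POPULATION_ID = "https://data.example.org/populations/volcanoes"

population = {
    "@context": "https://schema.org",
    "@type": "StatisticalPopulation",
    "@id": POPULATION_ID,
    # The type shared by every member of the population.
    "populationType": {"@id": "https://schema.org/Volcano"},
}

observation = {
    "@context": "https://schema.org",
    "@type": "Observation",
    "name": "Mean elevation of volcanoes (illustrative)",
    # Link the data point back to the population it describes.
    "observedNode": {"@id": POPULATION_ID},
    "measuredProperty": {"@id": "https://schema.org/elevation"},
    "measuredValue": 2200.0,  # made-up value, in meters
    "observationDate": "2020-01-01",  # made-up date
}

print(json.dumps([population, observation], indent=2))
```

Because both nodes use shared identifiers and a shared vocabulary, observations from independent sources can refer to the same population.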

Practical Distinctions and Examples

　　The distinction between these layers of abstraction is crucial. Consider data about volcanoes:

Entity-Level (Volcano): The https://schema.org/Volcano type is used to provide direct information about a specific volcano (e.g., its name, location, elevation).
Dataset-Level (Dataset): The https://schema.org/Dataset type describes a collection of data about volcanoes, such as a CSV file or XML archive containing records for multiple volcanoes.
Statistical-Level (Observation): The Observation type can represent aggregate statistics (e.g., average elevation, total count) about a population of volcanoes defined by a StatisticalPopulation.
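
The relationship between the entity level and the statistical level can be sketched in a few lines of Python: individual records go in, and population-wide figures of the kind an Observation would carry come out. The records, the elevation_m key, and the summarize helper are hypothetical and only serve the illustration.

```python
from statistics import mean

# Entity level: one made-up record per volcano.
volcanoes = [
    {"@type": "Volcano", "name": "Example Peak A", "elevation_m": 1200.0},
    {"@type": "Volcano", "name": "Example Peak B", "elevation_m": 2400.0},
    {"@type": "Volcano", "name": "Example Peak C", "elevation_m": 3000.0},
]


def summarize(records):
    """Return statistical-level figures for a list of entity records."""
    elevations = [record["elevation_m"] for record in records]
    return {"count": len(records), "mean_elevation_m": mean(elevations)}


# Statistical level: aggregates about the whole population.
summary = summarize(volcanoes)
print(summary)
```

A Dataset description sits between the two: it would describe the file holding the three records, not any single volcano and not the aggregates.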

Related Standards and Technologies

　　Schema.org's data-related vocabulary interacts with and complements several other W3C and community specifications:

W3C CSV on the Web (CSVW): Provides a metadata format for describing CSV files.
RDF Data Cube Vocabulary: A standard for publishing multi-dimensional statistical data.
Dataset Publishing Language (DSPL 2.0): Combines Schema.org for dataset metadata with CSV files that represent code lists and statistical observations, offering a high-fidelity representation of datasets.

　　These technologies often rely on lower-level standards like JSON-LD, RDFa, Microdata, and XML, sharing a broadly RDF-like model for information representation. Furthermore, standards exist to "lift" factual data from various dataset formats into RDF statements using vocabularies like Schema.org:

R2RML: Maps data from relational databases (SQL) to RDF.
GRDDL: Extracts RDF from XML documents using XSLT transformations.
CSVW to RDF Mappings: Converts static tabular data (CSV) to RDF.
JSON-LD Context: Provides a mechanism for mapping JSON data structures to an RDF graph.
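
To make the idea of "lifting" concrete, the sketch below turns CSV rows into subject-predicate-object triples using Schema.org terms. It is a hand-rolled simplification, not an implementation of CSVW, R2RML, or any other standard listed above: the column-to-property mapping, the base IRI, and the sample data are hypothetical, values are left as plain strings, and attaching elevation directly to a Volcano is a shortcut for illustration.

```python
import csv
import io

SCHEMA = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from CSV column names to Schema.org property IRIs.
COLUMN_TO_PROPERTY = {
    "name": SCHEMA + "name",
    "elevation": SCHEMA + "elevation",
}

# Made-up sample data standing in for a published CSV file.
CSV_TEXT = "id,name,elevation\nv1,Example Peak A,1200\nv2,Example Peak B,2400\n"


def lift_rows(csv_text, base_iri):
    """Convert each CSV row into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = base_iri + row["id"]
        # State the entity type once per row.
        triples.append((subject, RDF_TYPE, SCHEMA + "Volcano"))
        # Emit one triple per mapped column.
        for column, prop in COLUMN_TO_PROPERTY.items():
            triples.append((subject, prop, row[column]))
    return triples


triples = lift_rows(CSV_TEXT, "https://data.example.org/volcano/")
for triple in triples:
    print(triple)
```

The standards above play the role of the mapping table here, expressed declaratively so that the same description can be shared and reused across tools.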

Conclusion

　　Schema.org provides a multi-layered framework for engaging with data on the web. While its primary strength lies in describing individual entities (Volcano, Person), it also offers vocabulary for describing collections of data as publishable artifacts (Dataset) and for representing aggregated statistical observations (StatisticalPopulation/Observation). Understanding the appropriate context for each layer (entity, dataset, or statistic) is key to using Schema.org effectively alongside related standards like DCAT, CSVW, and RDF Data Cube for data publishing and integration.
