Schema.org Dataset Vocabulary: A Complete Guide to the Technical Terms

2026/1/26
AI Summary (BLUF)

A technical overview of how Schema.org describes data and datasets, including the dedicated Dataset type.

Introduction

  This document provides technical background on the concepts of "data" and "dataset" within the Schema.org ecosystem. As a project and a vocabulary, Schema.org is fundamentally about data: its definition, characterization, description, and encoding. The core vocabulary includes types like Event, NewsArticle, Review, and Person, along with properties that describe and link instances of these types (e.g., the alumni property linking an EducationalOrganization to a Person). Complications arise, however, when the thing being described is itself a bundle of data, which calls for dedicated vocabulary for describing datasets and statistical aggregates.

Core Concepts: Describing Data with Schema.org

  Beyond its primary role in describing real-world entities, Schema.org also provides dedicated terms for applications that publish, discover, or integrate various forms of data. This capability complements Schema.org's foundational nature as a collection of structured data schemas and coexists with numerous other data-centric standards and formats.

The Dataset Type and Related Vocabulary

  For describing packaged collections of data, such as those found in scientific, scholarly, or governmental open-data repositories, Schema.org offers specific types:

  • Dataset: Describes a collection of data.
  • DataCatalog: Indicates a larger collection or repository containing multiple datasets.
  • DataDownload: Represents a dataset, or part of one, in a specific downloadable form.

  Unlike typical Schema.org markup that describes webpage content, datasets described with this vocabulary can be in arbitrary formats (e.g., CSV files, digital images, or specialized scientific formats). This diversity reflects real-world complexity but also creates integration challenges for unified knowledge graphs like Wikidata and DataCommons.org. The Dataset vocabulary in Schema.org was originally based on DCAT (Data Catalog Vocabulary), which itself utilized terms from Dublin Core and FOAF.
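
  As a minimal sketch of how the three types fit together, the Python snippet below builds a JSON-LD description of one dataset, the catalog that lists it, and a single CSV download. The function name describe_dataset, the example.org URL, and all of the literal values are hypothetical placeholders; only the Schema.org type and property names (Dataset, DataCatalog, DataDownload, includedInDataCatalog, distribution, encodingFormat, contentUrl) come from the vocabulary itself.

```python
import json


def describe_dataset(name, description, download_url, catalog_name):
    """Build a JSON-LD description of a dataset, its catalog, and one download.

    The helper and its parameters are illustrative, not part of Schema.org.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # includedInDataCatalog points at the repository that lists the dataset.
        "includedInDataCatalog": {"@type": "DataCatalog", "name": catalog_name},
        # distribution lists the concrete downloadable forms of the dataset.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": download_url,
            }
        ],
    }


# All values below are made up for illustration.
markup = describe_dataset(
    name="Example volcano records",
    description="Hypothetical table of volcano names, locations and elevations.",
    download_url="https://example.org/data/volcanoes.csv",
    catalog_name="Example Open Data Portal",
)
print(json.dumps(markup, indent=2))
```

  The printed JSON could be embedded in a page inside a script element of type application/ld+json, which is one common way to publish such markup.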

Statistical Data: StatisticalPopulation and Observation

  For aggregating and integrating statistical observations about collections ("populations") of entities, Schema.org provides:

  • StatisticalPopulation: Represents a set of instances of a certain type that share common characteristics.
  • Observation: Represents a specific, measured data point about a StatisticalPopulation.

  This approach, detailed in its proposal and overview document, emphasizes using Schema.org vocabulary to integrate information from multiple independent statistical datasets. The content of the statistical data is expressed in a shared vocabulary, as demonstrated at large scale by DataCommons.org.
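
  The pairing of a population with an observation about it can be sketched as follows. This is an illustration under assumptions rather than a definitive encoding: the property names populationType, observedNode, measuredProperty, measuredValue, and observationDate follow the original StatisticalPopulation/Observation proposal and should be checked against the current Schema.org release, since this part of the vocabulary has been revised over time. The helper make_observation, the example.org property IRI, and the numeric value are hypothetical.

```python
import json


def make_observation(population_type, measured_property, value, date):
    """Return a JSON-LD Observation about a StatisticalPopulation.

    Hypothetical helper; property names are assumed from the original proposal.
    """
    population = {
        "@type": "StatisticalPopulation",
        # The kind of entity the population is made of, given as a type IRI.
        "populationType": {"@id": f"https://schema.org/{population_type}"},
    }
    return {
        "@context": "https://schema.org",
        "@type": "Observation",
        "observedNode": population,
        "measuredProperty": {"@id": measured_property},
        "measuredValue": value,
        "observationDate": date,
    }


# The property IRI and the figure 3 are invented for this example.
obs = make_observation(
    population_type="Volcano",
    measured_property="https://example.org/vocab/count",
    value=3,
    date="2025-01-01",
)
print(json.dumps(obs, indent=2))
```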

Practical Distinctions and Examples

  The distinction between these layers of abstraction is crucial. Consider data about volcanoes:

  • Entity-Level (Volcano): The https://schema.org/Volcano type is used to provide direct information about a specific volcano (e.g., its name, location, elevation).
  • Dataset-Level (Dataset): A https://schema.org/Dataset type describes a collection of data about volcanoes, such as a CSV file or XML archive containing records for multiple volcanoes.
  • Statistical-Level (Observation): The Observation type can represent aggregate statistics (e.g., average elevation, total count) about a population of volcanoes defined by a StatisticalPopulation.
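
  The three levels above can be made concrete with a small Python sketch that starts from entity records, serialises them into a CSV payload (the kind of file a Dataset would describe), and then derives aggregate figures (the kind of value an Observation would carry). The function names, the column name elevation_m, and the two sample volcanoes are all hypothetical; the only vocabulary terms used are Volcano, name, geo, GeoCoordinates, and elevation.

```python
import csv
import io
import statistics


def volcano(name, elevation_m):
    """Entity level: one JSON-LD-style record for a single volcano."""
    return {
        "@type": "Volcano",
        "name": name,
        "geo": {"@type": "GeoCoordinates", "elevation": elevation_m},
    }


def to_csv(entities):
    """Dataset level: serialise many records into one CSV payload."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "elevation_m"])
    for entity in entities:
        writer.writerow([entity["name"], entity["geo"]["elevation"]])
    return buffer.getvalue()


def summarise(entities):
    """Statistical level: aggregate figures about the whole population."""
    elevations = [entity["geo"]["elevation"] for entity in entities]
    return {
        "count": len(elevations),
        "mean_elevation_m": statistics.mean(elevations),
    }


# Two invented records, purely for illustration.
records = [volcano("Volcano A", 1000), volcano("Volcano B", 3000)]
csv_text = to_csv(records)
summary = summarise(records)
print(csv_text)
print(summary)
```

  Each function output corresponds to a different subject of description: a single volcano, a file about volcanoes, and a statistic about the set of volcanoes.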

Related Standards and Technologies

  Schema.org's data-related vocabulary interacts with and complements several other W3C and community specifications:

  • W3C CSV on the Web (CSVW): Provides a metadata format for describing CSV files.
  • RDF Data Cube Vocabulary: A standard for publishing multi-dimensional statistical data.
  • Dataset Publishing Language (DSPL 2.0): Combines Schema.org for dataset metadata with CSV files that represent code lists and statistical observations, offering a high-fidelity representation of datasets.

  These technologies often rely on lower-level standards like JSON-LD, RDFa, Microdata, and XML, sharing a broadly RDF-like model for information representation. Furthermore, standards exist to "lift" factual data from various dataset formats into RDF statements using vocabularies like Schema.org:

  • R2RML: Maps data from relational databases (SQL) to RDF.
  • GRDDL: Extracts RDF from XML documents using XSLT transformations.
  • CSVW to RDF Mappings: Converts static tabular data (CSV) to RDF.
  • JSON-LD Context: Provides a mechanism for mapping JSON data structures to an RDF graph.
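
  To give a feel for what "lifting" means, the sketch below turns CSV rows into a JSON-LD document whose @context maps the column names onto Schema.org property IRIs. It is a hand-rolled illustration using only the Python standard library, not an implementation of CSVW, R2RML, or GRDDL, and it stops short of producing RDF triples: a JSON-LD processor would be needed for that step. The column names volcano_name and homepage, the lift function, and the sample row are hypothetical.

```python
import csv
import io
import json

# Maps the (hypothetical) CSV column names to Schema.org property IRIs.
# "@type": "@id" tells a JSON-LD processor to treat the value as an IRI.
CONTEXT = {
    "volcano_name": "https://schema.org/name",
    "homepage": {"@id": "https://schema.org/url", "@type": "@id"},
}


def lift(csv_text, row_type="https://schema.org/Volcano"):
    """Wrap each CSV row as a typed JSON-LD node under a shared context."""
    rows = csv.DictReader(io.StringIO(csv_text))
    graph = [{"@type": row_type, **row} for row in rows]
    return {"@context": CONTEXT, "@graph": graph}


# Invented sample data.
sample = "volcano_name,homepage\nVolcano A,https://example.org/a\n"
document = lift(sample)
print(json.dumps(document, indent=2))
```

  The row data is left untouched; all of the mapping lives in the context, which is the design idea behind using a JSON-LD context to give existing JSON an RDF interpretation.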

Conclusion

  Schema.org provides a multi-layered framework for engaging with data on the web. While its primary strength lies in describing individual entities (Volcano, Person), it also offers robust vocabulary for describing collections of data as publishable artifacts (Dataset) and for representing aggregated statistical observations (StatisticalPopulation/Observation). Understanding the appropriate context for each layer—entity, dataset, or statistic—is key to effectively leveraging Schema.org alongside related standards like DCAT, CSVW, and RDF Data Cube for comprehensive data publishing and integration.
