Data schema

From ScienceSource
Revision as of 16:51, 20 December 2018 by Charles Matthews (talk | contribs) (Property table: update per fallout of discussion of December 13)

This is the basic schema that will be used to store text-mining annotations in ScienceSource. See notes below.

Property table

Property code Property label Type of field Notes
P2 Wikidata item code External identifier Needs formatter URL set as$. Check documentation, see needed string property below.
P3 instance of Item Equivalent of P31 on Wikidata
P4 subclass of Item Equivalent of P279 on Wikidata
P6 preceding anchor point Item
P7 following anchor point Item
P8 distance to preceding Quantity
P9 distance to following Quantity
P10 character number Quantity Offset from the initial annotation point in the article. Not necessarily a robust figure, but adequate for the project. The text version saved as the article here will be the "SI standard".
P11 article text title String Not disambiguated (see P20).
P12 anchor point in Item Refers back to underlying article.
P13 preceding phrase String (1) Subject to a character limit. (2) Initially may contain some tags. (3) Initially not constrained to break at spaces, though for human readability it should not break in the middle of a word.
P14 following phrase String As for P13
P15 term found String Word or phrase from dictionary, mentioned in text and starting at anchor point
P16 dictionary name String Name of the ScienceSource dictionary; its version is tracked by date, using P17
P17 publication date Point in time For articles
P18 length of term found Quantity String length of term, pre-computed for use in offsets and constraint checking
P19 based on Item For an item that is instance of annotation, "based on" has as object the anchor point or annotation it is based on. Therefore this is a child-parent type of property, defining the tree of annotations growing out of a given anchor point. As a constraint, every annotation is required to have such a statement.
P20 ScienceSource article title String Identifies article, by disambiguated title, for human readability. Because of disambiguation, this title will not always coincide with other versions of the title, such as given by P11. There could also be a cross-namespace version of this property that was underlyingly based on "external identifier".
P22 time code Point in time MediaWiki UTC code, set by creation time for batch (approximate, needed for batch tagging)
P24 anchors Item Partial inverse property of P19.
P25 Page ID Quantity MediaWiki page identifier.
P26 is subject of a Wikidata triple with object Item Drug annotations can be linked to disease annotations when the text states that a P2175 statement on Wikidata holds. (Such statements may then be converted into annotations.)
P27 Wikidata property in a claimed triple String Identifies the Wikidata property in a claimed triple (default P2175)
P28 Wikidata subject item String Identifies the Wikidata subject in a claimed triple
P29 Wikidata object item String Identifies the Wikidata object in a claimed triple
P30 human has checked Item For fact-checking, could be used as a bot intermediate to an annotation, depending on implementation
P31 supersedes Item Expresses the dominance relation between reviews, for clinical purposes
P32 include only after Quantity For filtering annotations by restriction to part of an article, by offset
P33 include only before Quantity For filtering annotations by restriction to part of an article, by offset
P34 deprecation for reason Item For fine-grained analysis of reasoning that a review should fail MEDRS, in terms of publication type ontology
P35 passed MEDRS Point in time For recording with a time-stamp the acceptance of a "fact found" annotation by the MEDRS algorithm
P36 failed MEDRS Point in time For recording with a time-stamp the rejection of a "fact found" annotation by the MEDRS algorithm
P? formatter URL String Equivalent of P1630 on Wikidata. See mail thread on configuration.

P? (not yet defined)
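
As a rough illustration of how several of the properties above fit together, the following Python sketch models an anchor point as a plain record. It checks that P18 equals the string length of P15, and filters records by their P10 offset in the manner suggested by P32 and P33. The class `AnchorPoint`, its field names, both helper functions, and the example values are hypothetical. Treating the P32/P33 bounds as strict (exclusive) is also an assumption, since the table does not say.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnchorPoint:
    """Hypothetical in-memory view of one anchor point item."""
    character_number: int      # P10: offset in the saved article text
    preceding_phrase: str      # P13
    following_phrase: str      # P14
    term_found: str            # P15
    length_of_term_found: int  # P18


def length_is_consistent(point: AnchorPoint) -> bool:
    """P18 is described as the pre-computed string length of P15."""
    return point.length_of_term_found == len(point.term_found)


def filter_by_offset(points: List[AnchorPoint],
                     include_only_after: Optional[int] = None,
                     include_only_before: Optional[int] = None) -> List[AnchorPoint]:
    """Keep points whose P10 offset lies strictly inside the P32/P33 bounds.

    Assumption: both bounds are exclusive; a missing bound is ignored.
    """
    kept = []
    for point in points:
        offset = point.character_number
        if include_only_after is not None and offset <= include_only_after:
            continue
        if include_only_before is not None and offset >= include_only_before:
            continue
        kept.append(point)
    return kept


# Made-up example values, not taken from any ScienceSource article.
points = [
    AnchorPoint(120, "treated with ", " for fever", "aspirin", 7),
    AnchorPoint(980, "history of ", " in adults", "malaria", 7),
]
in_range = filter_by_offset(points, include_only_after=100, include_only_before=500)
```

Here `in_range` keeps only the first record, because its offset of 120 lies between the two bounds.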


  1. "Type of field" refers to the data types available on ScienceSource.
  2. This table now uses "P" for the property prefix. As of October 2018, a Phabricator thread on using another prefix is still active (T202676).
  3. The schema will be extended, in particular for checking annotations.
  4. The project will comply with the W3C Web Annotation Data Model of February 2017. The annotations will be stored here in a Wikibase site, so they are inherently available in RDF. In principle several dumps will be available from the site, such as an RDF dump of all annotations and other data here, and a dump of just the annotations pointing directly to the articles (i.e. none of the community-added annotations or the auxiliary data). By compliance with the standard we mean that the annotations will in principle be available in the W3C-recommended JSON format.
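
To make the last note more concrete, here is a minimal Python sketch of what one anchor point might look like when serialised in the W3C JSON format. It is not the project's actual export. The function name `to_web_annotation`, the `example.org` URLs, and the sample values are hypothetical. The mapping of P13/P15/P14 to a `TextQuoteSelector` (`prefix`/`exact`/`suffix`), and of P10 and P18 to a `TextPositionSelector` (`start`/`end`), is an assumption about how the schema could line up with the standard.

```python
import json


def to_web_annotation(annotation_id, article_url, term_found,
                      preceding_phrase, following_phrase, character_number):
    """Build a Web Annotation dict whose target selects the term found.

    Assumption: P10 is a zero-based character offset, so the term spans
    [start, start + len(term_found)) in the saved article text.
    """
    start = character_number
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "target": {
            "source": article_url,
            # Two selectors describing the same span in different ways.
            "selector": [
                {
                    "type": "TextQuoteSelector",
                    "exact": term_found,         # P15
                    "prefix": preceding_phrase,  # P13
                    "suffix": following_phrase,  # P14
                },
                {
                    "type": "TextPositionSelector",
                    "start": start,                   # P10
                    "end": start + len(term_found),   # P10 + P18
                },
            ],
        },
    }


# Placeholder identifiers and text, for illustration only.
example = to_web_annotation(
    annotation_id="https://example.org/annotation/1",
    article_url="https://example.org/article/demo",
    term_found="aspirin",
    preceding_phrase="treated with ",
    following_phrase=" for fever",
    character_number=120,
)
print(json.dumps(example, indent=2))
```

Carrying both selectors lets a consumer fall back on the quoted context if the character offset turns out not to be robust.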

Item table

Item code Item label Comments
Q2 anchor point Anchor points are where initial annotations hang off articles, and are first-class entities in the ScienceSource ontology. We use stand-off annotation, so nothing is actually inserted into articles. Notionally an anchor point is any place in the article text where you could put your cursor, for example between two letters. In practice anchor points will typically be between a space and an alphanumeric character.
Q4 article Wikibase indexation of the Article: namespace; also serves as initial anchor point for each article.
Q5 annotation An annotation must hang off (a) an anchor point, or (b) another annotation.
Q6 terminus Terminal marker defined uniformly for all articles. The actual final anchor point in an article will be the one linking to Q6, i.e. having a P7 statement with object Q6.
Q7 demo article item
Q8 demo anchor point item
Q9 demo annotation item
Q6818 dictionary item
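
The relationships among articles, anchor points, annotations and the terminus can be sketched in Python as two small traversals. One follows P7 ("following anchor point") from an article item until it reaches Q6. The other follows P19 ("based on") from an annotation up to the anchor point at the root of its tree. The dictionaries standing in for P7 and P19 statements, the function names, and item codes such as `Q100` are all hypothetical; only Q6 comes from the table above.

```python
TERMINUS = "Q6"  # terminal marker from the item table


def anchor_chain(following, article_item):
    """List the anchor points after the article item, in P7 order.

    `following` maps an item code to the object of its P7 statement.
    The article item itself serves as the initial anchor point.
    """
    chain = []
    seen = {article_item}
    current = following.get(article_item)
    while current is not None and current != TERMINUS:
        if current in seen:
            raise ValueError("cycle in P7 chain at " + current)
        seen.add(current)
        chain.append(current)
        current = following.get(current)
    if current != TERMINUS:
        raise ValueError("P7 chain does not reach the terminus Q6")
    return chain


def root_anchor(based_on, annotation, anchor_points):
    """Follow P19 statements upward until an anchor point is reached.

    `based_on` maps an annotation to the object of its P19 statement.
    Every annotation is required to have one, so a gap is an error.
    """
    seen = set()
    current = annotation
    while current not in anchor_points:
        if current in seen:
            raise ValueError("cycle in P19 tree at " + current)
        seen.add(current)
        if current not in based_on:
            raise ValueError(current + " lacks the required P19 statement")
        current = based_on[current]
    return current


# Made-up item codes: Q100 is an article, Q101 and Q102 its anchor points,
# Q201 and Q202 annotations.
following = {"Q100": "Q101", "Q101": "Q102", "Q102": TERMINUS}
based_on = {"Q201": "Q101", "Q202": "Q201"}

chain = anchor_chain(following, "Q100")
root = root_anchor(based_on, "Q202", anchor_points={"Q100", "Q101", "Q102"})
```

In this toy data `Q102` is the final anchor point, because its P7 statement has Q6 as object, and annotation `Q202` traces back through `Q201` to anchor point `Q101`.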