Additions to Wikidata

From ScienceSource
Revision as of 09:24, 7 November 2018 by Charles Matthews (talk | contribs) (cat)

An important part of the ScienceSource project is to build up metadata on Wikidata, relating to the "scholarly article" items there.

Proposed methods for further additions

The most useful data additions for the project are licenses, publication types and main subjects.

Licenses

P275 on Wikidata


Candidate number | Method | Comments
1 | Identify journals with a licensing policy, taking into account the time periods during which a particular license applied. |
2 | Derive information from PubMed Central pages, where the pages link to pages about the license that applies. | Model code is in the Oravrattas query on GitHub.
3 | Prepare lists of interesting DOIs or other identifiers, and then use the Unpaywall API. | There is also the Dissemin site. There is apparently an issue that Unpaywall may give a class of license such as CC BY-SA, but not the version. This does not affect whether we can download the paper.
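
Method 3 could be sketched roughly as below. This is a minimal illustration, not the project's own code: it assumes the Unpaywall v2 REST endpoint, which takes a DOI in the path and a contact email as a query parameter, and it reads the `license` field of `best_oa_location`. The function names and the sample record are hypothetical. The network call is defined but not run, so the example works offline.

```python
import json
import urllib.parse
import urllib.request

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI (email is required by the API)."""
    query = urllib.parse.urlencode({"email": email})
    return UNPAYWALL_BASE + urllib.parse.quote(doi, safe="/") + "?" + query


def extract_license(record):
    """Return the license of the best open-access location, or None."""
    location = record.get("best_oa_location") or {}
    return location.get("license")


def fetch_license(doi, email):
    """Look up one DOI over the network (not exercised in this sketch)."""
    with urllib.request.urlopen(unpaywall_url(doi, email)) as response:
        return extract_license(json.load(response))


# Hypothetical record, shaped like an Unpaywall response, for illustration.
sample = {
    "doi": "10.1234/example",
    "is_oa": True,
    "best_oa_location": {"license": "cc-by-sa"},
}
print(extract_license(sample))
```

As noted in the comments column, the value returned is a license class without a version, so it may not be enough on its own to choose the exact object item for a P275 statement.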

Publication type

It is unclear at present whether to use "instance of" (P31) with an object item, or "genre" (P136).

Candidate number | Method | Comments
1 | Extract from PubMed pages | Many pages carry "Review", an undifferentiated term that does not equate to systematic review.
2 | Use query techniques in MEDLINE | This PetScan query finds article talk pages on English Wikipedia that carry the template Template:Reliable sources for medical articles. That template creates links to search terms in PubMed (and other sites) that can find relevant review publications, with side conditions on publication date, open access and systematic reviews.
3 | Use sites such as Tripdatabase | The same template creates search terms in the Trip clinical search engine at www.tripdatabase.com.
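
The kind of PubMed search used in method 2 can also be built programmatically. The sketch below assumes the NCBI E-utilities `esearch` endpoint and two PubMed search tags, `review[pt]` (publication type) and `systematic[sb]` (the systematic review subset). The function name and the topic are hypothetical, and the search terms actually generated by the template may differ.

```python
import urllib.parse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def review_search_url(topic, min_year=None, max_year=None, systematic_only=False):
    """Build an esearch URL for reviews on a topic, optionally date-limited."""
    publication_filter = "systematic[sb]" if systematic_only else "review[pt]"
    params = {
        "db": "pubmed",
        "term": f"({topic}) AND {publication_filter}",
        "retmode": "json",
    }
    # E-utilities expects mindate and maxdate to be supplied together.
    if min_year is not None and max_year is not None:
        params.update(
            {"datetype": "pdat", "mindate": str(min_year), "maxdate": str(max_year)}
        )
    return ESEARCH + "?" + urllib.parse.urlencode(params)


print(review_search_url("malaria", 2013, 2018, systematic_only=True))
```

Separating the two filters reflects the comment on method 1: "Review" alone does not distinguish systematic reviews.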

Main subject

P921 on Wikidata


Candidate number | Method | Comments
1 | Text mining | Can be applied to title, abstract or full text. Not known to be effective in terms of precision.
2 | Extract from PubMed | There is a technical obstacle to finding MeSH identifiers this way. Once found, those identifiers can be translated in cases where Wikidata has them, by a reverse search. The recall for this process can be improved by using mix'n'match to increase the number of MeSH IDs in Wikidata.
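
The reverse search in method 2 might look like the sketch below. It assumes the MeSH identifiers are descriptor IDs stored on Wikidata under P486, and that results come back in the standard SPARQL JSON format. The function names and the sample identifiers are hypothetical. Identifiers that return no item are the ones that mix'n'match work could help with.

```python
def mesh_lookup_query(mesh_ids):
    """Build a SPARQL query mapping MeSH descriptor IDs to Wikidata items."""
    values = " ".join(f'"{mesh_id}"' for mesh_id in mesh_ids)
    return (
        "SELECT ?mesh ?item WHERE {\n"
        f"  VALUES ?mesh {{ {values} }}\n"
        "  ?item wdt:P486 ?mesh .\n"
        "}"
    )


def unmatched(mesh_ids, bindings):
    """Return the IDs with no Wikidata item, given the result bindings."""
    found = {row["mesh"]["value"] for row in bindings}
    return [mesh_id for mesh_id in mesh_ids if mesh_id not in found]


# Sample identifiers and a hand-written result row, for illustration only.
ids = ["D008288", "D014376"]
rows = [{"mesh": {"value": "D008288"}, "item": {"value": "placeholder"}}]
print(mesh_lookup_query(ids))
print(unmatched(ids, rows))
```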

Additions to focus list

P5008 on Wikidata

Additions to the ScienceSource focus list aren't metadata additions as such, but the reasons to add an item to that list typically depend on metadata considerations. For example, the filtering of the focus list that gives rise to a download of papers onto this wiki necessarily requires license information. The most common way to handle lists of articles is to manipulate lists of identifiers such as DOIs or PubMed IDs.

Candidate number | Method | Comments
1 | Starting from articles carrying the Template:MEDRS on English Wikipedia: (i) extract DOIs from the citations in the article; (ii) translate the DOIs into item numbers on Wikidata; and (iii) add that list to the focus list, using QuickStatements, possibly after some filtering. | The general comment about filtering as in (iii) is that it is supposed to be carried out with finesse, not excluding by premature marginalisation some articles that are less mainstream. This is an aspect of the "systematic bias" issue that is on the project agenda. There are several possibilities for the translation in (ii): (a) lookup from a dump; (b) requests to the Resolver tool; (c) batch SPARQL queries using "values" statements.
2 | Similar idea, but starting with articles neglected by some criterion, for example low number of visits measured by the treeviews tool. | Ditto
3 | Search by journal, i.e. P1433, and filter. | Some open access journals will give a large number of papers, and filtering by publication type is then going to be needed, to avoid having less relevant material on the list. This is without prejudice to the systematic bias considerations above.
4 | Similar idea to 1 and 2, but using top 100 medical articles by some criterion. | Same techniques, but such additions would presumably reinforce systematic bias. A fallback option to build up the list.
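
Steps (ii) and (iii) of method 1 could be sketched as follows, taking option (c) for the translation. The sketch assumes that DOIs are stored on Wikidata under P356 in upper case, and that QuickStatements accepts tab-separated item, property and value lines. The function names are hypothetical, and `Q_FOCUS_LIST` is a placeholder rather than the real item number of the focus list.

```python
def doi_lookup_query(dois):
    """Build a SPARQL query that maps a batch of DOIs to Wikidata items."""
    # Assumption: Wikidata holds DOI values in upper case.
    values = " ".join(f'"{doi.upper()}"' for doi in dois)
    return (
        "SELECT ?doi ?item WHERE {\n"
        f"  VALUES ?doi {{ {values} }}\n"
        "  ?item wdt:P356 ?doi .\n"
        "}"
    )


def chunked(sequence, size):
    """Yield successive batches, so that each query stays a manageable size."""
    for start in range(0, len(sequence), size):
        yield sequence[start:start + size]


def quickstatements_lines(item_ids, focus_list_qid):
    """Produce one tab-separated line per item, adding it to the focus list."""
    return [f"{item_id}\tP5008\t{focus_list_qid}" for item_id in item_ids]


# Hypothetical DOIs and item numbers, for illustration only.
for batch in chunked(["10.1234/abc", "10.5678/xyz", "10.9999/pqr"], 2):
    print(doi_lookup_query(batch))
print("\n".join(quickstatements_lines(["Q1", "Q2"], "Q_FOCUS_LIST")))
```

Any filtering would sit between the two steps, applied to the list of item numbers before the QuickStatements lines are generated.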

Other needs

Retraction data, precedence of Cochrane reviews over others, and a way of filtering out predatory publishers (presumably they are not even welcome on Wikidata).

Focus list

As of August 2018, participation in the ScienceSource project takes place on Wikidata, through the focus list at

https://www.wikidata.org/wiki/Wikidata:ScienceSource_focus_list

See

https://www.wikidata.org/wiki/Wikidata%3AScienceSource_focus_list%2FMain_subject_needed

for a daily update on the focus list, in terms of article items lacking a main subject.
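
A list of this kind could in principle be obtained with a query along the following lines. This is a sketch under stated assumptions, not a description of how the report page is actually produced: it assumes focus list membership is recorded with P5008 and main subject with P921, and the function name and the `Q_FOCUS_LIST` placeholder are hypothetical.

```python
def missing_main_subject_query(focus_list_qid):
    """Build a SPARQL query for focus list items with no main subject."""
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P5008 wd:{focus_list_qid} .\n"
        "  FILTER NOT EXISTS { ?item wdt:P921 ?subject . }\n"
        "}"
    )


# Q_FOCUS_LIST stands in for the real item number of the focus list.
print(missing_main_subject_query("Q_FOCUS_LIST"))
```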

As of 13 August, there were 2957 articles on the whole list. On 14 August, 729 articles needed a "main subject". For the purposes of the project, a disease main subject is particularly valuable.

As of 20 August, there were 3449 articles on the whole list, and 598 needing "main subject".

Project metrics

  • [1] for 11 June 2018
  • [2] for 12 June 2018
  • [3] for 13 June 2018
  • [4] for 14 June 2018
  • [5] for 18 June 2018
  • [6] for 19 June 2018

The MeSH Disease catalog on mix'n'match was completed on 19 June. Thanks to the others involved.


Neglected disease analytics


MeSH Code

Health specialty

Main subject

  • [12] run 14 August P921
  • [13] run 15 August P921
  • [14] 14 to 21 August, QuickStatementsBot adding P921 to focus list items.