Abstract: The public sharing of primary research datasets potentially benefits the research community but is not yet common practice. In this pilot study, we analyzed whether data sharing frequency was associated with funder and publisher requirements, journal impact factor, or investigator experience and impact. Across 397 recent biomedical microarray studies, we found investigators were more likely to publicly share their raw dataset when their study was published in a high-impact journal and when the first or last authors had high levels of career experience and impact. We estimate the USA's National Institutes of Health (NIH) data sharing policy applied to 19% of the studies in our cohort; being subject to the NIH data sharing plan requirement was not found to correlate with increased data sharing behavior in multivariate logistic regression analysis. Studies published in journals that required a database submission accession number as a condition of publication were more likely to share their data, but this trend was not statistically significant. These early results will inform our ongoing larger analysis, and hopefully contribute to the development of more effective data sharing initiatives.
Notes: Earlier version presented at ASIS&T and ISSI Pre-Conference: Symposium on Informetrics and Scientometrics 2009.
Raw data: http://www.researchremix.org/wordpress/wp-content/uploads/2009/09/Piwowar_Metrics2009_rawdata.csv
Statistics file: http://www.researchremix.org/wordpress/wp-content/uploads/2009/09/Piwowar_Metrics2009_statistics.R
Abstract: Sharing biomedical research and health care data is important but difficult. Recognizing this, many initiatives facilitate, fund, request, or require researchers to share their data. These initiatives address the technical aspects of data sharing, but rarely focus on incentives for key stakeholders. Academic health centers (AHCs) have a critical role in enabling, encouraging, and rewarding data sharing. The leaders of medical schools and academic-affiliated hospitals can play a unique role in supporting this transformation of the research enterprise. We propose that AHCs can and should lead the transition towards a culture of biomedical data sharing.
Abstract: Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available.
We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.
Notes: Presentation slides: http://www.slideshare.net/hpiwowar/presentations; Raw bibliometric data used in the analysis (combining data extracted from Thomson ISI Web of Science, PubMed, the Ntzani and Ioannidis 2003 Lancet paper, and the author's own investigations) are at http://www.researchremix.org/data/PLoSONE2007%20Piwowar%20Data.zip
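The abstract above reports a percentage increase from a linear regression without stating the scale of the outcome. As a minimal sketch, assuming citation counts were modelled on a natural-log scale (an assumption made here for illustration, not something the abstract confirms), a coefficient on the "data shared" indicator converts to a percentage change as follows. The function names are hypothetical.

```python
import math

def percent_change_from_log_coef(beta):
    """Percent change in the outcome implied by a coefficient on a log-scale outcome."""
    return (math.exp(beta) - 1.0) * 100.0

def log_coef_from_percent_change(pct):
    """Inverse conversion: the log-scale coefficient that implies a given percent change."""
    return math.log(1.0 + pct / 100.0)

# Under the log-scale assumption, a 69% increase corresponds to a
# coefficient of about 0.52 on the data-sharing indicator.
beta_shared = log_coef_from_percent_change(69.0)
print(round(beta_shared, 2))
print(round(percent_change_from_log_coef(beta_shared), 1))
```

The conversion matters because a log-scale coefficient is multiplicative: the same coefficient implies the same relative gain for a rarely cited paper and a heavily cited one.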
Abstract: The public sharing of primary research datasets potentially benefits the research community but is not yet common practice. In this pilot study, we analyzed whether data sharing frequency was associated with funder and publisher requirements, journal impact factor, or investigator experience and impact. Across 397 recent biomedical microarray studies, we found investigators were more likely to publicly share their raw dataset when their study was published in a high-impact journal, when their study was published in a journal with an enforceable data-sharing requirement, and when the first or last authors had high levels of career experience and impact. We estimate the NIH data sharing policy applied to only 19% of the studies in our cohort; being subject to the NIH data sharing plan requirement was not found to correlate with increased data sharing behavior in multivariate logistic regression analysis. Studies published in journals that required a database submission accession number as a condition of publication were more likely to share their data, but this trend was not statistically significant. These early results will inform our ongoing larger analysis, and hopefully contribute to the development of more effective data sharing initiatives.
Notes: Raw data: http://www.researchremix.org/wordpress/wp-content/uploads/2009/09/Piwowar_Metrics2009_rawdata.csv
Statistics file: http://www.researchremix.org/wordpress/wp-content/uploads/2009/09/Piwowar_Metrics2009_statistics.R
Presentation: http://www.slideshare.net/hpiwowar/metrics2009-piwowar-presentation-20091016key
Abstract: Science progresses by building upon previous research. Progress can be most rapid, efficient, and focused when raw datasets from previous studies are available for reuse. To facilitate this practice, funders and journals have begun to request and require that investigators share their primary datasets with other researchers. Unfortunately, it is difficult to evaluate the effectiveness of these policies. This study aims to develop foundations for evaluating data sharing and reuse decisions in the biomedical literature by developing tools to answer the following research questions, within the context of biomedical gene expression datasets: What is the prevalence of biomedical research data sharing? Biomedical research data reuse? What features are most associated with an investigator’s decision to share or reuse a biomedical research dataset? Does sharing or reusing data contribute to the impact of a research article, independently of other factors? What do the results suggest for developing efficient, effective policies, tools, and initiatives for promoting data sharing and reuse? I suggest a novel approach to identifying publications that share and reuse datasets, through the application of natural language processing techniques to the full text of primary research articles. Using these classifications and extracted covariates, univariate and multivariate analysis will assess which features are most important to data sharing and reuse prevalence, and also estimate the contribution that sharing data and reusing data make to a publication’s research impact. I hope the results will inform the development of effective policies and tools to facilitate this important aspect of scientific research and information exchange.
Abstract: Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.
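A minimal sketch of the regular-expression step described above, scored with precision and recall. The patterns and the labelled sentences are hypothetical and are not taken from the study; the machine learning stage the abstract mentions is not shown.

```python
import re

# Hypothetical patterns for dataset-sharing declarations; illustrative only.
SHARING_PATTERNS = [
    re.compile(r"\b(deposited|submitted)\b.{0,80}\b(GEO|Gene Expression Omnibus|ArrayExpress)\b", re.I),
    re.compile(r"\baccession (number|no\.?)s?\b.{0,40}\bGSE\d+", re.I),
]

def declares_sharing(text):
    """Return True if any pattern matches the article text."""
    return any(p.search(text) for p in SHARING_PATTERNS)

def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy hand-labelled sentences: (text, dataset actually shared?)
ARTICLES = [
    ("Microarray data were deposited in the Gene Expression Omnibus.", True),
    ("Data are available under accession number GSE1234.", True),
    ("Raw files are available from the authors on request.", True),    # shared, but no pattern matches
    ("We downloaded datasets previously submitted to GEO by others.", False),  # reuse, wrongly flagged
    ("Expression was measured by quantitative PCR.", False),
]

predicted = [declares_sharing(text) for text, _ in ARTICLES]
actual = [shared for _, shared in ARTICLES]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

The fourth toy sentence shows why patterns alone are imprecise: a statement about reusing someone else's deposit looks much like a statement about making one.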
Abstract: Background: Dataset submissions are growing exponentially. Links between dataset submissions and primary literature that describe the data collection are useful for many reasons: rich documentation, proper attribution, improved information retrieval, and enhanced text/data integration for analysis. Unfortunately, many database submissions do not include primary citation links, as database submissions are often made prior to publication. We suggest that automated tools can be developed to help identify links between dataset submissions and the primary literature. These tools require full text to differentiate cases of data sharing from data reuse and other contexts. In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby enabling the querying of all reports in PubMed Central.
Methods: We trained machine learning tree and rule-based classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from NCBI’s Gene Expression Omnibus (GEO) database submission records as the binary output class. We manually combined and simplified the classifier trees and rules to create a query compatible with the interface for PubMed Central.
Results: The query identified 40% of non-OA articles with dataset submission links from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged to include statements of dataset deposit despite having no link from the database (applicable precision).
Conclusion: We hope this work inspires future enhancements, and highlights the opportunities for simple full-text queries in PubMed Central given the mandated influx of NIH-funded research reports.
Notes: Presentation slides: http://www.slideshare.net/hpiwowar/presentations
Evaluation data from manual curation can be found at http://www.google.com/notebook/public/05528518921351292683/BDQkmSwoQ0IGz6pUj
Abstract: Sharing data is a tenet of science, yet commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. The purpose of this study is to understand the current state of data sharing policies within journals, the features of journals that are associated with the strength of their data sharing policies, and whether the strength of data sharing policies impacts the observed prevalence of data sharing. Methods: We investigated these relationships with respect to gene expression microarray data in the journals that most often publish studies about this type of data. We measured data sharing prevalence as the proportion of papers with submission links from NCBI’s Gene Expression Omnibus (GEO) database. We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access). Results: Of the 70 journal policies, 53 made some mention of sharing publication-related data within their Instructions to Authors statements. Of the 40 policies with a data sharing policy applicable to gene expression microarrays, we classified 17 as weak and 23 as strong (strong policies required an accession number from database submission prior to publication). Existence of a data sharing policy was associated with the type of journal publisher: 46% of commercial journals had a data sharing policy, compared to 82% of journals published by an academic society. All five of the open-access journals had a data sharing policy.
Policy strength was associated with impact factor: the journals with no data sharing policy, a weak policy, and a strong policy had respective median impact factors of 3.6, 4.9, and 6.2. Policy strength was positively associated with measured data sharing submission into the GEO database: the journals with no data sharing policy, a weak policy, and a strong policy had median data sharing prevalence of 8%, 20%, and 25%, respectively. Conclusion: This review and analysis begins to quantify the relationship between journal policies and data sharing outcomes. We hope it contributes to assessing the incentives and initiatives designed to facilitate widespread, responsible, effective data sharing.
Notes: Presentation slides: http://www.slideshare.net/hpiwowar/presentations
Archived Instructions to Authors statements are at http://www.researchremix.org/data/ELPUB2008%20Piwowar%20InstructionsForAuthors.zip
Data is at http://www.researchremix.org/data/ELPUB2008%20Piwowar%20Data.zip
Statistical analysis code (r script) is at http://www.researchremix.org/data/ELPUB2008%20Piwowar%20Stats.r
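The group comparison reported in the results above (median data sharing prevalence for journals with no, weak, and strong policies) can be sketched as below. The journal rows are invented for illustration and were chosen only so that the medians mirror the reported 8%, 20%, and 25%; they are not the study's data, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (journal, policy strength, share of papers with a GEO link)
JOURNALS = [
    ("Journal A", "none", 0.05),
    ("Journal B", "none", 0.11),
    ("Journal C", "weak", 0.18),
    ("Journal D", "weak", 0.22),
    ("Journal E", "strong", 0.20),
    ("Journal F", "strong", 0.25),
    ("Journal G", "strong", 0.40),
]

def median_prevalence_by_policy(rows):
    """Group journals by policy strength and take the median prevalence per group."""
    groups = defaultdict(list)
    for _journal, policy, prevalence in rows:
        groups[policy].append(prevalence)
    return {policy: median(values) for policy, values in groups.items()}

medians = median_prevalence_by_policy(JOURNALS)
for policy in ("none", "weak", "strong"):
    print(policy, round(medians[policy], 2))
```

Medians are used rather than means so that a single journal with unusually high sharing, such as the hypothetical Journal G, does not dominate its group.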
Abstract: Literature searches, systematic reviews, and text mining require identifying articles based on full-text content.
The full text of published biomedical articles contains valuable information not found in abstracts or MeSH terms.
Full-text literature is increasingly available for query.
PubMed Central, Highwire Press and Google Scholar are growing fast, thanks to the NIH public access mandate.
However, it is difficult to formulate effective full-text queries manually.
Prose and identifiers have large variation, and full-text portals are not designed for query evaluation.
Current full text retrieval research does not address this problem.
Cutting-edge systems developed for information retrieval and extraction require complete computational access to a full-text corpus for preprocessing, which publisher licenses rarely allow.
We propose using open access literature to formulate queries for use in full-text portals.
We can use open access articles to identify synonyms and lexical variants, tune performance, and generate queries compatible with full-text portal query languages.
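A minimal sketch of the idea above: harvest lexical variants of a term from open access text, then emit a boolean OR query. The variant pattern, the sample sentences, and the generic quoted-phrase OR syntax are assumptions for illustration; the exact query language accepted by any particular full-text portal is not covered here.

```python
import re
from collections import Counter

# Hypothetical pattern covering spelling variants of one term.
VARIANT_PATTERN = re.compile(r"\bmicro[\s-]?arrays?\b", re.I)

def collect_variants(documents, pattern=VARIANT_PATTERN):
    """Count each distinct surface form (lowercased, whitespace-normalized)."""
    counts = Counter()
    for text in documents:
        for match in pattern.finditer(text):
            counts[" ".join(match.group(0).lower().split())] += 1
    return counts

def build_or_query(variants):
    """Join variants into a generic quoted-phrase OR query string."""
    quoted = ['"{}"'.format(v) for v in sorted(variants)]
    return "(" + " OR ".join(quoted) + ")"

# Toy stand-ins for open access full text.
OPEN_ACCESS_TEXTS = [
    "RNA was hybridized to microarrays.",
    "Micro-array data were normalized.",
    "Each micro array was scanned twice.",
]

variants = collect_variants(OPEN_ACCESS_TEXTS)
query = build_or_query(variants)
print(query)
```

Because the counts are kept, rare variants could be dropped before building the query if a portal limits query length.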
Abstract: Repurposing research data holds many benefits for the advancement of biomedicine, yet is very difficult to measure and evaluate. We propose a data reuse registry to maintain links between primary research datasets and studies that reuse this data. Such a resource could help recognize investigators whose work is reused, illuminate aspects of reusability, and evaluate policies designed to encourage data sharing and reuse.
Abstract: Sharing research data is a cornerstone of science. Although many tools and policies exist to encourage data sharing, the prevalence with which datasets are shared is not well understood. We report our preliminary results on patterns of sharing microarray data in public databases.
Notes: Data and calculations: http://www.researchremix.org/data/PSB2008%20Piwowar%20Data.zip
Abstract: Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.
Notes: Data and calculations: http://www.researchremix.org/data/ISMB2007%20Piwowar%20Data.zip
Abstract: Presented at ASIS&T 2009 in the student awards section. The presentation contains an overview of my dissertation proposal, as 2009 winner of the Thomson Reuters Information Science Doctoral Dissertation Proposal Scholarship, administered by the ASIS&T Information Science Education Committee.
Abstract: Why measure the adoption of Open Science?
As we seek to embrace and encourage participation in open science, understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices. Measuring adoption over time will allow us to note progress and identify opportunities to learn and improve. It is also just plain interesting to see where we are, where we aren’t, and where we might go!
What can we measure?
Many attributes of open science can be studied, including open access publications, open source code, open protocols, open proposals, open peer-review, open notebook science, open preprints, open licenses, open data, and the publishing of negative results. This presentation will focus on measuring the prevalence with which investigators share their research datasets.
What measurements have been done? How? What have we learned?
Various methods have been used to assess adoption of open science: reviews of policies and mandates, case studies of experiences, surveys of investigators, and analyses of demonstrated data sharing behavior. We’ll briefly summarize key results.
Future research?
The presentation will conclude by highlighting future research areas for enhancing and applying our understanding of open data adoption.
Abstract: A presentation to the DBMI department at the University of Pittsburgh about data sharing and reuse: what this means, why it is important, some of what we’ve learned, and what we still don’t know.
Abstract: Heather A Piwowar, Roger S Day, Douglas B Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate PLoS ONE 2: 3. e308
Abstract: Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.
Abstract: When looking for citation indices within biomedicine for aggregate analysis, we suggest calculating them using data from PubMed, PubMed Central, and the Author-ity name disambiguation engine. We call this the "pubmedi" approach, and explore it further here.
Abstract: We conducted a pilot annotation study with Amazon’s Mechanical Turk to estimate the accuracy with which annotation tasks can be performed by this group of non-experts, the number of independent annotations necessary to get sufficient generalizability, and the cost of annotation.
Abstract: The realization of a new wireline acquisition front end has made it possible for Schlumberger to redesign its uphole telemetry receiver. In order to achieve data rates of 500 kbits/second over standard oil well logging cables, the Digital Telemetry System uses Quadrature Amplitude Modulation (QAM) to transmit its measurement data. The demodulator involves timing recovery, filtering, cable equalization, and symbol decoding. The purpose of this thesis is to simulate the demodulator, thereby documenting the demodulation process and creating a design tool that can be used to design future QAM telemetry systems.
Notes: Completed in fulfillment of a Masters of Engineering degree, on co-op assignment at Schlumberger Austin Research Center.
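Of the demodulator stages listed in the thesis abstract above, symbol decoding is the simplest to illustrate. The sketch below assumes a square 16-QAM constellation with amplitude levels -3, -1, 1, 3 on each axis; the abstract does not state the constellation size, so this is an assumption, and the function names are hypothetical. Timing recovery, filtering, and cable equalization are not shown.

```python
# Assumed 16-QAM amplitude levels on each axis (not stated in the abstract).
LEVELS = (-3, -1, 1, 3)

def nearest_level(x):
    """Return the constellation level closest to one received coordinate."""
    return min(LEVELS, key=lambda level: abs(x - level))

def decide_symbol(sample):
    """Map one equalized complex sample to the nearest constellation point."""
    return complex(nearest_level(sample.real), nearest_level(sample.imag))

# Toy equalized samples with some noise on each axis.
received = [complex(2.7, -1.2), complex(-0.8, 0.9), complex(3.4, 3.1)]
decoded = [decide_symbol(s) for s in received]
print(decoded)
```

For a square constellation, deciding the in-phase and quadrature coordinates separately gives the same result as searching all 16 points for the minimum Euclidean distance.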