Copy, Shake, and Paste: Computational Research Integrity 2021

This week I am attending (and presenting at) the Computational Research Integrity Conference 2021, which is sponsored by the US Office of Research Integrity. I will try to record the highlights here.

The purpose of this conference is to bring computer scientists together with RIOs (Research Integrity Officers) so that a good discussion and exchange about tools for dealing with research integrity issues can take place. The conference was organized by Daniel Acuna from Syracuse University.

Day 1: 23 March 2021 - Day 2 - Day 3

Ranjini Ambalavanar, the Acting Director of the Division of Investigative Oversight at ORI, kicked off the conference by explaining the workflow at ORI from allegation to decision. It takes a long time, and there are many things to think about, from saving a copy of perhaps a terabyte of data to discovering small discrepancies. She showed us a few cases of really blatant fabrication of data and falsification of images, some of which can be found with simple statistical analysis or image investigation. ORI has a list of some forensic tools they use to produce evidence for their cases. She pleaded with computer scientists to produce more and better tools.
Jennifer Byrne presented some research that she is doing with Cyril Labbé on gene sequences that are used in cancer research. They found numerous papers that claimed to be using specific genes for some purpose, but were actually not using them correctly, or that stated they were using one gene but were actually using another. Genes are expressed with long strings of characters representing the bases involved. These sequences are not easily human-understandable, but are easy to find in publications. They have a tool, "Seek / Blastn", that looks for nucleotide sequences in PDFs and queries each sequence against the human genomic + transcript database to output a human-readable name for the sequence, which helps expose problems.
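To give an idea of why such sequences are "easy to find in publications", here is a minimal sketch of the extraction step only. It is not the code of the actual tool: the minimum length of 15 letters, the pattern, and the sample sentence are all my own assumptions, and the database lookup is left out entirely.

```python
import re

# Assumption: a candidate sequence is an unbroken run of at least MIN_LEN
# nucleotide letters (A, C, G, T, or U) that is not part of a longer word.
MIN_LEN = 15
SEQ_PATTERN = re.compile(r"(?<![A-Za-z])[ACGTU]{%d,}(?![A-Za-z])" % MIN_LEN)


def find_sequences(text):
    """Return candidate nucleotide sequences found in plain text."""
    return SEQ_PATTERN.findall(text)


# Hypothetical sentence, as it might appear in a methods section.
sample = (
    "The forward primer was 5'-GATTACAGATTACAGATTACA-3' and the "
    "control siRNA was ACGUACGUACGUACGUACG."
)

for seq in find_sequences(sample):
    print(len(seq), seq)
```

Each extracted string would then have to be checked against a sequence database to see whether it really belongs to the gene that the paper names. That is the hard part, and it is not shown here.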
Lauren Quakenbush & Corinna Raimundo are RIOs from Northwestern University. They train young researchers with RI bootcamps and gave us some good insights into how research misconduct investigations are done for serious deviations at a university in the USA. They have many new issues that are arising: an increasing volume of data that needs to be sequestered (terabytes!), unrealistic time frames, measures to protect the identity of whistleblowers, determining responsibility and intent, co-authors who are at other institutions, respondents who leave the university, the litigious nature of the cases, communication with journals, and so on. Germany really needs to see that they need staff and resources and not just name a lone RIO....
Kyle Siler gave a short presentation about predatory publishing. He began by making it clear that predatory publishing is not a binary attribute, but quite a spectrum of problematic publishing. He spoke of some fascinating investigations that he is doing in trying to identify what is meant by a predatory publisher. He scraped a large database of metadata from various publishers and is trying to measure some things like time-to-publish and number of articles published per year. His slides flew by so fast and I was so engrossed that I forgot to take any snapshots. He found one very strange oddity while cleaning his data: a presumably predatory journal that scraped an article from a reputable journal with Global North authors, and reprinted it. BUT: they made some odd formatting mistakes and some VERY odd substitutions (like the first name "Nancy" becoming "urban center"). He assumes that the journal is trying to build up an image of looking respectable in order to gain APC-paying customers. Some are even back-dated, so that the true publication looks like a duplicate publication, or even a plagiarism.
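The two measures mentioned, time-to-publish and articles per year, are simple to compute once the metadata has been scraped. The following is only a sketch with made-up records: the journal names, the field names, and the use of the median are my assumptions, not a description of Kyle's actual data or method.

```python
from collections import Counter
from datetime import date
from statistics import median

# Hypothetical scraped metadata; the field names are assumptions.
articles = [
    {"journal": "Journal A", "received": date(2020, 1, 10), "published": date(2020, 6, 20)},
    {"journal": "Journal A", "received": date(2020, 3, 1), "published": date(2020, 9, 15)},
    {"journal": "Journal B", "received": date(2020, 5, 1), "published": date(2020, 5, 8)},
    {"journal": "Journal B", "received": date(2020, 5, 3), "published": date(2020, 5, 9)},
    {"journal": "Journal B", "received": date(2021, 2, 1), "published": date(2021, 2, 5)},
]


def days_to_publish(article):
    """Number of days between submission and publication."""
    return (article["published"] - article["received"]).days


def summarize(records):
    """Group records by journal and compute the two measures per journal."""
    by_journal = {}
    for article in records:
        by_journal.setdefault(article["journal"], []).append(article)

    summary = {}
    for journal, items in by_journal.items():
        summary[journal] = {
            "median_days": median(days_to_publish(a) for a in items),
            "per_year": dict(Counter(a["published"].year for a in items)),
        }
    return summary


summary = summarize(articles)
for journal, stats in summary.items():
    print(journal, stats)
```

A very short median time-to-publish or a sudden jump in articles per year does not prove anything on its own, which fits the point that predatory publishing is a spectrum and not a binary attribute.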
Edward J. Delp described a tool for image forensics that he is developing with a large research group at Purdue + other governmental organizations, in particular with Wanda Jones from ORI. His Scientific Integrity System seems to be just what many of the RIOs need, and they wanted to know when he will be releasing the system! The problem is that it can probably only be used by people working for the US government, not for real cases, for legal reasons apparently involving the US military. But he has a user manual online: https://skynet.ecn.purdue.edu/~ace/si/manual/user-manual-scientific-integrity-v5.pdf and a demo video: https://skynet.ecn.purdue.edu/~ace/si/video/sci-int-system-demo_v5.mp4. He uses Generative Adversarial Networks to produce synthetic data for training his neural networks. They use retracted papers with images and non-retracted ones for populating their database.
David Barnes noted that getting annotations off of PDFs is not easy. Ed replied that it is indeed hard, but his group knows how to do it!
Update 2021-03-26: Wanda wrote to me to make it clear that it is of course an entire team of people at ORI and NIH who are working with Ed on this project. She also notes:
"the reason we’re not using it on active ORI cases is because of evidence integrity standards, and a federal computing requirement that we operate within the HHS Cloud environment with anything involving personally identifiable information (PII). New systems must undergo rigorous testing before they can “go live” in our internal environment. (Even commercial products must be reviewed, though it’s not as arduous as a newly-developed product.) Purdue hosts the system in its own secure cloud, but we cannot put information that might identify anyone named in our active cases into a non-HHS system. We have full freedom to develop what we need, though, using the thousands of published/retracted PDFs and other file formats that Ed and his team have assembled, including a growing library of GANS-generated images. We couldn’t be more excited about where this is going, and we’re hopeful we can go live in the next year or so. We’re exploring how best to do that.

Further, we’re not restricted because of any military uses of the technology – everything being used with it is published and/or open-source and has been, for years. We’re merely benefitting from DARPA’s years of investment in technology (albeit for national security purposes) that clearly has other worthwhile uses. After all, DARPA gave us the internet (for better or worse)! "
Thank you, Wanda, for clearing this up!
Panel 1: We segued right into the first panel with institutional investigators. [Note to the conference organizers: More and longer breaks needed!] Wanda Jones, the deputy director of ORI, moderated the session with William C. Trenkle from the US Department of Agriculture, Wouter Vandevelde from KU Leuven in Belgium, and Mary Walsh from Harvard Medical School. They again picked up on the problem of the overwhelming amount of data and file management they have to deal with. The panel members briefly presented the processes at their institutions. Will noted that there is no government-wide definition of scientific integrity, although I am very pessimistic about any government deciding on a definition of anything. I was impressed that they made it clear that any analyses are only done on copies, never on the original data itself, and that the tools that they use only detect problems; they do not decide if academic misconduct has taken place. A lively discussion raged on in the chat, with Dorothy Bishop noting that the young researchers are the ones who come to research with high ideals and get corrupted as they work. Will noted that agriculture integrity issues are different from medical ones, stating that it is a bit more difficult to fake 2000-pound cows than mice. Ed offered to generate an image of one for him. I really want to see that! It was made clear that there has to be some senior person, be it an older academic or a vice president of research, who protects the RIOs when they are investigating cases, particularly if "cash cows" of the university are under suspicion. [Will got us started on cows, I wonder how many there will be tomorrow!] I asked what one thing the panelists would wish for if a fairy godmother were to come along and grant that wish. The wishes were for a one-stop image forensic tool, more resources, and the desire for people who commit FFP (Fabrication, Falsification, Plagiarism) to have their hands start to burn 🙌.
The chat started discussing degrees of burn, starting with a light burn for QRP (questionable research practices) and a harder burn for FFP 😂.
Panel 2: We were awarded a 5-minute break before the next round, so I made it to the refrigerator for a hunk of cheese. Bernd Pulverer from EMBO Press was moderating the panel with Renee Hoch from PLOS, IJsbrand Jan Aalbersberg from Elsevier and Maria Kowalczuk from Springer Nature. Renee detailed the pre-publication checks that they run in order to siphon off as much problematic material as possible before publication. IJsbrand had some nice statistics about the causes for retractions from Elsevier journals. There are more author-reported scientific errors causing retractions, so this helps make it clear that retractions do not always mean academic misconduct. About 20 % of the retractions are for plagiarism, and image manipulation accounts for 10-20 %. Bernd was of the opinion that plagiarism is infrequent, so I was back over at my slides, which I must have changed 17 times during the talks, to include a statement that it is NOT infrequent, just not found. He noted that it costs about $8000 per article published in Nature, because so many are evaluated and rejected. An interesting question from Dorothy Bishop was: What do we do if editors are corrupt? There was much discussion in the chat about how to find an appropriate address to report issues and how journals cooperate with institutions. A number of people want to move to versioning in publishing, something I find abhorrent unless there is a wiki-like way of being able to specify the exact version of an article that you are quoting. IJsbrand had a list of twelve (!) grades of ~~retraction~~ corrective tools ranging from in-line correction to retraction. The fairy godmother was now granting two wishes, which brought out things like a fingerprinting or microattribution tool (it's called a wiki, people, you can see exactly who did what when to the article), a user verification tool (sometimes people are listed as authors who do not know that they are being listed), an iThenticate for images, and so on.
It was also noted that once the NIH started suing universities for research misconduct, the universities started perking up and getting on with research integrity training. Money focuses university administration minds!

There were various interesting links that popped up in the chat on Day 1. I gave up trying to put them in context and just started a bullet list here:

Kyle Siler, Philippe Vincent-Lamarre, Cassidy R. Sugimoto and Vincent Larivière, "The Lacuna Database: Empirical Data to Identify Obscure, Unconventional, Questionable and/or Predatory Journals"
2018 UK Parliamentary report on Research Integrity - I think every country needs one of these, especially Germany! The oral evidence of Dorothy Bishop is great.
A Nature news feature from 23 March 2021: The fight against fake-paper factories that churn out sham science
NIST Digital/Multimedia Scientific Area Committee
Reducing the Inadvertent Spread of Retracted Science - Paper: Reducing the Inadvertent Spread of Retracted Science: Shaping a Research and Implementation Agenda
Preprint: Towards minimum reporting standards for life scientists
bioRxiv: Amending published articles: time to rethink retractions and corrections?
Hot topic: The Economics of Reproducibility in Preclinical Research: "An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/year spent on preclinical research that is not reproducible—in the United States alone."
- Cochrane Collaboration used figure of USD170 billion (2019) here: https://www.cochrane.org/news/apply-now-cochrane-reward-prize-reducing-waste-research
- This paper from 2014 reported average ~$400k costs per retracted paper, in wasted grant money... https://elifesciences.org/articles/02956
- From https://www.bmj.com/content/308/6924/283 (1994). See also Chalmers, Glasziou http://doi.org/10.1136/bmj.k4645 (2018) and http://doi.org/10.1097/AOG.0b013e3181c3020d (2009)
- Elizabeth Gammon has done good work on economics of misconduct (using retracted articles), e.g. Gammon, E., & Franzini, L. (2013). Research misconduct oversight: Defining case costs. Journal of Health Care Finance, 40(2), 75–99.
- her related dissertation https://mdsoar.org/handle/11603/4071
- [Jodi Schneider] been doing a scoping review of empirical literature about retracted research - there’s a bibliography (up to April 2020) here: https://infoqualitylab.org/projects/risrs2020/bibliography/ and we’re currently screening items up to Feb 2021. I’d love to share what we’ve found with anybody who wants to look more into what’s known about economics of misconduct (based on studies of retracted papers), email jodi@illinois.edu if you want to discuss!
Renee Hoch mentioned a FORCE11 initiative on research data publication ethics, that’s this: https://www.force11.org/group/research-data-publishing-ethics
Example of folks working on training, in the US - National Center for Professional & Research Ethics https://ethicscenter.csl.illinois.edu

I'm exhausted and heading for bed, looking forward to day two!


Wednesday, March 24, 2021

Computational Research Integrity 2021 - Day 1
