Friday, June 11, 2021

ECAIP 2021 - Day 3

 What? It's Friday already? Last day of the conference! Lots of good stuff still to hear.

Day 0 - Day 1 - Day 2 - Day 3

Thursday, June 10, 2021

ECAIP 2021 - Day 2

European Conference on Academic Integrity and Plagiarism 2021

Day 0 - Day 1 - Day 2 - Day 3 

Wednesday, June 9, 2021

ECAIP 2021 - Day 1

It's conference time again! The European Conference on Academic Integrity and Plagiarism 2021 (#ECAIP2021 on Twitter, organized by the European Network for Academic Integrity) has started. I will try to make a few notes on the talks for those who are unable to attend. My plan was to do the blogging on my new iPad and use my Mac with a second screen for attending the conference, but Google won't let me into the blog if I don't give it telephone numbers and shoe sizes, so I'll have to do some juggling here.

Day 0 - Day 1 - Day 2 - Day 3

Thursday, March 25, 2021

Computational Research Integrity - Day 3

Uff. All-day Zoom is tough. We are now on the last day of the Computational Research Integrity Conference 2021. More exciting talks to come!
[This ended up being a mess of links, many apologies. But they are good links, so I am focusing more on documenting them than on summarizing the talks. I probably have some links in twice, sorry about that.]

Day 1 - Day 2 - Day 3: 25 March 2021

Computational Research Integrity - Day 2

All right! I slept in this morning to try to get my body onto New York time instead of Berlin time. I'm looking forward to the talks today; I will be on second, right after Elisabeth Bik. I changed my slides about 17 times yesterday to adapt to the discussions, so it's about time I gave the talk.

Day 1 - Day 2, 24 March 2021 - Day 3

  • Elisabeth Bik, the human image duplication spotter, gave us some great stories: how she got started on this (a plagiarism of her own work), what tools she uses, what tools she wishes she had, and even gave us some images to try and spot ourselves. On her Twitter feed (@MicrobiomDigest) she runs an #imageforensics contest. I'm usually too slow to respond to them. What really puzzles me is: why are people messing with the images? Why not do the experiments for real? Or if you must fake, use a different picture? We just need to let her get her hands on Ed Delp's tool! That would bring her superpowers up to warp speed!
  • I was up next with "Responsible Use of Support Tools for Plagiarism Detection". Elisabeth did a great tweet thread on the talk, thanks! I referred to Miguel Roig's work on self-plagiarism in response to a discussion yesterday. Here's our paper on the test of support tools for plagiarism detection and our web page with all the gory details. And of course, the similarity-texter, a tool for comparing two texts. Sofia Kalaidopoulou implemented it as her bachelor's thesis. It is free, works in your browser, and nicely colors identical text so the differences jump out and hit you in the eye.
  • Michael Lauer from the National Institutes of Health then spoke about "Roles and Responsibilities for Promoting Research Integrity." He fired off a firework of misconduct cases having to do with things like exfiltrating knowledge and research to China or misusing NIH funds, with which I couldn't keep up. Some of the schemes were really brazen! A few that I managed to note: the Darsee affair in the 1980s (article in the New England Journal of Medicine), an internal peer-review tampering case, the Duke University affair around Anil Potti, and a Chinese researcher sentenced for making false statements to federal agents. Espionage seems to be a really big problem!
  • Matt Turek, Program Manager in DARPA's Information Innovation Office (I2O), spoke on "Challenges and Approaches to Media Integrity." He calmly and matter-of-factly presented some absolutely TERRIFYING, bleeding-edge research on image generation. We had seen some of the things Ed Delp spoke about yesterday. But a deepfake video of Richard Nixon appearing to read the speech that was written in case the moon shot (the Apollo 11 mission, which I watched in black and white on my grandmother's TV) ended in tragedy makes me despair that we will ever manage to deal with fake news. Nixon's lips move to the text he is reading; it is almost impossible to tell that this is a fake, except that I know I saw a different ending in my youth. Matt ended with the possibility of "Identity Attacks as a Service", that is, ransomware that threatens to publish real-looking videos of someone unless they pay up. I'm glad his time was up; I'm afraid he had even more deeply unsettling things to show. Much as I personally disagree with a lot of what the military spends money on, this seems to be a good investment.
  • Zubair Afzal spoke on "Improving reproducibility by automating key resource tables". I have no idea what key resource tables are, but it seemed to be useful to biomedical researchers.
  • Colby Vorland, with "Semi-automated Screening for Improbable Randomization in PDFs", attempted to see whether data makes sense by looking at the distribution of p values, which should be uniform if the randomization was done properly. (Note from Elisabeth Bik: see e.g. Carlisle's work on p values in 5,000 RCTs.) He has to go to enormous trouble to scrape table data out of PDFs. I suggested using Abbyy FineReader, which does a good job of OCRing tables. Why, oh why do PDFs not have semantic markup?
  • Panel 3: Funders
    Benyamin Margolis (ORI), Wenda Bauchspies (NSF), Michael Lauer (NIH), and Matt Turek (DARPA) discussed various aspects of the funding of research integrity research. All sorts of topics were addressed with the links flying in the chat as usual:
    Report Fraud, Waste, Abuse, or Whistleblower Reprisal to the NSF OIG - A link to help PIs prepare to teach or learn more about RCR. - NIH Policy for Data Management and Sharing - Deep Nostalgia - The Heilmeier Catechism - Find US government funding - Build and Broaden for encouraging diversity, equity and inclusion - DORA. The tabs I still have open probably came from this session, they are in the bullet list below. 
  • Daniel Acuna and Benyamin Margolis introduced a competition: Artificial Intelligence for Computational Research Integrity. ORI is offering a grant (ORIIR200062 Large-scale High-Quality Labeled Datasets and Competitions to Advance Artificial Intelligence for Computational Research Integrity) for running the competition.
  • Panel 4: Tool Developers
    Daniel Acuna (Syracuse University), Jennifer Byrne (University of Sydney), James Heathers (Cipher Skin), and Amit K. Roy-Chowdhury (UC Riverside) were discussing.
    Jennifer and Cyril Labbé have published their protocol for using Seek & Blastn, and they have a paper on biomedical journal responses that closely mirrors my own experiences.
    James talked about his four projects GRIM (Preprint), SPRITE (Preprint), DEBIT, and RIVETS. His statistical work should scare the daylights out of data fabricators. As he points out: by the time they falsify their data to fit the statistical models, they might as well have done the experiments.
    Amit spoke a bit more in depth about the work Ghazal presented yesterday and the challenges involved in developing an image analysis tool.
    Daniel talked about Dr. Figures (Preprint).
    Someone (I didn't catch who, James?) said "Death to PDF!" Indeed, or rather, it needs to be easily parseable so that we can easily mine metadata, get the text and images separated, etc. Cyril posted a link to a good PDF extractor in the chat, I shall look into this very soon. 
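The statistical idea behind Colby's randomization screening can be sketched in a few lines: under a properly randomized null, baseline p values should be uniformly distributed on [0, 1], so a goodness-of-fit statistic against that uniform distribution flags suspicious sets. This is my own minimal, hand-rolled illustration of a one-sample Kolmogorov-Smirnov statistic, not Colby's actual code:

```python
def ks_uniform(pvalues):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1).

    A large D means the p values stray far from the uniform
    distribution expected under proper randomization.
    """
    xs = sorted(pvalues)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Compare the empirical CDF steps around each point with the
        # Uniform(0, 1) CDF, which is simply F(x) = x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d
```

In practice one would turn D into a p value (e.g. with scipy.stats.kstest) and, as Colby pointed out, first solve the much harder problem of getting the baseline tables out of the PDFs at all.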
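James's GRIM test is simple enough to sketch as well: a mean of n integer-valued responses (like Likert scores) can only take the values k/n, so many reported means are arithmetically impossible. A minimal reconstruction of the check (my own sketch, not James's implementation):

```python
def grim_consistent(reported_mean, n, decimals=2):
    """GRIM check: can a mean of n integer values round to reported_mean?

    Only integer sums near reported_mean * n need to be tried,
    since the true sum of integer responses must itself be an integer.
    """
    target = round(reported_mean, decimals)
    approx = int(reported_mean * n)
    return any(
        round(s / n, decimals) == target
        for s in range(approx - 1, approx + 2)
    )
```

For example, with n = 28 integer responses a reported mean of 5.18 is possible (145/28 ≈ 5.1786), but 5.19 is not: no integer sum divided by 28 rounds to 5.19.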

Links to things in tabs I still have open that someone put in the chat at some time:

And now for a terribly geeky note on my talk. One thing that has bothered me about presenting online with Zoom is that I couldn't see my notes. I use Keynote on a MacBook Pro, and it either assumes the second screen is a projector (and I can't talk it out of it), or I can only present on my laptop. And I can either share the laptop screen or the second screen on the Mac. There has to be a better way! So I googled yesterday, and found this lovely article with the exact solution to my problem: How to use Keynote’s new Play Slideshow in Window feature with videoconferencing services.

I had just upgraded my iPad to a new operating system, so my Mac (Catalina) needed to install some doohickey. Then all I had to do was: start the Keynote slideshow in a window on my laptop, share only Keynote on Zoom, and click on the little remote thingy in Keynote on the iPad. I then set the iPad down on my keyboard, and I had the audience on Zoom (and myself, to make sure I'm still in the camera view when speaking) on my second screen behind the laptop, my slides on the laptop screen, and on the iPad I selected the view with my notes and the next slide! How utterly perfect! I just needed to tap anywhere on the iPad to advance the slide. If I needed to go back, I could tap on the slide number and it would open up a long string of slides for me to choose how far back I wanted to go. It felt so good being in complete control, although I didn't have any brain cells left to read the chat, as I normally do when presenting. I'll relax once I've learned that this really does work. So thank you!


Wednesday, March 24, 2021

Computational Research Integrity 2021 - Day 1

This week I am attending (and presenting at) the Computational Research Integrity Conference 2021, which is sponsored by the US Office of Research Integrity. I will try to record the highlights here.

The purpose of this conference is to bring computer scientists together with RIOs (Research Integrity Officers) so that a good discussion and exchange about tools for dealing with research integrity issues can take place. The conference was organized by Daniel Acuna from Syracuse University.

Day 1: 23 March 2021 - Day 2 - Day 3

  • Ranjini Ambalavanar, the Acting Director of the Division of Investigative Oversight at ORI, kicked off the conference by explaining the workflow at ORI from allegation to decision. It takes a long time, and there are many things to think about, from saving a copy of perhaps a terabyte of data to discovering small discrepancies. She showed us a few cases of really blatant fabrication of data and falsification of images, some of which can be found with simple statistical analysis or image investigation. ORI has a list of some forensic tools they use to produce evidence for their cases. She pleaded with computer scientists to produce more and better tools.
  • Jennifer Byrne presented some research that she is doing with Cyril Labbé on gene sequences that are used in cancer research. They found numerous papers that claimed to be using specific genes for some purpose, but were actually not using them correctly, or that stated they were using one gene but actually used another. Gene sequences are written as long strings of characters representing the bases involved. These sequences are not easily human-readable, but they are easy to find in publications. Their tool, Seek & Blastn, looks for nucleotide sequences in PDFs and queries each sequence against the human genomic and transcript database to output a human-readable name, which helps problems show up.
  • Lauren Quakenbush & Corinna Raimundo are RIOs from Northwestern University. They train young researchers in RI bootcamps and gave us some good insights into how research misconduct investigations for serious deviations are done at a university in the USA. Many new issues are arising: an increasing volume of data that needs to be sequestered (terabytes!), unrealistic time frames, measures to protect the identity of whistleblowers, determining responsibility and intent, co-authors who are at other institutions, respondents who leave the university, the litigious nature of the cases, communication with journals, and so on. Germany really needs to see that RIOs need staff and resources, not just a lone person named to the role....
  • Kyle Siler gave a short presentation about predatory publishing. He began by making it clear that predatory publishing is not a binary attribute, but quite a spectrum of problematic publishing. He spoke of some fascinating investigations that he is doing in trying to identify what is meant by a predatory publisher. He scraped a large database of metadata from various publishers and is trying to measure things like time-to-publish and number of articles published per year. His slides flew by so fast, and I was so engrossed, that I forgot to take any snapshots. He found one very strange oddity while cleaning his data: a presumably predatory journal that scraped an article from a reputable journal with Global North authors and reprinted it. BUT: they made some odd formatting mistakes and some VERY odd substitutions (like the first name "Nancy" becoming "urban center"). He assumes that the journal is trying to build up an image of looking respectable in order to gain APC-paying customers. Some are even back-dated, so that the true publication looks like a duplicate publication, or even a plagiarism.
  • Edward J. Delp described a tool for image forensics that he is developing with a large research group at Purdue and other governmental organizations, in particular with Wanda Jones from ORI. His Scientific Integrity System seems to be just what many of the RIOs need; they wanted to know when he will be releasing it! The problem is that it can probably only be used by people working for the US government, not for real cases, for legal reasons apparently involving the US military. But he has a user manual and a demo video online. He uses Generative Adversarial Networks to produce synthetic data for training his neural networks. They use retracted papers with images, as well as non-retracted ones, to populate their database.
    David Barnes noted that getting annotations off of PDFs is not easy; Ed replied that it is indeed hard, but his group knows how to do it!
    Update 2021-03-26: Wanda wrote to me to make it clear that it is of course an entire team of people at ORI and NIH who are working with Ed on this project. She also notes:
    "the reason we’re not using it on active ORI cases is because of evidence integrity standards, and a federal computing requirement that we operate within the HHS Cloud environment with anything involving personally identifiable information (PII).  New systems must undergo rigorous testing before they can “go live” in our internal environment.  (Even commercial products must be reviewed, though it’s not as arduous as a newly-developed product.)  Purdue hosts the system in its own secure cloud, but we cannot put information that might identify anyone named in our active cases into a non-HHS system.  We have full freedom to develop what we need, though, using the thousands of published/retracted PDFs and other file formats that Ed and his team have assembled, including a growing library of GANS-generated images.  We couldn’t be more excited about where this is going, and we’re hopeful we can go live in the next year or so.  We’re exploring how best to do that.
    Further, we’re not restricted because of any military uses of the technology – everything being used with it is published and/or open-source and has been, for years.  We’re merely benefitting from DARPA’s years of investment in technology (albeit for national security purposes) that clearly has other worthwhile uses.  After all, DARPA gave us the internet (for better or worse)! "
    Thank you, Wanda, for clearing this up!
  • Panel 1: We segued right into the first panel with institutional investigators. [Note to the conference organizers: more and longer breaks needed!] Wanda Jones, the deputy director of ORI, moderated the session with William C. Trenkle from the US Department of Agriculture, Wouter Vandevelde from KU Leuven in Belgium, and Mary Walsh from Harvard Medical School. They again picked up on the problem of the overwhelming amount of data and file management they have to deal with. The panel members briefly presented the processes at their institutions. Will noted that there is no government-wide definition of scientific integrity, although I am very pessimistic about any government deciding on a definition of anything. I was impressed that they made it clear that analyses are only ever done on copies, never on the original data itself, and that the tools they use only detect problems; they do not decide whether academic misconduct has taken place. A lively discussion raged on in the chat, with Dorothy Bishop noting that young researchers are the ones who come to research with high ideals and get corrupted as they work. Will noted that agriculture integrity issues are different from medical ones, stating that it is a bit more difficult to fake 2000-pound cows than mice. Ed offered to generate an image of one for him; I really want to see that! It was made clear that there has to be some senior person, be it an older academic or a vice president of research, who protects the RIOs when they are investigating cases, particularly if "cash cows" of the university are under suspicion. [Will got us started on cows, I wonder how many there will be tomorrow!] I asked what one thing the panelists would wish for if a fairy godmother were to come along and grant that wish. The wishes were for a one-stop image forensic tool, more resources, and the desire for people who commit FFP (Fabrication, Falsification, Plagiarism) to have their hands start burning 🙌.
The chat started discussing degrees of burn, starting with a light burn for QRP (questionable research practices) and a harder burn for FFP 😂.
  • Panel 2: We were awarded a 5-minute break before the next round; I made it to the refrigerator for a hunk of cheese. Bernd Pulverer from EMBO Press moderated the panel with Renee Hoch from PLOS, IJsbrand Jan Aalbersberg from Elsevier, and Maria Kowalczuk from Springer Nature. Renee detailed the pre-publication checks that they run in order to siphon off as much problematic material as possible before publication. IJsbrand had some nice statistics about the causes of retractions from Elsevier journals. More retractions are caused by author-reported scientific errors, which helps make it clear that retractions do not always mean academic misconduct. About 20% of retractions are for plagiarism, and image manipulation accounts for 10-20%. Bernd was of the opinion that plagiarism is infrequent, so I was back over at my slides, which I must have changed 17 times during the talks, to include a statement that it is NOT infrequent, just not found. He noted that it costs about $8000 per article published in Nature, because so many are evaluated and rejected. An interesting question from Dorothy Bishop was: what do we do if editors are corrupt? There was much discussion in the chat about how to find an appropriate address for reporting issues and how journals cooperate with institutions. A number of people want to move to versioning in publishing, something I find abhorrent unless there is a wiki-like way of specifying the exact version of an article that you are quoting. IJsbrand had a list of twelve (!) grades of corrective tools, ranging from in-line correction to retraction. The fairy godmother was now granting two wishes, which brought out things like a fingerprinting or microattribution tool (it's called a wiki, people, you can see exactly who did what when to an article), a user verification tool (sometimes people are listed as authors who do not know that they are being listed), an iThenticate for images, and so on.
It was also noted that once the NIH started suing universities for research misconduct, they started perking up and getting on with research integrity training. Money focuses university administration minds!
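The first step of a Seek & Blastn-style pipeline, as Jennifer described it, is pulling candidate nucleotide sequences out of paper text. That step can be sketched with a simple regular expression; this is my own illustration, not the actual tool, and the minimum length of 15 bases is an arbitrary assumption on my part:

```python
import re

# Runs of the four DNA bases long enough to be a primer or probe.
NUCLEOTIDE_RUN = re.compile(r"\b[ACGT]{15,}\b")

def find_sequences(text):
    """Return candidate nucleotide sequences found in plain text."""
    return NUCLEOTIDE_RUN.findall(text)
```

The real tool then queries each hit against the human genomic and transcript databases via BLAST to check whether the claimed identity of the sequence matches what it actually is.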

There were various interesting links that popped up in the chat on Day 1, I gave up trying to put them in context and just started a bullet list here:

I'm exhausted and heading for bed, looking forward to day two!

Saturday, March 13, 2021

Rector of Turkish university accused of plagiarism

There was an article in the German taz this weekend (13/14 March 2021) about the rector of Boğaziçi University in Turkey that just briefly mentioned that there have been plagiarism allegations against him. Turns out that Elisabeth Bik has already done a deep dive into the allegations in her Science Integrity Digest blog. She has documented a substantial amount of plagiarism. Note: I didn't develop the similarity-texter; that was the work of my student Sofia Kalaidopoulou, adapting and enhancing code published by Dick Grune. It is a great tool for documenting plagiarism, though!

There is a brief article in duvaR, an English-language news site about Turkey, and the Times Higher Education also reported on this in January 2021. 

The rector, of course, considers all this slander, stating that it's only about a few missing quotation marks and that citation styles have changed since he wrote his thesis.

Styles may change, but it has been the case for quite some time that you have to make a clear distinction between someone else's words and your own. Just slamming a reference on the end of a paragraph or putting it in the reference list does not cut it.