Table of Contents: 2025 JULY - AUGUST No. 465
Does Screen Scraping ClinicalTrials.gov Work?
NLM Tech Bull. 2025 Jul-Aug;(465):e4.
Screen scraping involves extracting data from a website by mimicking the actions a user would take when interacting with the website, such as clicking buttons and moving through pages. Data is captured from the visual content of the user interface or from the HTML code. This technique is used when direct access to a website's data through an API isn't available, when comparing data from different sources, or when trying to reach data that isn't otherwise easily available. Screen scraping works by combining different software programs and character recognition technology to collect data from a website.
Some end users and organizations have used screen scraping tools on ClinicalTrials.gov in an attempt to extract data from a single study or from a group of studies. The cURL command is a popular, open-source command line utility for interacting with servers, and it can be used to extract data from websites. However, when the cURL command is used to try to access data from a single study on ClinicalTrials.gov, it provides limited results. This limitation happens because the modernized ClinicalTrials.gov is a Single Page Application (SPA). An SPA is a website that has only one HTML page, which is constantly updated based on user interactions. When a user attempts to extract data from ClinicalTrials.gov using screen scraping technology, the response to any URL request is not the actual HTML page but bootstrap JavaScript code, which is the code the web browser uses to assemble and present a fully functional webpage containing data about the study.
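To illustrate why a scraper gets so little from an SPA, the following Python sketch strips script and style content from a page and returns whatever text remains. The SPA_SHELL markup is a made-up stand-in for a bootstrap page, not the actual ClinicalTrials.gov response, and VisibleText and visible_text are hypothetical helper names.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect page text while ignoring the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script or style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the non-script text found in an HTML string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Illustrative SPA shell (an assumption, not copied from any real site):
# the study data is absent because the browser would load it later.
SPA_SHELL = """<html><head><title>App</title>
<script>window.boot = function () { /* assembles the page in the browser */ };</script>
</head><body><div id="root"></div><script src="/main.js"></script></body></html>"""

print(repr(visible_text(SPA_SHELL)))  # only the page title survives
```

In this toy case the only text left is the page title, which mirrors the limited results described above: the study content is not in the response at all.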
The best way to obtain data about a single study is to use the ClinicalTrials.gov open-access API.
In the CURL tab, you will see a URL. An example of this is below.
$ curl -X GET "https://clinicaltrials.gov/api/v2/studies/NCT02993146"
Now the output is the actual usable study data in JSON format.
StudyIdInfo":{"id":"212494"},"secondaryIdInfos":[{"id":"2020-000753-28","type":"EUDRACT_NUMBER"}],"organization":{"fullName":"GlaxoSmithKline","class":"INDUSTRY"},"briefTitle":"Efficacy Study of GSK's Investigational Respiratory Syncytial Virus (RSV) Vaccine in Adults Aged 60 Years and Above","officialTitle":"A Phase 3, Randomized, Placebo-controlled, Observer-blind, Multi-country Study to Demonstrate the Efficacy of a Single Dose and Annual Revaccination Doses of GSK's RSVPreF3 OA Investigational Vaccine in Adults Aged 60 Years and Above"},"statusModule":{"statusVerifiedDate":"2024-09","overallStatus":"COMPLETED","expandedAccessInfo":{"hasExpandedAccess":false},"startDateStruct":{"date":"2021-05-25","type":"ACTUAL"},"primaryCompletionDateStruct":{"date":"2022-04-11","type":"ACTUAL"},
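The same request can be made from a script. The following is a minimal Python sketch using only the standard library; the helper names (study_url, fetch_study, find_key) are hypothetical. Because the excerpt above is truncated, the sketch does not assume the full nesting of the JSON. Instead it searches the parsed response for key names that appear in the excerpt, such as briefTitle and overallStatus.

```python
import json
import urllib.request

# Base path taken from the curl example above.
API_BASE = "https://clinicaltrials.gov/api/v2/studies"


def study_url(nct_id):
    """Build the single-study URL for an NCT identifier."""
    return f"{API_BASE}/{nct_id}"


def fetch_study(nct_id, timeout=30):
    """Send an HTTP GET for one study and parse the JSON response."""
    with urllib.request.urlopen(study_url(nct_id), timeout=timeout) as resp:
        return json.load(resp)


def find_key(node, key):
    """Depth-first search for the first value stored under `key`.

    Returns None when the key is absent anywhere in the structure.
    """
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Fragment copied from the sample output above, used here so the
# lookup can be tried without a network connection.
SAMPLE = {"statusModule": {"statusVerifiedDate": "2024-09",
                           "overallStatus": "COMPLETED"}}

print(find_key(SAMPLE, "overallStatus"))  # COMPLETED
```

With network access, find_key(fetch_study("NCT02993146"), "briefTitle") would apply the same lookup to the live response.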
Some users have scraped ClinicalTrials.gov to try to extract data on a specific disease or condition. They do this with an automated process that repeatedly enters a condition into the search box on the main search page at a frequency that far exceeds human capabilities.
To obtain data about clinical studies for a specific condition or disease using the ClinicalTrials.gov API, start by going to the Studies section (Figure 4) of the ClinicalTrials.gov REST API and scroll down to the REQUEST section. Put the name of a condition, such as "gall bladder cancer," into the query.cond field.
Click the TRY button at the bottom of the section. It may take a few seconds for the JSON response to be rendered under RESPONSE. On the CURL tab, you can see the command line for the curl utility (Figure 5). You can use this command to automate the data collection.
If you are using another HTTP client, you will need to do an HTTP GET request to the specified URL.
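As one possible illustration, Python's standard library can act as the HTTP client. The sketch below assumes the search endpoint is the /api/v2/studies path seen in the single-study example, with the condition passed as the query.cond parameter. The function names are hypothetical, and the URL shown on the CURL tab remains the authoritative form.

```python
import json
import urllib.parse
import urllib.request

# Assumed search endpoint: the single-study path without an NCT number.
STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def condition_search_url(condition):
    """Build a search URL with the condition URL-encoded into query.cond."""
    query = urllib.parse.urlencode({"query.cond": condition})
    return f"{STUDIES_URL}?{query}"


def search_by_condition(condition, timeout=30):
    """Send an HTTP GET for the condition search and parse the JSON body."""
    url = condition_search_url(condition)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


print(condition_search_url("gall bladder cancer"))
# https://clinicaltrials.gov/api/v2/studies?query.cond=gall+bladder+cancer
```

Encoding the parameters with urlencode avoids hand-escaping spaces in multi-word conditions.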
More information about viewing different pages of study data can be found in the Studies section (Figure 6). If you are requesting a very large amount of data and the number of matching studies exceeds pageSize (the default value is 10), please read the notes about the use of pageToken to learn how to retrieve the complete data set.
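A paging loop might look like the Python sketch below. It assumes that each response carries its results in a studies list and, when more pages remain, a nextPageToken value to send back as pageToken. Those two response field names are assumptions to verify against the API notes. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed search endpoint, as in the single-study example's base path.
STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def http_fetch_page(params, timeout=30):
    """GET one page of results for the given query parameters."""
    url = f"{STUDIES_URL}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def iter_studies(fetch_page, condition, page_size=10):
    """Yield every study for a condition, following page tokens.

    `fetch_page` is any callable that takes a dict of query parameters
    and returns the parsed JSON page, which keeps the loop testable
    without network access.
    """
    token = None
    while True:
        params = {"query.cond": condition, "pageSize": page_size}
        if token:
            params["pageToken"] = token
        page = fetch_page(params)
        # Assumed response fields: "studies" and "nextPageToken".
        yield from page.get("studies", [])
        token = page.get("nextPageToken")
        if not token:
            break
```

For example, list(iter_studies(http_fetch_page, "gall bladder cancer", page_size=100)) would collect all pages into one list. Pausing between requests is a reasonable courtesy for large pulls.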
The ClinicalTrials.gov REST API is publicly available to provide users with metadata and statistics on the most up-to-date version of the clinical studies found on ClinicalTrials.gov. It provides a convenient and easy way to get data from the ClinicalTrials.gov website. This method is preferable to screen scraping techniques, which are far more laborious and less likely to provide the desired results.