A Proquest platform API

We subscribe to a number of databases via Proquest.

I wanted an API that would let my software execute fielded searches against a Proquest database (specifically Dissertations and Theses, in my current use case) and get back structured, machine-interpretable results.

I had vaguely remembered hearing about such an API, but was having trouble finding any info about it.

It turns out that such an API does exist, even though you’ll have trouble finding any documentation about it, or even any evidence on the web that it exists, and you’ll have trouble getting information about it from Proquest support too. Hooray.

You may occasionally see it called the “XML Gateway” in some Proquest documentation materials (although Proquest support doesn’t necessarily know this term). It was probably intended for and used by federated search products, which makes me realize: oh yeah, if I have any database that’s used by a federated search product, then it’s probably got some kind of API.

And it’s an SRU endpoint.

(Proquest may also support Z39.50, but at least some Proquest docs suggest they recommend you transition to the “XML Gateway” instead of Z39.50, and I personally find it easier to work with than Z39.50.)

Here’s an example query:

http://fedsearch.proquest.com/search/sru/pqdtft?operation=searchRetrieve&version=1.2&maximumRecords=30&startRecord=1&query=title%3D%22global%20warming%22%20AND%20author%3DCastet

For me, coming from an IP address recognized as ‘on campus’ for our general Proquest access, no additional authentication is required to use this API. I’m not sure if we at some point prior had them activate the “XML Gateway” for us, likely for a federated search product, or if it’s just this way for everyone.

The path component after “/sru”, “pqdtft”, is the database code for Proquest Dissertations and Theses. I’m not sure where you find a list of these database codes in general. If you’ve made a successful API request to that endpoint, there will be a <diagnosticMessage> element near the end of the response listing all database codes you have access to (but without corresponding full English names, so you kind of have to guess).
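Here’s a minimal sketch of pulling that message out of a response in Python. I don’t know the exact nesting or wording Proquest uses, so this only matches on the element name `diagnosticMessage` wherever it appears and hands back the raw text; the `SAMPLE` document is a made-up stand-in, not a real response.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def diagnostic_messages(xml_text):
    """Return the text of every diagnosticMessage element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (el.text or "").strip()
        for el in root.iter()
        if local_name(el.tag) == "diagnosticMessage"
    ]


# Made-up stand-in for the tail of a response; the real structure may differ.
SAMPLE = """<searchRetrieveResponse>
  <diagnosticMessage>example message mentioning pqdtft</diagnosticMessage>
</searchRetrieveResponse>"""

print(diagnostic_messages(SAMPLE))
```

You’d still have to read the message text yourself to pick out the codes, since I haven’t tried to guess at its format.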

The value of the ‘query’ parameter is a valid CQL query, as usual for SRU. It can be a bit tricky figuring out how to express what you want in CQL, but the CQL standard docs are decent, if you spend a bit of time with them to learn CQL.

Unfortunately, there seems to be no SRU “explain” response available from Proquest to tell you what fields and operators are available. But guessing often works: “title”, “author”, and “date” are all available. I’m not sure exactly how ‘date’ works and need to experiment more, although a query like `date > 1990 AND date <= 2010` appears, at first glance, to work.
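If you’re assembling these queries in code, a small helper keeps the quoting straight. This is a sketch in Python; `cql_term`, `cql_clause` and `cql_and` are names I made up, and the quoting follows my reading of the CQL rules (quote a term that is empty or contains whitespace or special characters, backslash-escape embedded quotes). Which fields and relations Proquest actually honors is still guesswork.

```python
import re

# Characters that force a CQL term to be wrapped in double quotes.
_NEEDS_QUOTES = re.compile(r'[\s"<>=/()]')


def cql_term(value):
    """Quote a search term only when CQL requires it."""
    text = str(value)
    if text == "" or _NEEDS_QUOTES.search(text):
        text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return text


def cql_clause(index, relation, value):
    """Build one 'index relation term' clause, e.g. date > 1990."""
    return f"{index} {relation} {cql_term(value)}"


def cql_and(*clauses):
    """Join clauses with the CQL boolean AND."""
    return " AND ".join(clauses)


query = cql_and(
    cql_clause("date", ">", 1990),
    cql_clause("date", "<=", 2010),
)
print(query)
```

The result still has to be URL-escaped before it goes into the `query` parameter.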

The CQL query param above un-escaped is:

title="global warming" AND author=Castet
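To go the other direction, from a readable CQL string to the escaped URL, something like this Python sketch should do. The function name `build_sru_url` and its defaults are mine; the endpoint and parameters are just the ones from the example query above.

```python
from urllib.parse import quote, urlencode

SRU_BASE = "http://fedsearch.proquest.com/search/sru"


def build_sru_url(db_code, cql, start_record=1, maximum_records=30):
    """Build a searchRetrieve URL for one database code and one CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "maximumRecords": maximum_records,
        "startRecord": start_record,
        "query": cql,
    }
    # quote_via=quote escapes spaces as %20 (the default would use "+").
    return f"{SRU_BASE}/{db_code}?{urlencode(params, quote_via=quote)}"


url = build_sru_url("pqdtft", 'title="global warming" AND author=Castet')
print(url)
```

Bumping `start_record` by `maximum_records` on each request would be the obvious way to page through a longer result set, though I haven’t checked how far Proquest lets you go.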

Responses seem to be in MARCXML, and that seems to be the only option.
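Since it’s MARCXML, the Python standard library is enough to get at the fields. A rough sketch, assuming each MARC record is a `record` element holding `controlfield` and `datafield` children, as in standard MARCXML; the `SAMPLE` response is invented for illustration, and I’m matching on local element names so it doesn’t matter exactly which namespaces Proquest uses.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def marc_records(root):
    """Find MARC records: 'record' elements that directly hold MARC fields.

    The SRU wrapper also uses an element called 'record', so checking for
    MARC field children avoids counting each hit twice.
    """
    return [
        el for el in root.iter()
        if local_name(el.tag) == "record"
        and any(local_name(c.tag) in ("controlfield", "datafield") for c in el)
    ]


def subfield(record, tag, code):
    """Return the first matching subfield value in a record, or None."""
    for field in record.iter():
        if local_name(field.tag) == "datafield" and field.get("tag") == tag:
            for sub in field:
                if local_name(sub.tag) == "subfield" and sub.get("code") == code:
                    return (sub.text or "").strip()
    return None


def parse_records(xml_text):
    """Pull title (245 $a) and author (100 $a) out of each MARC record."""
    root = ET.fromstring(xml_text)
    return [
        {"title": subfield(r, "245", "a"), "author": subfield(r, "100", "a")}
        for r in marc_records(root)
    ]


# Invented sample; a real response will carry many more fields.
SAMPLE = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:records><zs:record><zs:recordData>
    <record xmlns="http://www.loc.gov/MARC21/slim">
      <controlfield tag="001">000000000</controlfield>
      <datafield tag="100" ind1="1" ind2=" ">
        <subfield code="a">Example, Author</subfield>
      </datafield>
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">An example title</subfield>
      </datafield>
    </record>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""

print(parse_records(SAMPLE))
```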

It looks like you can tell whether full text is available (on the Proquest platform) for a given item based on whether there’s an 856 field with second indicator set to “0”; that field will hold a URL to the full text. I think. It looks like.
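In Python that check might look like the sketch below. It assumes the URL lives in subfield `u` of the 856 field, which is where MARC normally puts it; the sample record and its example.org URLs are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def fulltext_urls(record):
    """Collect $u values from 856 fields whose second indicator is '0'."""
    urls = []
    for field in record.iter():
        if (
            local_name(field.tag) == "datafield"
            and field.get("tag") == "856"
            and field.get("ind2") == "0"
        ):
            for sub in field:
                if local_name(sub.tag) == "subfield" and sub.get("code") == "u" and sub.text:
                    urls.append(sub.text.strip())
    return urls


# Invented record: one 856 flagged as full text, one that is not.
SAMPLE_RECORD = ET.fromstring("""<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2="0">
    <subfield code="u">http://example.org/fulltext</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2="1">
    <subfield code="u">http://example.org/citation</subfield>
  </datafield>
</record>""")

print(fulltext_urls(SAMPLE_RECORD))
```

An empty list would then mean no full text on the platform for that item, if my reading of the indicator is right.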

Did I mention if there are docs for any of this, I don’t have them?

So, there you go, a Proquest search API!

I also posted this to the code4lib listserv, and got some more useful details and hints from Andrew Anderson.

Oh, and if you want to link to a document you found this way, one way that seems to work is to take the Proquest document ID from the MARC 001 field in the response and construct a URL like `http://search.proquest.com/pqdtft/docview/$DOCID$`. It links to full text if it’s available, otherwise to a citation page. Note the `pqdtft` code in the URL, again meaning ‘Proquest Dissertations and Theses’, the same db I was searching to find the doc ID.
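As a Python sketch, with `document_id` and `docview_url` being names I picked and the ID in the sample record being a placeholder rather than a real document:

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def document_id(record):
    """Return the text of the MARC 001 control field, or None if absent."""
    for field in record.iter():
        if local_name(field.tag) == "controlfield" and field.get("tag") == "001":
            return (field.text or "").strip() or None
    return None


def docview_url(doc_id, db_code="pqdtft"):
    """Build a docview link following the URL pattern described above."""
    return f"http://search.proquest.com/{db_code}/docview/{quote(str(doc_id), safe='')}"


# Placeholder record; 123456789 is not a real Proquest document ID.
SAMPLE_RECORD = ET.fromstring("""<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">123456789</controlfield>
</record>""")

print(docview_url(document_id(SAMPLE_RECORD)))
```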

5 thoughts on “A Proquest platform API”

Specifically asking Proquest support for the “Federated-Search.docx” document, as suggested by Andrew Anderson, did get me that document.

It mostly doesn’t tell us anything we hadn’t already figured out above, but has a few useful expansions and new details.

One of them is that apparently there are special database codes for searching certain pre-selected groupings of multiple databases at once, like “all_subscribed”, and “subject_arts”. I don’t need that for my present use case, but it is potentially crucial for others; I did try “all_subscribed” and it appeared to work. Neat!

It also includes the list of search fields it says are supported, and confirms that any MARC 856 fields included will have their second indicator set appropriately, such that “0” is always full text and “1” never is.

Probably worth asking for this document if you plan to develop against the SRU API.

Hi,
My name is Elliot and I am a student at Columbia. I am doing research on Citibank shareholders in the 1920s at the business school.

I came across your blog while searching for the proquest API.

I am trying to mass search shareholder names and obtain obituaries on them. I simply need the document id or document identifying information.

I noticed that you were able to mass search documents in Proquest. I have been trying to do this using Python and haven’t been successful. First of all, it isn’t exactly clear how to access their API.

Can you provide some guidance on using the summon API to mass search documents? My programming background is in Python and R, although my background is in social science so I am not really familiar with the technical aspects of the API.

I would greatly appreciate any feedback. Thank you.

It’s 2021 and this information is *still* helpful (and necessary from the lack of info on a Proquest API that I’ve been able to find), haha. Thank you so so much for this blog post. I’ve been searching for a way to access the search results on Proquest for days.

~A desperate UChicago Researcher

Kat, were you able to get access to the Proquest “XML Gateway” somehow? Please let us know if so and how you got them to give it to you! Did you go through your library?

Hi Jonathan! Sorry it took so long to get back, I was revisiting this post and just noticed your reply. Somehow didn’t get any email about your comment. I just needed metadata info from search result pages like that which you included as the example above.

(http://fedsearch.proquest.com/search/sru/pqdtft?operation=searchRetrieve&version=1.2&maximumRecords=30&startRecord=1&query=title%3D%22global%20warming%22%20AND%20author%3DCastet)

I had no authentication issue accessing (and scraping) the XML from these. I have a list of authors I needed dissertation search results for (with a set of subjects as additional query params), so it was just a matter of constructing URLs based on the given endpoint. I have no additional Proquest access beyond what you detail above (unless you count academic VPN).

Hope that answers your question!

Slight sidenote:
I did end up going through and downloading PDFs of the search results, but I did that by running a script going through the links provided in the XML from the fedsearch endpoint. It threw a few captchas here and there, but not that many.

Published by jrochkind
