Is it normal to get <html> tags in title metadata?


In some records, we get tags in the title metadata. Is that an error? E.g.:

api.crossref.org/v1/works/10.1002/ajpa.24488

“title”: [“A population history of indigenous\n Bahamian\n islanders: Insights from ancient\n DNA”],

Is there a way to get only raw text with the CrossRef API?


Hi Fred,

It’s not especially common, but it is allowed.

We support certain face markup within the metadata that publishers supply for their registered content.

It’s up to each publisher when and whether they opt to supply those markup tags.

There’s not a way to query the API such that you’ll get back only the text, without tags. You’d have to clean up the data after the fact to strip them out, if that’s what you wanted.
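As a rough illustration of that after-the-fact cleanup, here is a minimal Python sketch. The function name `strip_markup` and the sample title are made up for this example, and the regex assumes the tags are simple element tags. It is not an official Crossref utility.

```python
import html
import re

# Matches opening and closing element tags such as <i>, </sub>, or <mml:math ...>.
# A bare "<" followed by a space or digit (as in "p < 0.05") is left alone.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_markup(title: str) -> str:
    """Return the title as plain text: tags removed, entities decoded."""
    text = TAG_RE.sub("", title)   # drop the tags, keep their contents
    text = html.unescape(text)     # decode entities such as &amp;
    return " ".join(text.split())  # collapse newlines and runs of spaces


# Hypothetical title, not taken from a real record.
print(strip_markup("Effects of <i>E. coli</i> on H<sub>2</sub>O &amp; CO<sub>2</sub>"))
```

Tags are removed before entities are decoded, so text that was deliberately escaped (for example `&lt;i&gt;`) is kept as literal characters rather than stripped.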


Hi Shayn,

I had the same question. Thanks for the link and confirmation. I wonder if a full specification is available of what kind of face markup is permitted.

For example, a BibTeX query for the paper with DOI 10.1002/2015gl067329 gives me the title:

	title = {An automatically updated
		            -wave model of the upper mantle and the depth extent of azimuthal anisotropy},

Notice that in order to process this I would need to first decode the LaTeX and then decode the HTML tags, in that specific order. The face markup docs mention only HTML entities and MathML, not arbitrary LaTeX on top of that. I wonder if the face markup could be better constrained.

Could a specification for the permitted face markup perhaps even be used to implement content negotiation in a way that “application/x-bibtex” queries would always return metadata in LaTeX? I suppose this would require a translation layer between the HTML(?)-based face markup and an equivalent LaTeX representation. I appreciate that this would not be trivial, but it would greatly improve the quality of automated bibliography generation.

Hi, and thanks for your feedback.

The permitted markup, and the metadata elements where it’s allowed, are described in our documentation.

It’s relatively minimal, just bold (b), italic (i), underline (u), over-line (ovl), superscript (sup), subscript (sub), small caps (scp), and typewriter text (tt). And they can only occur in titles and citations.

I’m not sure, but I can pass the suggestion along to our technical team and the API product manager. Content negotiation is an especially complicated tool to make updates or improvements to, because it’s a collaboration between three organizations.
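In the meantime, the translation layer suggested earlier in the thread could be approximated on the client side. The Python sketch below maps the eight face markup tags listed above to LaTeX commands. The choice of commands, the function name `face_markup_to_latex`, and the sample title are all assumptions made for illustration. It is not something the Crossref API or content negotiation does today.

```python
import re

# Assumed LaTeX equivalents for the eight permitted face markup tags.
# Each entry is (text emitted for the opening tag, text emitted for the closing tag).
LATEX = {
    "b": (r"\textbf{", "}"),
    "i": (r"\textit{", "}"),
    "u": (r"\underline{", "}"),
    "sup": (r"\textsuperscript{", "}"),
    "sub": (r"\textsubscript{", "}"),
    "scp": (r"\textsc{", "}"),
    "tt": (r"\texttt{", "}"),
    # \overline only works in math mode, so the text is wrapped in $...$.
    "ovl": (r"$\overline{\mathrm{", "}}$"),
}

TAG_RE = re.compile(r"<(/?)(b|i|u|ovl|sup|sub|scp|tt)>")


def face_markup_to_latex(title: str) -> str:
    """Replace face markup tags with the assumed LaTeX commands."""
    def repl(match: re.Match) -> str:
        opener, closer = LATEX[match.group(2)]
        return closer if match.group(1) else opener

    return TAG_RE.sub(repl, title)


# Hypothetical title, not taken from a real record.
print(face_markup_to_latex("H<sub>2</sub>O uptake in <i>Zea mays</i>"))
```

This sketch does not escape LaTeX special characters (such as `&`, `%`, or `_`) in the surrounding text, decode HTML entities, or handle MathML, so real titles would need further processing.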