Thanks to a tender by The Document Foundation, we have been able to implement a new and much improved curl-based HTTP/WebDAV UCP for LibreOffice 7.3.
This is the technology that sits under almost every public network file access LibreOffice can do, like loading and saving from a NextCloud instance, or working on a file sitting on a SharePoint server.
Please read on for how we implemented that.
What is a UCP?
It is a component that is part of the Universal Content Broker framework, a sort of virtual filesystem abstraction. Here files (or “contents”) are identified by URLs, and there is a provider for every supported URL scheme which is called a Universal Content Provider (UCP). So the UCP is responsible for providing access to contents by a particular protocol, for example, local filesystem, HTTP or FTP.
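The scheme-to-provider lookup described above can be sketched in a few lines. This is only an illustration of the idea, not the UNO API: the `ContentBroker` and `ContentProvider` classes and their methods are hypothetical names invented for this example.

```python
from urllib.parse import urlsplit


class ContentProvider:
    """Hypothetical stand-in for a UCP: serves contents for one protocol."""

    def __init__(self, name):
        self.name = name

    def open(self, url):
        # A real provider would fetch the content; this one only reports who handles it.
        return f"{self.name} provider opens {url}"


class ContentBroker:
    """Hypothetical stand-in for the broker: maps URL schemes to providers."""

    def __init__(self):
        self._providers = {}

    def register(self, scheme, provider):
        self._providers[scheme.lower()] = provider

    def provider_for(self, url):
        scheme = urlsplit(url).scheme.lower()
        try:
            return self._providers[scheme]
        except KeyError:
            raise LookupError(f"no provider registered for scheme {scheme!r}") from None


broker = ContentBroker()
broker.register("file", ContentProvider("file"))
webdav = ContentProvider("webdav")
for scheme in ("http", "https"):
    # One provider can serve several schemes.
    broker.register(scheme, webdav)
```

Calling `broker.provider_for("https://example.org/doc.odt")` returns the shared `webdav` provider, while an unregistered scheme raises `LookupError`.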
HTTP is the ubiquitous transport protocol that is used by the web, and WebDAV (Web Distributed Authoring and Versioning) is layered on top of HTTP, adding methods that provide features such as stored properties and locking.
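To make "additional methods" concrete, here is a minimal sketch of what a WebDAV `PROPFIND` request for stored properties could look like. The method name, the `Depth` header and the `DAV:` namespace come from the WebDAV specification; the helper `build_propfind` is a hypothetical name for this example, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"


def build_propfind(path, prop_names, depth="0"):
    """Return (method, path, headers, body) for a WebDAV PROPFIND request."""
    ET.register_namespace("D", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}propfind")
    prop = ET.SubElement(root, f"{{{DAV_NS}}}prop")
    for name in prop_names:
        # Each requested property is an empty element inside <D:prop>.
        ET.SubElement(prop, f"{{{DAV_NS}}}{name}")
    body = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    headers = {
        "Depth": depth,  # "0" asks about the resource itself only
        "Content-Type": 'application/xml; charset="utf-8"',
        "Content-Length": str(len(body)),
    }
    return "PROPFIND", path, headers, body


method, path, headers, body = build_propfind(
    "/docs/report.odt", ["getlastmodified", "getcontentlength"]
)
```

The same HTTP connection that serves a plain `GET` can carry this request; only the method and the XML body are specific to WebDAV.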
What was the original problem with the HTTP UCP?
Due to its storied history, LibreOffice had some issues with the UCP that was used for the HTTP and WebDAV protocols. There were actually 2 such UCP implementations in the git repository, one inherited from OpenOffice.org based on the neon library, and another imported from the Apache OpenOffice fork based on the Apache Serf library.
But both of these had serious issues:
On the one hand, the neon based HTTP/WebDAV UCP was nice, in that it was very mature code and a lot of effort had gone into making it work robustly. The neon library implements both handling of the low-level HTTP transfer, as well as the higher-level WebDAV aspects such as parsing various XML formats. The licensing though was causing issues for anything the project wanted to put on Apple’s App Store.
On the other hand, the Apache Serf based HTTP/WebDAV UCP was rather poorly maintained and had not even worked since around 2016, when some LibreOffice framework code was changed. The Serf library requires 2 other Apache libraries to build, and it could not easily be updated from the old release because the new release required its own build tool (eventually this problem was worked around by writing LibreOffice-specific makefiles for Serf). Serf also only implements HTTP transfers, so the UCP had to implement WebDAV specific functionality on top. On the plus side, the Serf UCP was licensed such that Apple has no issues with it.
Both of the UCPs have the drawback that they require OpenSSL for encrypted connections. That is far from ideal, as that results in LibreOffice shipping hard-coded trusted CA certificates that are bundled with OpenSSL, and the user doesn’t have any way to influence that.
Thus the idea was born to replace both of the existing HTTP/WebDAV UCPs with a new one, based on libcurl. Why did we pick this library?
- it can use the operating system’s TLS stack on Windows and macOS, and NSS on Linux, so the user can manage CA trust via the operating system’s user interface
- it can be shipped in Apple’s App Store without licensing issues
- it is already used by the UCP for FTP, the UCP for CMIS, and the update check, so no extra weight gets added to LibreOffice installations
What does the HTTP UCP look like?
To a first approximation, there are 3 parts involved in the UCP:
- The upper layer implements the UNO API which is called by LibreOffice, and translates the calls from generic sequence-of-any stringly typed abstractness into HTTP or WebDAV protocol calls, and does some high level protocol handling to figure out what the server supports and so on. This is independent of the low-level library.
- Then there is the lower layer of the UCP, which translates the generic HTTP or WebDAV protocol calls to something that the particular third-party library can understand, hooks up its callbacks for data transfer and authentication, and parses the reply XML documents.
- At the bottom, there is the third-party library that implements the HTTP protocol.
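The separation between the layers can be sketched as follows. This is a loose illustration under stated assumptions, not the actual UCP code: the `Session` interface, the `RecordingSession` backend and the command-to-method table are all simplified, hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class Session(ABC):
    """Lower layer (hypothetical interface): issues protocol calls via some HTTP library."""

    @abstractmethod
    def request(self, method, url, body=b""):
        ...


class RecordingSession(Session):
    """Stand-in backend that records calls instead of touching the network."""

    def __init__(self):
        self.calls = []

    def request(self, method, url, body=b""):
        self.calls.append((method, url, body))
        return b""


class Content:
    """Upper layer: turns generic commands into protocol calls, unaware of the backend."""

    # Simplified, assumed mapping from command names to HTTP/WebDAV methods.
    _METHODS = {
        "open": "GET",
        "insert": "PUT",
        "delete": "DELETE",
        "getPropertyValues": "PROPFIND",
    }

    def __init__(self, url, session):
        self.url = url
        self.session = session

    def execute(self, command, argument=b""):
        try:
            method = self._METHODS[command]
        except KeyError:
            raise ValueError(f"unsupported command: {command}") from None
        return self.session.request(method, self.url, argument)


session = RecordingSession()
content = Content("https://example.org/a.odt", session)
content.execute("insert", b"data")
content.execute("open")
```

Because `Content` only talks to the `Session` interface, swapping the library at the bottom means replacing one backend class while the upper layer stays unchanged.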
In a little more detail, the most important classes are:
- ContentProvider: the UNO entry point/factory, creates Content instances
- Content: the main UNO service, translates the UCP API to WebDAV methods, one instance per URL
- DAVResourceAccess: sits between Content and CurlSession
- DAVSessionFactory: creates CurlSession for DAVResourceAccess
- DAVAuthListener_Impl: requests credentials from the UI via UNO
- CurlSession: low-level interfacing with libcurl
- SerfLockStore: singleton used by CurlSession to store DAV lock tokens, runs a thread to refresh locks when they expire
- WebDAVResponseParser: parses XML responses to LOCK and PROPFIND requests
Most of these classes started as copies from the Serf UCP, except CurlSession and CurlUri which are entirely new.
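As a rough idea of the kind of work a response parser has to do, here is a minimal sketch that extracts properties from a `PROPFIND` multistatus reply. It is not the WebDAVResponseParser code: the function `parse_multistatus` and the sample document are made up for illustration, and only the element names come from the WebDAV specification.

```python
import xml.etree.ElementTree as ET

NS = {"D": "DAV:"}

# Hand-written sample reply, assumed for this example.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/docs/report.odt</D:href>
    <D:propstat>
      <D:prop>
        <D:getcontentlength>12345</D:getcontentlength>
        <D:getetag>"abc123"</D:getetag>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""


def parse_multistatus(xml_bytes):
    """Map each href to the properties the server reported with a 200 status."""
    result = {}
    root = ET.fromstring(xml_bytes)
    for response in root.findall("D:response", NS):
        href = response.findtext("D:href", default="", namespaces=NS).strip()
        props = result.setdefault(href, {})
        for propstat in response.findall("D:propstat", NS):
            status = propstat.findtext("D:status", default="", namespaces=NS).split()
            if len(status) < 2 or status[1] != "200":
                continue  # skip properties the server could not return
            for prop in propstat.findall("D:prop/*", NS):
                local_name = prop.tag.rsplit("}", 1)[-1]
                props[local_name] = (prop.text or "").strip()
    return result


properties = parse_multistatus(SAMPLE)
```

A real parser also has to cope with multiple responses, non-200 propstat blocks, and lock discovery elements in `LOCK` replies, which this sketch leaves out.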
How did we implement the curl based UCP?
First, we started by copying the upper-level part of the Serf UCP. Next, because this code had not really been maintained in years, we had to fix a lot of static analysis warnings from LO’s clang plugin. Then we wrote the low-level part that interfaces with libcurl. At that point, it was possible to fetch a document, but storing still failed, because of the lack of maintenance of the Serf UCP.
So next we had to go through all of the changes in the neon UCP that happened since the fork, and cherry-pick the relevant ones to the curl UCP. There were about 450 commits to review, many more than initially expected. After checking that they were covered by licensing statements from the authors, we cherry-picked about 85 of them, many of which were written by Giuseppe Castagno (who, as it happens, is now the #1 author of commits to webdav-curl; thanks, Giuseppe, for all this excellent work!).
In some cases the commits conflicted with other changes in the Serf UCP, and sometimes it was faster to re-implement them than to resolve the conflicts. Many commits were also trivial in nature, so we skipped them; there may be some style-cleanup work to do still.
We also decided to omit some features of dubious value. In particular, popping up a dialog to ask the user whether they want to accept an invalid server certificate no longer seems appropriate. Instead, since curl uses the system certificate store, power users or system administrators can now use the OS built-in mechanisms to roll out organisation-private or self-signed certificates.
After that initial implementation work, we tested it with some popular WebDAV servers.
First, running the curl UCP against Apache httpd with mod_dav found some bugs in the new code, and we found the server to behave very reasonably.
Next we tried NextCloud, and this uncovered a few new bugs, but the server also had a nasty surprise: storing a file with chunked transfer encoding resulted in silent data loss.
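For readers unfamiliar with it, chunked transfer coding sends a body as a series of length-prefixed pieces instead of announcing the total size up front in a `Content-Length` header. The sketch below only illustrates that HTTP/1.1 framing; the helper names `encode_chunked` and `decode_chunked` are invented for this example, and it says nothing about how the UCP uploads data in the end.

```python
def encode_chunked(chunks):
    """Frame an iterable of byte strings using HTTP/1.1 chunked transfer coding."""
    out = bytearray()
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would terminate the body early
        # Each chunk: size in hex, CRLF, the data, CRLF.
        out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk, followed by an empty trailer section
    return bytes(out)


def decode_chunked(data):
    """Inverse of encode_chunked; ignores chunk extensions and trailers."""
    body, pos = bytearray(), 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size = int(data[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return bytes(body)
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF after the chunk data


wire = encode_chunked([b"Hello, ", b"world"])
```

A receiver that mishandles this framing can end up storing a truncated or empty file without reporting an error, which is why such a failure is easy to miss.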
Finally we tried SharePoint 2016, and this required quite a few changes in the code that were utterly surprising. SharePoint’s implementation of the WebDAV protocol appears to be very peculiar, with lots of special-casing needed.
The status now
The new curl based UCP is shipping with the new LO 7.3 release, and the code of the 2 previous UCPs and their external library dependencies has been deleted, so they can no longer be built in LO 7.4. This has removed a net 17700 lines of code.
A few more bugs were found by LibreOffice community members & QA, and subsequently fixed.
Some users claim that we’re now faster than the neon UCP, although to be honest it is unclear why that would be the case: tdf#42742
And it turns out that there are servers out there that just don’t want to talk to curl, which is rather sad: tdf#146460
In closing, we would like to thank The Document Foundation, and their many donors, for sponsoring this project!