libcurl is never finished and is not just an off-the-shelf product. It is a living project that is improved and modified on almost a daily basis. We depend on skilled and interested hackers to fix bugs and to add features.
This chapter describes internal details to help keen libcurl hackers learn some basic concepts of how libcurl works internally, and thus where to look for problems or where to add things when you want to make the library do something new.
When reading the source code there are some useful basics that are good to know and keep in mind:
data is the variable name we use all over to refer to the easy handle (struct Curl_easy) for the transfer being worked on. No other name should be used for this and nothing else should use this name.
conn is the variable name we use all over the internals to refer to the current connection the code works on (struct connectdata). A transfer typically uses a connection at some point and typically only one at a time. There's a conn->data pointer that identifies the transfer that is currently working on this connection. A single connection can be reused over time by several transfers (and thus easy handles) and a single connection can also be used by several easy handles simultaneously when multiplexed connections are used. When multiplexing is used, the conn->data pointer has to be updated accordingly quite frequently.
result is the usual name we use for a CURLcode variable to hold the return values from functions and if that return value is different than zero, it is an error and the function should clean up and return (usually passing on the same error code to its parent function).
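The result convention can be illustrated with a small stand-alone sketch. Nothing here is libcurl code: xfer_code, xfer_send() and xfer_request() are hypothetical names made up for this example, and the enum only imitates the "zero means OK" property of CURLcode.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CURLcode: zero means success. */
typedef enum { XFER_OK = 0, XFER_OUT_OF_MEMORY, XFER_SEND_ERROR } xfer_code;

/* Hypothetical helper: pretend that sending an empty string fails. */
static xfer_code xfer_send(const char *bytes)
{
  return *bytes ? XFER_OK : XFER_SEND_ERROR;
}

/* Copies the path into a scratch buffer and "sends" it in two steps.
   On any non-zero result it cleans up and passes the same code on. */
static xfer_code xfer_request(const char *path)
{
  xfer_code result;
  char *buf;

  assert(path);
  buf = malloc(strlen(path) + 1);
  if(!buf)
    return XFER_OUT_OF_MEMORY;
  strcpy(buf, path);

  result = xfer_send(buf);
  if(result) {
    free(buf);      /* clean up ... */
    return result;  /* ... and hand the same code to the parent */
  }

  result = xfer_send("\r\n");
  free(buf);
  return result;
}
```

A caller of xfer_request() would treat its return value the same way: store it in a variable named result and return early if it is not zero.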
libcurl offers a few different APIs to do transfers. The primary difference is between the synchronous easy interface and the non-blocking multi interface. The multi interface can in turn be used either through the event-driven socket interface or through the "normal" perform interface.
Internally however, everything is written for the event-driven interface. Everything needs to be written in a non-blocking fashion so that functions never wait for data in a loop or similar, unless they are the "surface" functions that have blocking as their expressed functionality.
The function curl_easy_perform(), which performs a single transfer synchronously, is itself just a wrapper function that internally sets up and uses the multi interface.
To facilitate that non-blocking nature, the curl source is full of state machines. A function works on as much data as there is and drives the state machine as far as it can go based on what's available. It then returns and continues from that point later on, when more data arrives that might drive the state machine further.
There are such states at many different levels for a given transfer and the code for each particular protocol may have its own set of state machines.
One of the primary states is the main transfer "mode" the easy handle holds, which says if the current transfer is resolving, waiting for a resolve, connecting, waiting for a connect, issuing a request, doing a transfer, etc. (see the CURLMstate enum in lib/multihandle.h). Every transfer done with libcurl has an associated easy handle and every easy handle will exercise that state machine.
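The "drive the state machine as far as it can go, then return" idea can be sketched in a few lines of C. This is not how libcurl implements it: the state names, struct sketch_events and sketch_step() are all hypothetical and far simpler than the real CURLMstate machine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, loosely modelled on the phases named above. */
typedef enum {
  SKETCH_RESOLVING,
  SKETCH_CONNECTING,
  SKETCH_REQUESTING,
  SKETCH_TRANSFERRING,
  SKETCH_COMPLETED
} sketch_state;

/* What has become available so far (hypothetical). */
struct sketch_events {
  bool resolved;   /* name resolution finished */
  bool connected;  /* socket is connected */
  bool body_done;  /* the whole response has arrived */
};

struct sketch_transfer {
  sketch_state state;
};

/* Advance as far as the available events allow, then return without
   blocking. Returns the number of state changes made in this call. */
static int sketch_step(struct sketch_transfer *t,
                       const struct sketch_events *ev)
{
  int moves = 0;
  assert(t && ev);
  for(;;) {
    sketch_state before = t->state;
    switch(t->state) {
    case SKETCH_RESOLVING:
      if(ev->resolved)
        t->state = SKETCH_CONNECTING;
      break;
    case SKETCH_CONNECTING:
      if(ev->connected)
        t->state = SKETCH_REQUESTING;
      break;
    case SKETCH_REQUESTING:
      /* assumption: sending the request never blocks in this sketch */
      t->state = SKETCH_TRANSFERRING;
      break;
    case SKETCH_TRANSFERRING:
      if(ev->body_done)
        t->state = SKETCH_COMPLETED;
      break;
    case SKETCH_COMPLETED:
      break;
    }
    if(t->state == before)
      return moves; /* nothing more to do until new data arrives */
    moves++;
  }
}
```

The caller invokes sketch_step() again whenever something new has happened. A blocking "surface" function would be the one place that loops around such a step function and waits in between calls.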
libcurl is a multi-protocol transfer library. The core of the code is a set of generic functions that are used for transfers in general and will mostly work the same for all protocols. The main state machine described above for example is there and works for all protocols - even though some protocols may not make use of all states for all transfers.
However, each different protocol libcurl speaks also has its unique particularities and specialties. In order to not have the code littered with conditions in the style "if the protocol is XYZ, then do...", we instead have the concept of Curl_handler. Each supported protocol defines one of those, and in lib/url.c there's an array of pointers to such handlers called protocols[].
When a transfer is about to be done, libcurl parses the URL it is about to operate on and among other things it figures out what protocol to use. Normally this can be done by looking at the scheme part of the URL. For https://example.com that is https and for imaps://example.com it is imaps. Using the provided scheme, libcurl sets the conn->handler pointer to the handler struct for the protocol that handles this URL.
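A rough sketch of such a scheme lookup is shown here. It is only an illustration: struct scheme_handler, scheme_protocols[] and scheme_find() are hypothetical names, the struct is much smaller than the real Curl_handler, and the comparison is case-sensitive for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much reduced stand-in for a protocol handler. */
struct scheme_handler {
  const char *scheme;  /* lowercase scheme name */
  int default_port;
};

static const struct scheme_handler handler_https = { "https", 443 };
static const struct scheme_handler handler_imaps = { "imaps", 993 };

/* NULL-terminated array of pointers to the known handlers. */
static const struct scheme_handler * const scheme_protocols[] = {
  &handler_https,
  &handler_imaps,
  NULL
};

/* Return the handler whose scheme equals the part of the URL in front
   of "://", or NULL if the URL has no scheme or it is not supported. */
static const struct scheme_handler *scheme_find(const char *url)
{
  const char *sep;
  size_t len;
  size_t i;

  assert(url);
  sep = strstr(url, "://");
  if(!sep)
    return NULL;
  len = (size_t)(sep - url);

  for(i = 0; scheme_protocols[i]; i++) {
    const struct scheme_handler *h = scheme_protocols[i];
    if(strlen(h->scheme) == len && !strncmp(h->scheme, url, len))
      return h;
  }
  return NULL;
}
```

With this in place, the equivalent of setting conn->handler is a single assignment of the pointer returned by scheme_find().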
The handler struct contains a set of function pointers, each of which can be NULL or set to point to a protocol-specific function that does things necessary for that protocol to work for a transfer, things that not all other protocols need. The handler struct also sets up the name of the protocol and describes its feature set with a bitmask.
A libcurl transfer is built around a set of different "actions" and the handler can extend each of them. Here are some example function pointers in this struct and how they are used:
If a connection cannot be reused for a transfer, libcurl needs to set up a connection to the host given in the URL, and when it does, it also calls the protocol handler's setup function if there is one. Like this:
if(conn->handler->setup_connection)
  result = conn->handler->setup_connection(conn);
After a connection has been established, the handler's connect function gets called if there is one:
if(conn->handler->connect_it)
  result = conn->handler->connect_it(conn, &done);
"Do" is simply the action that issues a request for the particular resource the URL identifies. Every protocol has a do action, so this function must be provided:
result = conn->handler->do_it(conn, &done);
When a transfer is completed, the "done" action is taken:
result = conn->handler->done(conn);
When the connection is about to be taken down, the disconnect function is called:
result = conn->handler->disconnect(conn, dead_connection);
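To show how optional and mandatory function pointers play together, here is a minimal stand-alone sketch. It is not libcurl's code: struct action_handler, struct action_conn and run_actions() are hypothetical. In libcurl these calls are spread over the state machine rather than made from one function, and the NULL check on the done pointer is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct action_conn;

/* Hypothetical handler with a few of the actions named above.
   All functions return 0 for success, imitating CURLcode. */
struct action_handler {
  int (*setup_connection)(struct action_conn *conn);        /* optional */
  int (*connect_it)(struct action_conn *conn, bool *done);  /* optional */
  int (*do_it)(struct action_conn *conn, bool *done);       /* required */
  int (*done)(struct action_conn *conn);      /* optional in this sketch */
};

struct action_conn {
  const struct action_handler *handler;
  int calls;  /* counts how many "do" actions have run */
};

static int example_do(struct action_conn *conn, bool *done)
{
  conn->calls++;
  *done = true;
  return 0;
}

static int example_setup_fails(struct action_conn *conn)
{
  (void)conn;
  return 7;  /* arbitrary non-zero error code */
}

/* A protocol that only needs the mandatory "do" action. */
static const struct action_handler minimal_handler = {
  NULL, NULL, example_do, NULL
};

/* A protocol whose setup step always fails. */
static const struct action_handler failing_handler = {
  example_setup_fails, NULL, example_do, NULL
};

/* Run the actions in order and stop at the first error. */
static int run_actions(struct action_conn *conn)
{
  const struct action_handler *h;
  bool protocol_done = false;
  int result = 0;

  assert(conn && conn->handler && conn->handler->do_it);
  h = conn->handler;

  if(h->setup_connection) {
    result = h->setup_connection(conn);
    if(result)
      return result;
  }
  if(h->connect_it) {
    result = h->connect_it(conn, &protocol_done);
    if(result)
      return result;
  }
  result = h->do_it(conn, &protocol_done);
  if(result)
    return result;
  if(h->done)
    result = h->done(conn);
  return result;
}
```

The generic code in run_actions() never asks which protocol it is dealing with. A protocol that needs nothing special for a given action simply leaves that pointer as NULL.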