curl can do almost every HTTP operation and transfer your favorite browser can. It can actually do a lot more than that as well, but in this chapter we will focus on the fact that you can use curl to reproduce, or script, what you would otherwise have to do manually with a browser.
Here are some tricks and advice on how to proceed when doing this.
Figuring out what the browser does is really a necessary first step. Second-guessing what it does risks having you chase the wrong problem down a rat-hole. The scientific approach to this problem pretty much requires that you first understand what the browser does.
To learn what the browser does to perform a certain task, you can read the HTML pages that you operate on. With deep enough knowledge, you can see what a browser would do to accomplish the task and then start trying to do the same with curl.
The slightly more effective way, which also works when the page is chock-full of obfuscated JavaScript, is to run the browser and monitor what HTTP operations it performs.
The Copy as curl section describes how you can record a browser's request and easily convert that to a curl command line.
Those copied curl command lines are often not good enough though, since they tend to copy exactly that request, while you probably want to be a fair bit more dynamic so that you can reproduce the same operation and not just resend the verbatim request.
A lot of the web today works with a user name and password login prompt somewhere. In many cases you even logged in a while ago with your browser but it has kept the state and keeps you logged in.
The logged-in state is almost always done by using cookies. A common operation would be to first log in and save the returned cookies in a file, and then let the site update the cookies in the subsequent command lines when you traverse the site with curl.
The site at https://example.com/ features a login prompt. The login on the web site is an HTML form to which you send an HTTP POST. Save the response cookies and the response (HTML) output.
Although the login page is visible (if you'd use a browser) on https://example.com/, the HTML form tag on that page informs you about which exact URL to send the POST to, using the action parameter.
In our imaginary case, the form tag looks like this:
<form action="login.cgi" method="POST">
  <input type="text" name="user">
  <input type="password" name="secret">
  <input type="hidden" name="id" value="bc76">
</form>
There are three fields of importance: user, secret and id. The last one, the id, is marked hidden, which means that it will not show up in the browser and it is not a field that a user fills in. It is generated by the site itself, and for your curl login to succeed, you need to extract that value and use it in your POST submission together with the rest of the data.
Send the correct contents for those fields to the correct destination URL:
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi -o out
Many login pages even send you a session cookie already when presenting the login, and since you often need to extract the hidden fields from the <form> tag anyway, you could do something like this first:
curl -c cookies https://example.com/ -o loginform
You would often need an HTML parser or some scripting language to extract the id field from there, and then you can proceed and log in as mentioned above, but with the added cookie loading (I'm splitting the line into two lines to make it more readable):
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
  -b cookies -c cookies -o out
You can see that it uses both -b for reading cookies from the file and -c to store cookies again, for when the server sends back updated cookies.
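The extraction step can be sketched in shell. This is a minimal sketch, not a real HTML parser: the function name extract_id is made up for this illustration, and it assumes the hidden field is written with name="id" before value="..." inside a single tag, as in the imaginary form above. A site that formats its form differently needs a different pattern or a proper parser.

```shell
# Hypothetical helper: print the value of the hidden "id" field found in
# the saved HTML file given as $1.
# Assumption: the tag has name="id" followed by value="..." on one line.
extract_id() {
  sed -n 's/.*name="id"[^>]*value="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage sketch (commented out because example.com is an imaginary login site):
#   curl -c cookies https://example.com/ -o loginform
#   id=$(extract_id loginform)
#   curl -d user=daniel -d secret=qwerty --data-urlencode "id=$id" \
#     https://example.com/login.cgi -b cookies -c cookies -o out
```

The sketch uses --data-urlencode for the extracted value so that an id containing characters such as & or = is still sent intact. For a plain value like bc76, -d "id=$id" does the same thing.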
Always, always, add -v to the command lines when working out the details. See also the verbose section for more details on that.
It is common for servers to use redirects when responding to a login POST. It is so common I would probably say it is rare that it is not solved with a redirect.
You then just need to remember that curl does not follow redirects automatically. You need to instruct it to do this by adding the -L command line option. Adding that to the previous command line then makes the full one look like:
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
  -b cookies -c cookies -L -o out
In the above example command lines, we save the login response output in a file named 'out', and in your script you should probably verify that it contains some text that confirms that the login was successful.
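Such a check could look like the following minimal sketch. The function name login_succeeded and the marker text "Welcome back" are assumptions made for this illustration. Use a string that your site only shows to logged-in users.

```shell
# Hypothetical check: succeed if the saved response file ($1) contains the
# fixed marker string ($2). -F matches the marker literally, -q keeps grep quiet.
login_succeeded() {
  grep -q -F -- "$2" "$1"
}

# Usage sketch, after a login command line that ends with "-o out":
#   if login_succeeded out "Welcome back"; then
#     echo "logged in"
#   else
#     echo "login failed" >&2
#   fi
```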
Once successfully logged in, get the files or perform the HTTP operations you need, and remember to keep using both -b and -c on the command lines to use and update the cookies.
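In a script, one way to avoid forgetting those two options is a small wrapper function. This is only a sketch: the name session_curl and the COOKIE_JAR variable are invented for this example, and the default file name matches the cookies file used in the command lines above.

```shell
# Hypothetical wrapper: run curl with the shared cookie file so that every
# request sends the stored cookies (-b) and saves any updates (-c).
COOKIE_JAR=${COOKIE_JAR:-cookies}

session_curl() {
  curl -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$@"
}

# Usage sketch (the path below is made up):
#   session_curl https://example.com/members/report.txt -o report.txt
```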
Some sites will check that the Referer: is actually identifying the legitimate parent URL when you request something or when you log in or similar. You can then inform the server from which URL you arrived by using -e https://example.com/ etc. Appending that to the previous login attempt then makes it:
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
  -b cookies -c cookies -L -e "https://example.com/" -o out