Scripting browser-like tasks

                              curl can do almost every HTTP operation and
                                    transfer your favorite browser can. It can
                                    actually do a lot more than that as well, but
                                    in this chapter we will focus on the fact
                                    that you can use curl to reproduce, or
                                    script, what you would otherwise have to do
                                    manually with a browser.
                            
                              Here are some tricks and advice on how to
                                    proceed when doing this.
                            
                                Figure out what the browser does
                                      
                              This is really a necessary first step.
                                    Guessing what it does risks sending you
                                    down the wrong rat-hole. The scientific
                                    approach to this problem pretty much
                                    requires that you first understand what
                                    the browser does.
                            
                              To learn what the browser does to perform a
                                    certain task, you can read the HTML pages
                                    that you operate on. With deep enough
                                    knowledge, you can see what a browser
                                    would do to accomplish the task and then
                                    start trying to do the same with curl.
                            
                              A slightly more effective way, which also
                                    works when the page is chock-full of
                                    obfuscated JavaScript, is to run the
                                    browser and monitor what HTTP operations
                                    it performs.
                            
                              The
                                  Copy as curl
                                    section describes how you can record a
                                    browser's request and easily convert
                                    that to a curl command line.
                            
                              Those copied curl command lines are often
                                    not good enough though since they tend to
                                    copy exactly
                                    that request, while you probably want to be
                                    a bit more dynamic so that you can
                                    reproduce the same operation and not just
                                    resend the verbatim request.
                            
                                Cookies
                                      
                              A lot of the web today works with a user
                                    name and password login prompt somewhere. In
                                    many cases you even logged in a while ago
                                    with your browser but it has kept the state
                                    and keeps you logged in.
                            
                              The logged-in state is almost always
                                    maintained using
                                  cookies. A common operation would be to first
                                    log in and save the returned cookies in a
                                    file, and then let the site update the
                                    cookies in the subsequent command lines when
                                    you traverse the site with curl.
                            
                                Web logins and sessions
                                      
                              The site at
                                  https://example.com/
                                    features a login prompt. The login on the
                                    website is an HTML form that you submit
                                    with an
                                  HTTP POST.
                                    Save the response cookies and the
                                    response (HTML) output.
                            
                              Although the login page is visible (if
                                    you use a browser) on
                                  https://example.com/, the HTML form tag on that page tells
                                    you exactly which URL to send the POST
                                    to, using the action
                                    attribute.
                            
                              In our imaginary case, the form tag looks
                                    like this:
                            
<form action="login.cgi" method="POST">
  <input type="text" name="user">
  <input type="password" name="secret">
  <input type="hidden" name="id" value="bc76">
</form>

                              There are three fields of importance: user, secret
                                    and id. The last one, the id, is marked hidden
                                    which means that it will not show up in the
                                    browser and it is not a field that a user
                                    fills in. It is generated by the site
                                    itself, and for your curl login to succeed,
                                    you need to extract that value and use it in
                                    your POST submission together with the rest
                                    of the data.
                            
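                              Send the correct field contents to the
                                    correct destination URL. The relative
                                    action login.cgi resolves against the
                                    page URL, so the POST goes to
                                    https://example.com/login.cgi: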
                            
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi -o out

                              Many login pages even send you a session
                                    cookie already when presenting the login,
                                    and since you often need to extract the
                                    hidden fields from the <form>
                                    tag anyway, you could do something like this
                                    first:
                            
curl -c cookies https://example.com/ -o loginform

                              You would often need an HTML parser or some
                                    scripting language to extract the id field
                                    from there, and then you can proceed and
                                    log in as shown above, but with the added
                                    cookie loading (I'm splitting the line
                                    into two lines to make it more
                                    readable):
                            
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
-b cookies -c cookies -o out
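
The extraction step mentioned above can be sketched with plain sed when the form is as simple as the example one. This is a minimal sketch, not a general HTML parser: the function name extract_hidden_id is made up for illustration, and it assumes the input tag sits on a single line with name= before value=, as in the imaginary form. Real pages may well need a proper parser.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the hidden "id" input found
# in a saved HTML file given as the first argument.
# Assumption: the <input> tag is on one line and name= precedes value=,
# as in the example form. Only the first match is printed.
extract_hidden_id() {
  sed -n 's/.*<input[^>]*name="id"[^>]*value="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage (needs network access, so shown as comments):
#   curl -c cookies https://example.com/ -o loginform
#   id=$(extract_hidden_id loginform)
#   curl -d user=daniel -d secret=qwerty -d "id=$id" \
#     https://example.com/login.cgi -b cookies -c cookies -o out
```

If the value can contain characters that need encoding, --data-urlencode "id=$id" is probably a safer choice than -d for that field.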

                              You can see that it uses both -b
                                    for reading cookies from the file and -c
                                    to store cookies again, for when the server
                                    sends back updated cookies.
                            
                              Always, always, add -v
                                    to the command lines when working out the
                                    details. See also the
                                  verbose
                                    section for more details on that.
                            
                                Redirects
                                      
                              It is common for servers to use
                                  redirects
                                    when responding to a login POST. It is so
                                    common I would probably say it is rare that
                                    it is not solved with a redirect.
                            
                              You then just need to remember that curl
                                    does not follow redirects automatically. You
                                    need to instruct it to do this by adding the -L
                                    command line option. Adding that to the
                                    previous command line then makes the full
                                    one look like:
                            
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
-b cookies -c cookies -L -o out
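
When working out a redirecting login in a script, it can help to see where the transfer actually ended up. The sketch below wraps the command line above in a function and uses curl's -w option to print the final status code, the number of redirects followed and the final URL. The function name login_follow is an assumption made for this example. Note that when curl follows a 301, 302 or 303 response to a POST, it switches to GET for the next request, which is usually what a login flow expects.

```shell
#!/bin/sh
# Hypothetical helper: POST the login form, follow redirects, and print
# "<status> <redirect count> <final URL>" on stdout.
# Arguments: user, secret, id (the fields from the example form).
login_follow() {
  curl -d "user=$1" -d "secret=$2" -d "id=$3" \
    https://example.com/login.cgi \
    -b cookies -c cookies -L -s -o out \
    -w '%{http_code} %{num_redirects} %{url_effective}\n'
}

# Usage (needs network access):
#   login_follow daniel qwerty bc76
```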

                                Post-login
                                      
                              In the above example command lines, we save
                                    the login response output in a file named
                                    'out', and in your script you
                                    should probably verify that it contains
                                    some text that confirms that the login
                                    was successful.
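
Such a check could be as small as a grep on the saved response. This is a sketch under an assumption: the marker text "Logged in as" is invented for the example, so pick a string that your site only shows to authenticated users. The function name login_succeeded is likewise hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the saved login
# response in the file given as $1 contains the marker text.
# "Logged in as" is an assumed marker, not something the site promises.
login_succeeded() {
  grep -q "Logged in as" "$1"
}

# Usage:
#   if login_succeeded out; then
#     echo "login ok"
#   else
#     echo "login failed" >&2
#     exit 1
#   fi
```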
                            
                              Once successfully logged in, get the files
                                    or perform the HTTP operations you need and
                                    remember to keep using both -b
                                    and -c
                                    on the command lines to use and update the
                                    cookies.
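
One way to avoid forgetting the two options is a small wrapper function. The following is a sketch only: session_curl and the COOKIE_JAR variable are names made up for this example, and the URLs in the usage comments are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: run curl with the cookie file both loaded (-b)
# and updated (-c), so every request shares the logged-in session.
# COOKIE_JAR defaults to the file name used in the examples above.
COOKIE_JAR="${COOKIE_JAR:-cookies}"

session_curl() {
  curl -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$@"
}

# Usage (placeholder URLs, needs network access):
#   session_curl https://example.com/report.csv -o report.csv
#   session_curl -d note=hello https://example.com/notes.cgi
```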
                            
                                Referer
                                      
                              Some sites will check that the Referer:
                                    is actually identifying the legitimate
                                    parent URL when you request something or
                                    when you log in or similar. You can then
                                    inform the server from which URL you arrived
                                    by using -e https://example.com/
                                    etc. Appending that to the previous login
                                    attempt then makes it:
                            
curl -d user=daniel -d secret=qwerty -d id=bc76 https://example.com/login.cgi \
-b cookies -c cookies -L -e "https://example.com/" -o out