Goutte, a simple PHP Web Scraper
================================

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML
responses.

Requirements
------------

Goutte depends on PHP 5.5+ and Guzzle 6+.

.. tip::

    If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v2.0.4/goutte-v2.0.4.phar>`_).

    If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest `phar
    <https://github.com/FriendsOfPHP/Goutte/releases/download/v1.0.7/goutte-v1.0.7.phar>`_).

Installation
------------

Add ``fabpot/goutte`` as a require dependency in your ``composer.json`` file:

.. code-block:: bash

    composer require fabpot/goutte

Usage
-----

Create a Goutte Client instance (which extends
``Symfony\Component\BrowserKit\Client``):

.. code-block:: php

    use Goutte\Client;

    $client = new Client();

Make requests with the ``request()`` method:

.. code-block:: php

    // Go to the symfony.com website
    $crawler = $client->request('GET', 'http://www.symfony.com/blog/');

The method returns a ``Crawler`` object
(``Symfony\Component\DomCrawler\Crawler``).

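As a sketch of what ``request()`` hands back, the script below swaps in
Guzzle's ``MockHandler`` so that it runs without network access. The HTML, the
``example.com`` URL and the ``vendor/autoload.php`` path are assumptions made
for illustration, and the script presumes Goutte 3.x on top of Guzzle 6:

.. code-block:: php

    <?php
    // Assumption: Composer's autoloader lives in the working directory.
    require 'vendor/autoload.php';

    use Goutte\Client;
    use GuzzleHttp\Client as GuzzleClient;
    use GuzzleHttp\Handler\MockHandler;
    use GuzzleHttp\HandlerStack;
    use GuzzleHttp\Psr7\Response;

    // Queue one canned HTML response instead of contacting a real server.
    $html = '<html><head><title>Blog</title></head><body><p>Hello</p></body></html>';
    $mock = new MockHandler(array(
        new Response(200, array('Content-Type' => 'text/html; charset=UTF-8'), $html),
    ));

    $client = new Client();
    $client->setClient(new GuzzleClient(array('handler' => HandlerStack::create($mock))));

    // Placeholder URL: the mock handler answers whatever is requested.
    $crawler = $client->request('GET', 'http://example.com/blog/');

    $title = $crawler->filter('title')->text();        // text of the <title> element
    $uri = $client->getHistory()->current()->getUri(); // URI of the request just made
    $body = $client->getResponse()->getContent();      // raw response body

    print $title."\n";

Against a live site, drop the mock handler and keep only the ``request()``
call; the ``Crawler`` is queried in the same way.
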
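Guzzle 6 clients are configured at construction time, so fine-tune cURL
options by passing a preconfigured Guzzle client to Goutte:

.. code-block:: php

    use GuzzleHttp\Client as GuzzleClient;

    $client->setClient(new GuzzleClient(array('curl' => array(CURLOPT_TIMEOUT => 60))));
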
Click on links:

.. code-block:: php

    // Click on the "Security Advisories" link
    $link = $crawler->selectLink('Security Advisories')->link();
    $crawler = $client->click($link);

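Before clicking, it can help to see where a link points. The following is a
minimal offline sketch that builds a ``Crawler`` from a hypothetical HTML
string (the markup, the ``example.com`` URI and the ``vendor/autoload.php``
path are assumptions) and inspects the ``Link`` object that ``click()`` would
receive:

.. code-block:: php

    <?php
    // Assumption: Composer's autoloader lives in the working directory.
    require 'vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup; the second argument is the URI the page came from.
    $html = '<html><body>'
        .'<a href="/blog/category/security-advisories">Security Advisories</a>'
        .'</body></html>';
    $crawler = new Crawler($html, 'http://example.com/blog/');

    // Select the link by its text, exactly as in the snippet above.
    $link = $crawler->selectLink('Security Advisories')->link();

    $uri = $link->getUri();       // href resolved against the page URI
    $method = $link->getMethod(); // links are followed with GET

    print $method.' '.$uri."\n";
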
Extract data:

.. code-block:: php

    // Get the latest posts in this category and display their titles
    $crawler->filter('h2 > a')->each(function ($node) {
        print $node->text()."\n";
    });

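Extraction is not limited to printing text. As a sketch under the assumption
of a hypothetical two-post page (the markup, the URI and the
``vendor/autoload.php`` path are made up for illustration), the values
returned from the ``each()`` callback can be collected into an array, and
attributes can be read with ``extract()`` or ``attr()``:

.. code-block:: php

    <?php
    // Assumption: Composer's autoloader lives in the working directory.
    require 'vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical page with two post headings.
    $html = '<html><body>'
        .'<h2><a href="/blog/first">First post</a></h2>'
        .'<h2><a href="/blog/second">Second post</a></h2>'
        .'</body></html>';
    $crawler = new Crawler($html, 'http://example.com/blog/');

    $links = $crawler->filter('h2 > a');

    // each() returns an array of whatever the callback returns.
    $titles = $links->each(function (Crawler $node) {
        return $node->text();
    });

    // extract() with a single attribute name yields a flat list of values.
    $hrefs = $links->extract(array('href'));

    // attr() reads an attribute from the first matched node.
    $firstHref = $links->first()->attr('href');

    print implode(', ', $titles)."\n";
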
Submit forms:

.. code-block:: php

    $crawler = $client->request('GET', 'http://github.com/');
    $crawler = $client->click($crawler->selectLink('Sign in')->link());
    $form = $crawler->selectButton('Sign in')->form();
    $crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
    $crawler->filter('.flash-error')->each(function ($node) {
        print $node->text()."\n";
    });

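To check what a form would send before submitting it, the ``Form`` object can
be filled in and inspected on its own. This is a minimal offline sketch; the
login form markup, the ``example.com`` URI and the ``vendor/autoload.php``
path are assumptions and do not reflect any real site:

.. code-block:: php

    <?php
    // Assumption: Composer's autoloader lives in the working directory.
    require 'vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical login form.
    $html = '<html><body>'
        .'<form action="/session" method="post">'
        .'<input type="text" name="login" />'
        .'<input type="password" name="password" />'
        .'<input type="submit" value="Sign in" />'
        .'</form>'
        .'</body></html>';
    $crawler = new Crawler($html, 'http://example.com/login');

    // Values can be passed to form() or set afterwards with array access.
    $form = $crawler->selectButton('Sign in')->form(array('login' => 'fabpot'));
    $form['password'] = 'xxxxxx';

    $values = $form->getValues(); // field name => value pairs to be sent
    $target = $form->getUri();    // form action resolved against the page URI
    $method = $form->getMethod(); // HTTP method in upper case

    print $method.' '.$target."\n";

Passing the same ``$form`` to ``$client->submit($form)`` would then send
these values.
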
More Information
----------------

Read the documentation of the BrowserKit and `DomCrawler
<http://symfony.com/doc/any/components/dom_crawler.html>`_ Symfony Components
for more information about what you can do with Goutte.

Pronunciation
-------------

Goutte is pronounced ``goot`` i.e. it rhymes with ``boot`` and not ``out``.

Technical Information
---------------------

Goutte is a thin wrapper around the following fine PHP libraries:

* Symfony Components: BrowserKit, CssSelector and DomCrawler;

* `Guzzle`_ HTTP Component.

License
-------

Goutte is licensed under the MIT license.

.. _`Composer`: http://getcomposer.org
.. _`Guzzle`:   http://docs.guzzlephp.org