Source code for okcupyd.util.fetchable

"""
Most of the collection objects that are returned from function
invocations in the okcupyd library are instances of
:class:`~okcupyd.util.fetchable.Fetchable`. In most cases, it is fine
to treat these objects as though they are lists because they can be iterated
over, sliced and accessed by index, just like lists:

.. code:: python

   for question in user.profile.questions:
       print(question.answer.text)

   a_random_question = user.profile.questions[2]
   for question in questions[2:4]:
       print(question.answer_options[0])

However, in some cases, it is important to be aware of the subtle
differences between :class:`~okcupyd.util.fetchable.Fetchable` objects
and python lists.
:class:`~okcupyd.util.fetchable.Fetchable` construct the elements that
they "contain" lazily. In most of its uses in the okcupyd library,
this means that http requests can be made to populate
:class:`~okcupyd.util.fetchable.Fetchable` instances as its elments
are requested.

The :attr:`~okcupyd.profile.Profile.questions`
:class:`~okcupyd.util.fetchable.Fetchable` that is used in the example
above fetches the pages that are used to construct its contents in
batches of 10 questions. This means that the actual call to retrieve
data is made when iteration starts. If you enable the request logger
when you run this code snippet, you get output that illustrates this
fact:

.. code::

   2014-10-29 04:25:04 Livien-MacbookAir requests.packages.urllib3.connectionpool[82461] DEBUG "GET /profile/ShrewdDrew/questions?leanmode=1&low=11 HTTP/1.1" 200 None
    Yes
    Yes
    Kiss someone.
    Yes.
    Yes
    Sex.
    Both equally
    No, I wouldn't give it as a gift.
    Maybe, I want to know all the important stuff.
    Once or twice a week
    2014-10-29 04:25:04 Livien-MacbookAir requests.packages.urllib3.connectionpool[82461] DEBUG "GET /profile/ShrewdDrew/questions?leanmode=1&low=21 HTTP/1.1" 200 None
    No.
    No
    No
    Yes
    Rarely / never
    Always.
    Discovering your shared interests
    The sun
    Acceptable.
    No.

Some fetchables will continue fetching content for quite a long time.
The search fetchable, for example, will fetch content until okcupid runs
out of search results. As such, things like:

.. code:: python

    for profile in user.search():
        profile.message("hey!")

should be avoided, as they are likely to generate a massive number of requests
to okcupid.com.


Another subtlety of the :class:`~okcupyd.util.fetchable.Fetchable`
class is that its instances cache its contained results. This means that
the second iteration over :attr:`okcupyd.profile.Profile.questions` in the
example below does not result in any http requests:

.. code:: python

    for question in user.profile.questions:
        print(question.text)

    for question in user.profile.questions:
        print(question.answer)

It is important to understand that this means that the contents of a
:class:`~okcupyd.util.fetchable.Fetchable` are not guarenteed to be in
sync with okcupid.com the second time they are requested. Calling
:meth:`~okcupyd.util.fetchable.Fetchable.refresh` will cause the
:class:`~okcupyd.util.fetchable.Fetchable` to request new data from
okcupid.com when its contents are requested. The code snippet that
follows prints out all the questions that the logged in user has
answered roughly once per hour, including ones that are answered while
the program is running.


.. code:: python

    import time

    while True:
        for question in user.profile.questions:
            print(question.text)
        user.profile.questions.refresh()
        time.sleep(3600)

Without the call to  `user.profile.questions.refresh()`, this program
would never update the user.profile.questions instance, and thus what
would be printed to the screen with each iteration of the for loop.
"""
import itertools
from lxml import html


[docs]class Fetchable(object):
    """List-like container object that lazily loads its contained items."""

    @classmethod
[docs]    def fetch_marshall(cls, fetcher, processor):
        return cls(FetchMarshall(fetcher, processor))

[docs]    def __init__(self, fetcher, **kwargs):
        """
        :param fetcher: An object with a `fetch` generator method that retrieves
                        items for the fetchable.
        :param nice_repr: Append the repr of a list containing the items that
                          have been fetched to this point by the fetcher. Defaults
                          to True
        :param kwargs: Arguments that should be passed to the fetcher when it's
                       fetch method is called. These are stored on the fetchable
                       so they can be passed to the fetcher whenever
                       :meth:`refresh` is called.
        """
        self._fetcher = fetcher
        self._kwargs = kwargs
        self.refresh()

[docs]    def refresh(self, nice_repr=True, **kwargs):
        """
        :param nice_repr: Append the repr of a list containing the items that
                          have been fetched to this point by the fetcher.
        :type nice_repr: bool
        :param kwargs: kwargs that should be passed to the fetcher when its
                       fetch method is called. These are merged with the values
                       provided to the constructor, with the ones provided here
                       taking precedence if there is a conflict.
        """
        for key, value in self._kwargs.items():
            kwargs.setdefault(key, value)
        # No real good reason to hold on to this. DONT TOUCH.
        self._original_iterable = self._fetcher.fetch(**kwargs)
        self.exhausted = False
        if nice_repr:
            self._accumulated = []
            self._original_iterable = self._make_nice_repr_iterator(
                self._original_iterable, self._accumulated
            )
        else:
            self._accumulated = None
        self._clonable, = itertools.tee(self._original_iterable, 1)
        return self

    @staticmethod
    def _make_nice_repr_iterator(original_iterable, accumulator):
        for item in original_iterable:
            accumulator.append(item)
            yield item

    __call__ = refresh

    def __iter__(self):
        # This is hard to think about, but you can't ever use an iterator
        # if you plan on cloning it. Furthermore, you can only clone it once
        # (directly). For this reason, we throw away the iterator once it has
        # been cloned.
        new_iterable, self._clonable = itertools.tee(self._clonable, 2)
        return new_iterable

    def __getitem__(self, item):
        if not isinstance(item, slice):
            iterator = iter(self)
            assert isinstance(item, int)
            if item < 0:
                return list(iterator)[item]
            try:
                for i in range(item):
                    next(iterator)
                return next(iterator)
            except StopIteration:
                self.exhausted = True
                raise IndexError("The Fetchable does not have a value at the "
                                 "index that was provided.")
        return self._handle_slice(item)

    def _handle_slice(self, item):
        iterator = iter(self)
        if item.start is None and item.stop is None:
            # No point in being lazy if they want it all.
            self.exhausted = True
            return list(iterator)[item]
        if ((item.start and item.start < 0) or
            (not item.stop or item.stop < 0)):
            # If we have any negative numbers we have to expand the whole
            # thing anyway. This is also the case if there is no bound
            # on the slice, hence the `not item.stop` trigger.
            self.exausted = True
            return list(iterator)[item]
        accumulator = []
        # No need to do this for stop since we are sure it is not None.
        start = item.start or 0
        for _ in range(start):
            try:
                next(iterator)
            except StopIteration: # This is strange but its what list do.
                self.exhausted = True
                break
        for i in range(item.stop - start):
            try:
                value = next(iterator)
            except StopIteration:
                self.exhausted = True
                break
            else:
                if item.step == None or i % item.step == 0:
                    accumulator.append(value)
        return accumulator

    def __repr__(self):
        fetched_type = repr(self._fetcher)
        if self._accumulated == None:
            list_repr = ''
        else:
            try:
                self[0]
            except:
                pass
            else:
                fetched_type = type(self._accumulated[0]).__name__
            list_repr = repr(self._accumulated)
            if not self.exhausted:
                if len(self._accumulated) == 0:
                    list_repr = '[...]'
                else:
                    list_repr = '{0}, ...]'.format(list_repr[:-1])

        return '<{0}[{1}]{2}>'.format(type(self).__name__,
                                    fetched_type, list_repr)

    def __len__(self):
        return len(self[:])

    def __add__(self, other):
        return self[:] + other[:]

    def __eq__(self, other):
        return self[:] == other[:]

    def __nonzero__(self):
        try:
            self[0]
        except IndexError:
            return False
        else:
            return True


[docs]class FetchMarshall(object):

    def __init__(self, fetcher, processor, terminator=None, start_at=1):
        self._fetcher = fetcher
        self._start_at = start_at
        self._processor = processor
        self._terminator = terminator or self.simple_decider

    @staticmethod
[docs]    def simple_decider(pos, last, text_response):
        return pos > last

[docs]    def fetch(self, start_at=None):
        pos = start_at or self._start_at
        while True:
            last = pos
            text_response = self._fetcher.fetch(start_at=pos)
            if not text_response: break
            for item in self._processor.process(text_response):
                if item is StopIteration:
                    raise StopIteration()
                yield item
                pos += 1
            if not self._terminator(pos, last, text_response):
                break

    def __repr__(self):
        return '{0}({1}, {2})'.format(type(self).__name__,
                                      repr(self._fetcher),
                                      repr(self._processor))


[docs]class SimpleProcessor(object):
    """Applies object_factory to each element found with element_xpath

    Accepts session merely to be consistent with the FetchMarshall interface.
    """

    def __init__(self, session, object_factory, element_xpath):
        self._object_factory = object_factory
        self._element_xpath = element_xpath

[docs]    def process(self, text_response):
        if not text_response.strip():
            yield StopIteration
            raise StopIteration()
        for element in self._element_xpath.apply_(
            html.fromstring(text_response)
        ):
            yield self._object_factory(element)

    def __repr__(self):
        return '<{0}({1}, {2})>'.format(type(self).__name__,
                                        repr(self._object_factory),
                                        repr(self._element_xpath))


[docs]class PaginationProcessor(object):

    def __init__(self, object_factory, element_xpb,
                 current_page_xpb, total_page_xpb):
        self._object_factory = object_factory
        self._element_xpb = element_xpb
        self._current_page_xpb = current_page_xpb
        self._total_page_xpb = total_page_xpb

    def _current_page(self, tree):
        return int(self._current_page_xpb.one_(tree))

    def _page_count(self, tree):
        return int(self._total_page_xpb.one_(tree))

    def _are_pages_left(self, tree):
        return self._current_page(tree) < self._page_count(tree)

[docs]    def process(self, text_response):
        tree = html.fromstring(text_response)
        for element in self._element_xpb.apply_(tree):
            yield self._object_factory(element)
        if not self._are_pages_left(tree):
            # This is pretty gross: Part of the processor protocol
            # is that if StopIteration is yielded, the loop above
            # will be terminated. No easy way around this short
            # of making bigger objects or abstracting less.
            yield StopIteration


[docs]class GETFetcher(object):

    def __init__(self, session, path, query_param_builder=lambda: {}):
        self._session = session
        self._path = path
        self._query_param_builder = query_param_builder

[docs]    def fetch(self, *args, **kwargs):
        response = self._session.okc_get(
            self._path, params=self._query_param_builder(*args, **kwargs)
        )
        return response.content.strip()

    def __repr__(self):
        return '<{0}("{1}")>'.format(type(self).__name__, self._path)