Slixfeed/slixfeed/fetch.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""

FIXME

1) feed_mode_scan doesn't find feed for https://www.blender.org/
   even though it should be according to the pathnames dictionary.

TODO

1) Support Gemini and Gopher.

2) Check also for HTML, not only feed.bozo.

3) Add "if utility.is_feed(url, feed)" to view_entry and view_feed

4) Replace sqlite.remove_nonexistent_entries by sqlite.check_entry_exist
   Same check, just reverse.

5) Support protocol Gopher
   See project /michael-lazar/pygopherd
   See project /gopherball/gb

6) Support ActivityPub @person@domain (see Tip Of The Day).
    
7) See project /offpunk/offblocklist.py

"""

from aiohttp import ClientError, ClientSession, ClientTimeout
from asyncio import TimeoutError
# from asyncio.exceptions import IncompleteReadError
# from bs4 import BeautifulSoup
# from http.client import IncompleteRead
import logging
# from lxml import html
# from xml.etree.ElementTree import ElementTree, ParseError
import requests
import slixfeed.config as config
try:
    from magnet2torrent import Magnet2Torrent, FailedToFetchException
except ImportError:
    logging.info(
        "Package magnet2torrent was not found.\n"
        "BitTorrent is disabled.")


# class FetchDat:
# async def dat():

# class FetchFtp:
# async def ftp():

# class FetchGemini:
# async def gemini():

# class FetchGopher:
# async def gopher():

# class FetchHttp:
# async def http():

# class FetchIpfs:
# async def ipfs():


def http_response(url):
    """
    Download response headers.

    Parameters
    ----------
    url : str
        URL.

    Returns
    -------
    response: requests.models.Response
        HTTP Header Response.

    Result would contain these:
        response.encoding
        response.headers
        response.history
        response.reason
        response.status_code
        response.url
    """
    user_agent = (
        config.get_value(
            "settings", "Network", "user_agent")
        ) or 'Slixfeed/0.1'
    headers = {
        "User-Agent": user_agent
    }
    try:
        # Don't use HEAD request because quite a few websites may deny it
        # response = requests.head(url, headers=headers, allow_redirects=True)
        # A timeout keeps the call from blocking indefinitely
        response = requests.get(url, headers=headers, allow_redirects=True,
                                timeout=10)
    except Exception as e:
        logging.error(str(e))
        response = None
    return response


async def http(url):
    """
    Download content of given URL.

    Parameters
    ----------
    url : str
        URL.

    Returns
    -------
    result : dict
        Document and response metadata, or an error message.
    """
    user_agent = (config.get_values('settings.toml', 'network')['user_agent']
                  or 'Slixfeed/0.1')
    headers = {'User-Agent': user_agent}
    # Pass None rather than an empty string when no proxy is configured
    proxy = (config.get_values('settings.toml', 'network')['http_proxy']
             or None)
    timeout = ClientTimeout(total=10)
    async with ClientSession(headers=headers) as session:
    # async with ClientSession(trust_env=True) as session:
        try:
            async with session.get(url, proxy=proxy,
                                   # proxy_auth=(proxy_username, proxy_password),
                                   timeout=timeout
                                   ) as response:
                status = response.status
                if status == 200:
                    try:
                        document = await response.text()
                        result = {'charset': response.charset,
                                  'content': document,
                                  'content_length': response.content_length,
                                  'content_type': response.content_type,
                                  'error': False,
                                  'message': None,
                                  'original_url': url,
                                  'status_code': status,
                                  'response_url': response.url}
                    except Exception:
                        result = {'error': True,
                                  'message': 'Could not get document.',
                                  'original_url': url,
                                  'status_code': status,
                                  'response_url': response.url}
                else:
                    result = {'error': True,
                              'message': 'HTTP Error:' + str(status),
                              'original_url': url,
                              'status_code': status,
                              'response_url': response.url}
        except ClientError as e:
            result = {'error': True,
                      'message': 'Error:' + str(e),
                      'original_url': url,
                      'status_code': None}
        except TimeoutError as e:
            result = {'error': True,
                      'message': 'Timeout:' + str(e),
                      'original_url': url,
                      'status_code': None}
        except Exception as e:
            logging.error(e)
            result = {'error': True,
                      'message': 'Error:' + str(e),
                      'original_url': url,
                      'status_code': None}
    return result


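async def magnet(link):
    """
    Retrieve torrent data for a given Magnet link.
    """
    m2t = Magnet2Torrent(link)
    try:
        filename, torrent_data = await m2t.retrieve_torrent()
    except FailedToFetchException:
        logging.debug("Failed to retrieve torrent for %s", link)
        return None
    return filename, torrent_data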