Slixfeed/slixfeed/fetch.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
FIXME
1) feed_mode_scan doesn't find feed for https://www.blender.org/
even though it should be according to the pathnames dictionary.
2023-11-23 17:55:36 +01:00
TODO
1) Support Gemini and Gopher.
2) Check also for HTML, not only feed.bozo.
3) Add "if utility.is_feed(url, feed)" to view_entry and view_feed
4) Refactor view_entry and view_feed - Why "if" twice?
5) Replace sqlite.remove_nonexistent_entries by sqlite.check_entry_exist
Same check, just reverse.
"""
from aiohttp import ClientError, ClientSession, ClientTimeout
from asyncio import TimeoutError
from asyncio.exceptions import IncompleteReadError
from bs4 import BeautifulSoup
from email.utils import parseaddr
from feedparser import parse
from http.client import IncompleteRead
from lxml import html
import slixfeed.config as config
from slixfeed.datetime import now, rfc2822_to_iso8601
import slixfeed.sqlite as sqlite
from slixfeed.url import complete_url, join_url, trim_url
from urllib import error
# from xml.etree.ElementTree import ElementTree, ParseError
from urllib.parse import urlsplit, urlunsplit
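
# A hedged sketch of the "is_feed" helper referred to in TODO items 2 and 3
# above. The (url, feed) signature is taken from the TODO; the body is an
# assumption: accept a feedparser result when it recognized a feed format or
# produced entries. TODO item 2 (detecting HTML, not only feed.bozo) would
# additionally inspect the raw document, e.g. with BeautifulSoup.
def is_feed(url, feed):
    """Guess whether a feedparser result describes a feed (sketch)."""
    # url is unused here; it is kept to match the signature in the TODO.
    if feed.get("version"):
        # feedparser identified a concrete feed format (e.g. RSS or Atom).
        return True
    if feed.bozo and not feed.entries:
        # Parse error and no entries: likely HTML or another non-feed.
        return False
    return bool(feed.entries)
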
# async def dat():
# async def ftp():
# async def gemini():
# async def gopher():
# async def http():
# async def ipfs():
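# A hedged sketch of how the stubs above might be dispatched by URL scheme
# once implemented (TODO item 1 mentions Gemini and Gopher). The mapping and
# the fallback to None are assumptions; only HTTP(S) is handled in this
# module, via download_feed below.
def select_fetcher(url):
    """Return the coroutine suited to a plain URL string, if any (sketch)."""
    scheme = urlsplit(url).scheme
    fetchers = {
        "http": download_feed,
        "https": download_feed,
        # "gemini": gemini,  # planned; see the stubs above
        # "gopher": gopher,  # planned; see the stubs above
    }
    return fetchers.get(scheme)
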
async def download_feed(url):
"""
Download content of given URL.
2023-11-02 06:17:04 +01:00
Parameters
----------
url : list
URL.
Returns
-------
msg: list or str
Document or error message.
2023-10-24 16:43:14 +02:00
"""
    try:
        user_agent = config.get_value_default("settings", "Network", "user-agent")
    except Exception:
        user_agent = "Slixfeed/0.1"
    if not user_agent:
        user_agent = "Slixfeed/0.1"
    headers = {'User-Agent': user_agent}
    # The URL arrives wrapped in a list; unwrap it.
    url = url[0]
    # aiohttp expects None, not an empty string, when no proxy is set.
    proxy = config.get_value("settings", "Network", "http_proxy") or None
    timeout = ClientTimeout(total=10)
async with ClientSession(headers=headers) as session:
# async with ClientSession(trust_env=True) as session:
try:
async with session.get(url, proxy=proxy,
# proxy_auth=(proxy_username, proxy_password),
timeout=timeout
) as response:
                status = response.status
                if status == 200:
                    try:
                        doc = await response.text()
                        # print (response.content_type)
                        msg = [doc, status]
                    except Exception:
                        msg = [
                            False, "Document is too large or is not textual."
                        ]
                else:
                    msg = [
                        False, "HTTP Error: " + str(status)
                    ]
except ClientError as e:
# print('Error', str(e))
msg = [
False, "Error: " + str(e)
]
except TimeoutError as e:
# print('Timeout:', str(e))
msg = [
False, "Timeout: " + str(e)
]
return msg
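
# A hedged usage sketch, assuming the Slixfeed configuration is available:
# download_feed expects the URL wrapped in a list (it reads url[0]) and
# returns [document, status] on success or [False, error message] on
# failure. The URL below is illustrative only.
if __name__ == '__main__':
    from asyncio import run

    result = run(download_feed(['https://example.org/feed.xml']))
    if result[0]:
        print('Fetched', len(result[0]), 'characters')
    else:
        print('Failed:', result[1])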