I need some help with a programming task: use Python to grab a web page, and email it so the email can be read normally

in #steemit, 7 years ago

This is surprisingly more difficult than the simple description in the subject suggests. That’s because many mail readers don’t fully render CSS; for instance, if you’re sending to Gmail you need to take apart the CSS and decorate every individual element on the page with its styling inline, as Gmail does not load external stylesheets.
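To make "decorate every individual element" concrete, here is a deliberately naive sketch using only the standard library. It handles bare tag selectors and `.class` selectors and nothing else (no specificity, no `@media`, no descendant selectors), and it drops comments and doctypes. The names `parse_simple_css`, `ToyInliner`, and `inline_css` are made up for illustration; a real inliner such as premailer does far more.

```python
import re
from html import escape
from html.parser import HTMLParser


def parse_simple_css(css):
    """Map each bare tag or .class selector to its declarations (toy parser)."""
    rules = {}
    for selectors, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        declarations = "; ".join(p.strip() for p in body.split(";") if p.strip())
        for selector in selectors.split(","):
            if selector.strip():
                rules[selector.strip()] = declarations
    return rules


class ToyInliner(HTMLParser):
    """Re-emit HTML, copying matching rules into each element's style attribute."""

    def __init__(self, rules):
        # convert_charrefs=False keeps entities such as &amp; as separate events
        super().__init__(convert_charrefs=False)
        self.rules = rules
        self.out = []

    def _open_tag(self, tag, attrs, closing=">"):
        attrs = dict(attrs)
        styles = [self.rules[tag]] if tag in self.rules else []
        for cls in (attrs.get("class") or "").split():
            if "." + cls in self.rules:
                styles.append(self.rules["." + cls])
        if attrs.get("style"):
            styles.append(attrs["style"])  # an existing inline style goes last
        if styles:
            attrs["style"] = "; ".join(styles)
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs.items()
        )
        return "<" + tag + rendered + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._open_tag(tag, attrs, closing=" />"))

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")


def inline_css(html, css):
    """Return html with the toy rules from css copied into style attributes."""
    parser = ToyInliner(parse_simple_css(css))
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `inline_css('<p class="note">Hi</p>', 'p { color: red }')` gives each `<p>` its own `style` attribute, which is the form mail readers are most likely to honor.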

And, that’s not the only issue. I’m trying to email myself a Steemit page, and I’m reading mail via Thunderbird (a Mozilla project, like Firefox but for mail reading). It puts an “X” in a box at the top, then shows a bunch of menu entries, one on each line, then a bunch of whitespace, and then it does show the entire post including the comments, but the formatting still looks wrong, with lots more whitespace than in the web browser.
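One thing that might reduce the stray menu entries and whitespace is to drop the page "chrome" before mailing. The sketch below removes whole `<script>`, `<style>`, `<nav>`, `<header>`, and `<footer>` subtrees using the standard library. Whether Steemit's menus actually live in those elements is an assumption, and `ChromeStripper` and `strip_chrome` are hypothetical names.

```python
from html.parser import HTMLParser


class ChromeStripper(HTMLParser):
    """Copy HTML through unchanged, except for subtrees of the SKIP tags."""

    SKIP = {"script", "style", "nav", "header", "footer"}  # assumed culprits

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())  # raw tag as written

    def handle_startendtag(self, tag, attrs):
        if self.skip_depth == 0 and tag not in self.SKIP:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append("&" + name + ";")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append("&#" + name + ";")


def strip_chrome(html):
    """Return html without script, style, nav, header, and footer subtrees."""
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that removing `<style>` also removes the styling, so this would run after any CSS inlining step, not before it.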

I’d rather not render it as an image or take screencaps, as I want the emails to contain searchable text.
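For searchable text, one common shape is a multipart/alternative message: a plain-text part plus an HTML part, so the reader can show whichever it handles best. This is a minimal sketch with the standard library's `email.message.EmailMessage`; the function name `build_page_email` and its parameters are illustrative, not part of any existing script.

```python
from email.message import EmailMessage


def build_page_email(sender, recipient, subject, html_body, text_body):
    """Build a message carrying both a plain-text and an HTML version."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text_body)                      # text/plain fallback
    msg.add_alternative(html_body, subtype="html")  # text/html version
    return msg
```

Sending would then be something like `smtplib.SMTP("localhost").send_message(msg)`, assuming a mail server is listening locally; the host name is a placeholder.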

When I load a Steemit page in a browser, I see some things “paint” after it loads — for instance, load my Replies page, and the upvote arrows are all initially white (meaning hasn’t been upvoted), and a few seconds later, the ones that I’ve upvoted turn blue. So that’s an action that is happening deeper down, like with AJAX or Javascript.

I’ve found some Javascript render packages (different form of rendering than the above -- the Javascript ones are supposed to load the page elements that load after the HTML is loaded, so I'd have the "full page"), but I can’t get any of them working.

I’m using Python 3 in Ubuntu, and am using the Python module “requests” to get the page. I think the module “premailer” should be helpful here, but I haven’t managed to make it work yet. Right now, it gives a huge amount of warnings and errors, taking about five minutes before it then throws an exception — which Komodo (an IDE by ActiveState) says it can’t find the source code for, so it stops there.

I’ve found a few other resources, like the module “scrapy”: https://doc.scrapy.org/en/1.5/topics/email.html. However, it isn’t clear whether its email code will handle HTML/CSS correctly; I haven’t tried it yet.

Tried a bunch of other things and am rather frustrated right now, so asking for help.

This one, I couldn’t install the prerequisites: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/ — I did what it stated, “sudo apt-get install python-qt4”, which appeared to succeed; however, when it gets to the second line in the script, it raises an exception:

Traceback (most recent call last):
  File "/home/steemitdev/get_page_for_email.py", line 2, in <module>
    from PyQt4.QtGui import *
ModuleNotFoundError: No module named 'PyQt4'

I’ve tried installing that in various ways, and nothing has worked. This page https://stackoverflow.com/questions/40277382/pyqt4-installed-but-importerror-no-module-named-pyqt4-in-ubuntu-16-04 seems to indicate a way forward, but the “pyvenv” command it says to run gives an error:

$ pyvenv --system-site-packages .
WARNING: the pyenv script is deprecated in favour of python -m venv
Error: Command '['/home/steemitdev/projects/example.com/steemit/bin/python', '-Im', 'ensurepip', '--upgrade', '--default-pip']' returned non-zero exit status 1.
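The step that fails there is `ensurepip`, which bootstraps pip into the new environment. On Ubuntu that can be a sign that the `python3-venv` package is missing, though that is a guess from the error text and not something confirmed here. As a sketch, the standard library's `venv` module can create the same kind of environment from Python while skipping the pip step; `create_env` is a hypothetical helper name.

```python
import venv
from pathlib import Path


def create_env(env_dir):
    """Create a virtual environment that can see system site-packages.

    with_pip=False skips the ensurepip step that failed above, so the
    environment has no pip of its own.
    """
    venv.create(env_dir, system_site_packages=True, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"
```

With system site-packages visible, a PyQt4 installed through apt for the same Python version should be importable from inside the environment. Note that the `python-qt4` package targets Python 2, which may be why the Python 3 import fails in the first place.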

Tried Spynner, from https://pypi.python.org/pypi/spynner — but apparently it’s got a Python 2-style “print” statement, as it complains like so:

  File "/tmp/easy_install-tw3rrh31/autopy-0.51/setup.py", line 50
    print 'Updating init.py'
                           ^
SyntaxError: Missing parentheses in call to 'print'.

I know that one difference between Python 2 and Python 3 was that print was made into a proper function, so in Python 3 it must be called with parentheses; thus, the above line should really be:

print('Updating init.py')

But I can’t change that file because it’s expanded during the operation, and appears to be part of Python module “autopy”. Let’s try installing that separately, via “sudo apt install autopy”: yep, that errors out as well, with the same print issue. So, there’s a bug in autopy, and I can’t move forward with this approach, at least not until they’ve fixed that bug. Wonderful.

Anyway I’m starting to curse and I don’t want to give energy to the dark forces, so I’m writing this post and taking a nap.

Thanks for any assistance you can provide!




Hey @libertyteeth, I tried to google some code, but to me it's not totally clear what u want to achieve, so I'll ask back some questions:

  • Why do u want to grab a full website? Are u not interested in only a small part of it, or some text out of it? There's gonna be lots of data garbage when u store snapshots of webpages.
  • Why do u want to email a full website to urself/someone? There is a good reason for email programs to not allow images/javascript/external links: security. And a website in an email doesn't seem comfortable to me to read/watch.

One way you could achieve ur project goal, as I understand it, in a slightly different way:

  • Download the full website with wget and archive it in a folder.
  • For searching, u could use find or other Linux tools to search ur full archive.
  • If u want the archive available to others, rsync it to a webdrive and share it.

Cheers J

Thanks! I'm working on a project to email me when there's a new post by a Steemit user. @libertyranger (no relation :) ) had created one for @haejin, but it seems to have stopped working so I figured I'd flex my development muscles.

I've got it working, sending an SMS to my phone. I wanted to expand on it, so that it could detect (or be told) "this email is to a phone" or "this isn't to a phone" -- and for the latter, to add the entire post.

There are two reasons for that -- first, it saves a click; the user just has to scroll down while viewing the email. Second, it preserves it for historical reasons, in case Steemit ever goes down, or the post changes, etc.
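The "is this address a phone?" decision could be sketched as a lookup against known carrier email-to-SMS gateway domains, with the full post attached only for non-phone recipients. The domain list below is a small illustrative sample, not exhaustive, and `is_phone_address` and `build_body` are hypothetical names and not the project's actual code.

```python
# Illustrative sample of carrier email-to-SMS gateway domains (assumption).
SMS_GATEWAY_DOMAINS = {"vtext.com", "txt.att.net", "tmomail.net"}


def is_phone_address(address, extra_domains=()):
    """Guess whether an address delivers to a phone, based on its domain."""
    domain = address.rsplit("@", 1)[-1].lower()
    return domain in SMS_GATEWAY_DOMAINS or domain in set(extra_domains)


def build_body(address, post_url, post_html=None):
    """Short link-only text for phones; link plus full post otherwise."""
    if post_html is None or is_phone_address(address):
        return "New post: " + post_url
    return '<p>New post: <a href="{0}">{0}</a></p>\n{1}'.format(post_url, post_html)
```

The `extra_domains` argument covers the "or be told" case: a user could register their own gateway domain without the list needing to be complete.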

To your questions, it's possible that I could grab a smaller portion of the blog page. I like the way my solution is trending -- but the email doesn't read well, because apparently Steemit does some loading of page elements after the HTML is loaded. Meaning, wget might not be enough, either.

It's more for convenience than anything else, and thus really should be an "extra credit" i.e. low priority, side task. But sometimes I get stuck on something... :)

I've not had the time to come up with a good solution.
But you're a dev. You'll find a way to do this.
J J J J J

Are you just trying to be able to read the posts or does it have to be exactly the same css wise? Are you using steem-python to get the data sets?

It would be great if the result was the same, CSS-wise; it seemed like premailer would do the trick, but it just spat out warnings and errors for five minutes, then died in a manner Komodo couldn't debug!

I was going to use steem-python, but the data source wasn't working when I started writing the scripts, so I switched to web scraping instead, as I know that'll work, even if it's slightly less efficient than getting it from the blockchain itself. (That data source issue might have been the reason that @libertyranger's SMS service for @haejin's posts stopped working, as well.)

I answered another question just now as well, with reasons for writing this. Thanks for your help!

I hope they don't delete this post too

"They" didn't; "I" did.

Congratulations

Great information. I read your post and I always support you, @libertyteeth.

I will upvote and resteem.

That's pretty cool to know...

I am so sorry, because I am a newcomer.

Now it's on “waiting in queue #41”, meaning 40 people get encoded before you! The upload worked, just wait.

Well done, and thanks @libertyteeth for giving your reviews and sharing your analysis; your always helpful posts on stocks are great work.

Dear @minaraju your style of copy and paste commenting is considered spam. @steemflagrewards