
My First Python Script

Python Script Scripting Programming Linux
Author: DarkXero
I create content about Linux & FOSS software. You’ll find me rocking ArchLinux with KDE Plasma, probably helping somebody with a question. 😎

What prompted me?

Well, as I was porting all relevant forum posts to this site in Markdown, I needed a better way to do so. Someone on Mastodon gave me the idea of writing a Python script that does it for me. So I went on a learning spree. Not really…

I gotta be honest with you guys, I had to use ChatGPT. Doing it by watching YouTube videos or by searching the net would have taken me ages. So I went to the thing that does things much faster. And before you say it, I am well aware that we can’t rely on it 100%, coz it’s very inaccurate and licensing is an issue. So I was careful. Anyway, I will post what I have learned so far. Maybe it can be of use to you.

The code

Before we begin, I need to mention that the code is not perfect since it’s my very first script. But it did the trick for me. It could still use some tweaking, so if you know how I can make it better, please feel free to let me know. Use any of my socials to contact me.

The issue with the script is formatting: the resulting Markdown is not well formatted, with a huge amount of dead space and underscores all over, and embedded videos are always put at the end of posts. Not to mention that code snippets are not labeled as Bash, YAML, etc.
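
One way to tackle the dead space is a small clean-up pass over the converted text before saving it. Here is a minimal sketch that uses only the standard library. Note that tidy_markdown is a made-up helper name, not something from the script or from markdownify.

```python
import re

def tidy_markdown(text: str) -> str:
    """Hypothetical helper: squeeze blank lines and strip trailing whitespace."""
    # Remove spaces and tabs left at the end of each line
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse three or more consecutive newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim the edges and end the text with exactly one newline
    return text.strip() + "\n"

print(repr(tidy_markdown("Title  \n\n\n\n\nBody\n\n\n")))
```

Two caveats: trailing double spaces mean a hard line break in Markdown, and blank lines inside code blocks get squeezed as well, so check the result before trusting it. For the underscores and the unlabeled code snippets, the markdownify documentation lists options such as escape_underscores and code_language that may be worth trying.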

With that being said, let’s begin, shall we?

Before going through the code, make sure you have the following packages installed on your system. Also, forum posts need to be accessible without an account, otherwise the script will fail miserably.

sudo pacman -S python python-beautifulsoup4 python-requests python-markdownify

Now here’s the code, then I will explain…

import requests
from bs4 import BeautifulSoup
import markdownify
import os

def convert_to_markdown(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the main content of the post
    post_content = soup.find('div', {'class': 'post_body'})

    if not post_content:
        return "Could not find the post content."

    # Convert HTML to Markdown, including embedded YouTube videos
    markdown = markdownify.markdownify(str(post_content), heading_style="ATX")

    # Process embedded YouTube videos
    for iframe in post_content.find_all('iframe'):
        # Fall back to an empty string so an iframe without a src doesn't crash the check
        src = iframe.get('src', '')
        if 'youtube.com' in src or 'youtu.be' in src:
            video_id = src.split('/')[-1].split('?')[0]
            youtube_markdown = f"\n[![YouTube Video](https://img.youtube.com/vi/{video_id}/0.jpg)]({src})\n"
            markdown += youtube_markdown

    return markdown

# URLs of the forum posts
urls = [
    "https://forum.xerolinux.xyz/thread-131.html"
]

# Create an output directory if it doesn't exist
output_dir = 'markdown_posts'
os.makedirs(output_dir, exist_ok=True)

# Convert each URL to Markdown and save to file
for url in urls:
    markdown_content = convert_to_markdown(url)

    # Extract thread ID from URL to use as filename
    thread_id = url.split('-')[-1].replace('.html', '')
    output_file = os.path.join(output_dir, f"post_{thread_id}.md")

    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(markdown_content)

    print(f"Saved Markdown for {url} to {output_file}")

Basic Explanation

  • Import Libraries: We use requests to fetch the webpage content and BeautifulSoup to parse the HTML. The markdownify library is used to convert HTML to Markdown.
  • Extract Content: The script extracts the main content of the post by looking for the div with class post_body.
  • Convert to Markdown: The script uses the markdownify library to convert the HTML content to Markdown, then appends a clickable thumbnail link for each embedded YouTube video to the end of the result.
  • Save to File: The script saves each converted post to its own Markdown file. The filenames are derived from the thread IDs in the URLs.
  • Output Directory: The script creates the markdown_posts directory (the value of output_dir) if it doesn’t exist and saves the Markdown files there.
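
To see what the two string-splitting lines in the script actually do, here they are pulled out into tiny functions. This is only an illustration: the function names are made up, and the embed URL is an assumed example with a placeholder video ID.

```python
def thread_id_from_url(url: str) -> str:
    # Same expression as the script: text after the last '-', minus '.html'
    return url.split('-')[-1].replace('.html', '')

def video_id_from_src(src: str) -> str:
    # Same expression as the script: last path segment, minus any query string
    return src.split('/')[-1].split('?')[0]

print(thread_id_from_url("https://forum.xerolinux.xyz/thread-131.html"))  # 131
print(video_id_from_src("https://www.youtube.com/embed/VIDEO_ID?rel=0"))  # VIDEO_ID
```

Keep in mind that the video ID trick only works for embed-style links where the ID is the last path segment. For a link like https://www.youtube.com/watch?v=VIDEO_ID the same expression returns "watch" instead of the ID.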

How to Run the Script

  • Save the Script: Copy the script into a file named forum_to_markdown.py.
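  • Optional: Marking the script as executable via chmod +x forum_to_markdown.py is only needed if you want to run it directly as ./forum_to_markdown.py, which also requires a shebang line such as #!/usr/bin/env python at the top of the file. Running it through python does not need it.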
  • Run the Script: Execute the script by running the following command in your terminal:
python forum_to_markdown.py

This script will output each post to its own Markdown file in the markdown_posts directory.
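
One thing to watch out for: when the post_body div is not found, the script writes the text "Could not find the post content." into the .md file as if it were the post. A possible way around that is sketched below, assuming the converter is changed to return None on failure. The names save_posts and fake_convert and the thread-999 URL are made up for the demo; fake_convert stands in for the real converter so the example runs without touching the network.

```python
import os
import tempfile

def save_posts(urls, convert, output_dir):
    """Write one .md file per URL, skipping URLs whose conversion returns None."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for url in urls:
        markdown = convert(url)
        if markdown is None:
            print(f"Skipping {url}: no post content found")
            continue
        # Same filename scheme as the script
        thread_id = url.split('-')[-1].replace('.html', '')
        path = os.path.join(output_dir, f"post_{thread_id}.md")
        with open(path, 'w', encoding='utf-8') as file:
            file.write(markdown)
        saved.append(path)
    return saved

# Stand-in converter for the demo only: pretends thread 131 converts fine
# and every other thread has no post content
def fake_convert(url):
    return "# Hello\n" if "thread-131" in url else None

demo_dir = tempfile.mkdtemp()
saved = save_posts(
    [
        "https://forum.xerolinux.xyz/thread-131.html",
        "https://forum.xerolinux.xyz/thread-999.html",  # hypothetical URL
    ],
    fake_convert,
    demo_dir,
)
print(saved)
```

Passing the converter in as an argument keeps the file-writing part separate from the fetching part, which also makes it easy to try out without a forum to scrape.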

That’s it for now. Do let me know what you think.
