's data as csv was: Re: [linux-audio-dev] Linux soundapps pages updated


Subject:'s data as csv was: Re: [linux-audio-dev] Linux soundapps pages updated
From: Jan Weil
Date: Mon Apr 19 2004 - 20:51:12 EEST

On Mon, 2004-04-12 at 18:29, Paul Winkler wrote:
> FYI, I'm still planning to implement my own proposal which has
> been discussed quite a lot in the L-A-U archives.
> I do somewhat similar sites for a living. It just needs me to block out a
> chunk of time (1 or 2 weekends) to bang it out.

Hi Paul,

I'd like to assist so I've written a little script to automatically
extract all the links from
It depends on Ruby, wget, lynx and sed.

The output is tab separated csv (tsv) containing three fields per row:
text, urls and category.
Some of the <li>s contain more than one url. For these the urls are
separated by blanks ' '.
The category is either the title (<h3>) of the subpage or the text of
the list item that contains the current <ul>.
The script expects at least one url from as arguments
(one of the subpages). So you'll also need a working internet connection.
A '-H' prints an additional csv header.
I also attached a bash script to extract all the subpages from
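
For anyone who wants to read the output back in, a minimal sketch in Ruby might look like the following. The method name parse_tsv and the sample row are made up for illustration; only the three-field layout and the blank-separated urls come from the description above.

```ruby
# Hypothetical helper: parse the script's tab-separated output into hashes.
# Assumes three fields per row (text, urls, category) and that several urls
# within one row are separated by blanks.
def parse_tsv(text, header: false)
  rows = text.lines.map(&:chomp).reject(&:empty?)
  rows.shift if header # drop the "Text\tUrls\tCategory" line printed by -H
  rows.map do |row|
    text_field, urls, category = row.split("\t", 3)
    { "text" => text_field, "urls" => urls.to_s.split(" "), "cat" => category }
  end
end

# Made-up sample data, not taken from the real pages.
sample = "Text\tUrls\tCategory\n" \
         "Foo Tracker\thttp://example.org/foo http://example.org/foo/mirror\tTrackers\n"

parse_tsv(sample, header: true).each do |entry|
  puts "#{entry['cat']}: #{entry['text']} (#{entry['urls'].size} urls)"
end
```

Splitting with a limit of 3 keeps any stray tab inside the category field instead of silently dropping it.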

If you have any problems with this script I can send you all the data
off list.



P.S. Follow-up to LAU?

#!/usr/bin/env ruby

# This little piece of software is free in every sense of the word.
# Mon, 19 Apr 2004, Jan Weil <>

# print usage and quit if help was requested or no url was given
if ARGV.include?("-h") || ARGV.include?("--help") || ARGV.size == 0
        puts "usage: #{File.basename($0)} [-H] URL..."
        puts "-H --header\tadd csv header"
        exit
end

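if ARGV.include?("-H") || ARGV.include?("--header")
        $print_header = true
        # remove the flag so it is not treated as a url below
        ARGV.delete("-H")
        ARGV.delete("--header")
end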

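# Strip lynx link markers ("[12]text") from str in place and return the
# referenced urls joined by blanks, or false if str contains no marker.
def extract_urls(str)
        urls = []
        url_regex = /\[(\d+)\](\S.+)/
        loop do
                if str =~ url_regex
                        # look up the url in the link legend
                        urls.push($reference[$1.to_i])
                        str.sub!(url_regex){|s| $2}
                else
                        break
                end
        end
        if not urls.empty?
                return urls.join(" ")
        else
                return false
        end
end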

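# Collect the continuation lines of a list item, store the item if it
# contains at least one url and remember its text as the next category.
def push_li(line, level, regex)
        next_line = ""
        loop do
                next_line = $lines.pop
                if next_line =~ regex
                        line += " #{$1}"
                else
                        # not a continuation line: put it back on the stack
                        $lines.push(next_line) if next_line
                        break
                end
        end
        urls = extract_urls(line)
        $data.push({"text" => line, "urls" => urls, "cat" => $cat[level] || "None"}) if urls
        $cat[level+1] = line
end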

ARGV.each do |url|
        $reference = []
        $cat = []
        $data = []

        # XXX this works, at least for
        url =~ /(\w+\.\w+)$/
        loc = $1 or raise("Help me at XXX!")

        `wget #{url}`
        if $? != 0
                STDERR << "calling wget failed! Is it installed?\n"
                exit 1
        end

        tmp = loc + ".dump"

        # unset locales (we need ^References$)
        ENV["LANG"] = "C"

        `lynx -dump #{loc} > #{tmp}`
        if $? != 0
                STDERR << "calling lynx failed! Is it installed?\n"
                exit 1
        end

        # extract link list (legend)
        out = `sed -n '/^References$/,$p' #{tmp} | sed -n '3,$p'`.split(/$/)
        if $? != 0
                STDERR << "calling sed failed! Is it installed?\n"
                exit 1
        end

        # map each reference number to its url
        out.each do |line|
                ary = line.split
                $reference[ary[0].to_i] = ary[1]
        end

        # extract data
        $lines = `sed -n '1,/^References$/p' #{tmp}`.split(/$/)


        # we need a stack (pop must return the lines in document order)
        $lines.reverse!

        # traverse all lines
        loop do
                line = $lines.pop
                break if not line
                # title
                if line =~ /^ (\S.*)$/
                        $cat[1] = $1
                end
                # li level 1
                if line =~ / \* (\S.*)$/
                        line = $1
                        push_li(line, 1, /^ (\S.*)$/)
                end
                # li level 2
                if line =~ / \+ (\S.*)$/
                        line = $1
                        push_li(line, 2, /^ (\S.*)$/)
                end
                # li level 3
                if line =~ /^ o (\S.*)$/
                        line = $1
                        push_li(line, 3, /^ (\S.*)$/)
                end
                # li level 4
                if line =~ /^ # (\S.*)$/
                        line = $1
                        push_li(line, 4, /^ (\S.*)$/)
                end
                # there is no higher level, right?
        end

        # sort by category first, then by text
        $data.sort! do |a, b|
                if a["cat"] == b["cat"]
                        ret = a["text"] <=> b["text"]
                else
                        ret = a["cat"] <=> b["cat"]
                end
                ret
        end

        print "Text\tUrls\tCategory\n" if $print_header
        $data.each do |hash|
                print "#{hash['text']}\t#{hash['urls']}\t#{hash['cat']}\n"
        end
end
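
The two sed calls could probably be replaced with plain Ruby as well. What follows is only a sketch under assumptions, not part of the script above: the method names parse_references and resolve_markers and the sample dump are invented, and it assumes the lynx dump contains a line reading exactly "References" followed by entries of the form "1. http://...".

```ruby
# Hypothetical pure-Ruby variant of the sed-based extraction.
# Assumes `dump` is the text produced by `lynx -dump` with LANG=C, so the
# link legend starts at a line reading exactly "References".

# Build a table mapping reference numbers to urls.
def parse_references(dump)
  legend = dump.split(/^References$/, 2)[1].to_s
  refs = {}
  legend.each_line do |line|
    refs[$1.to_i] = $2 if line =~ /^\s*(\d+)\.\s+(\S+)/
  end
  refs
end

# Remove "[n]" markers from a line and collect the urls they refer to.
def resolve_markers(line, refs)
  urls = []
  text = line.gsub(/\[(\d+)\]/) do
    urls << refs[$1.to_i]
    ""
  end
  [text, urls.compact]
end

# Invented sample, shaped like a lynx dump.
dump = <<~DUMP
     * [1]Foo Tracker ([2]mirror)

  References

     1. http://example.org/foo
     2. http://example.org/foo/mirror
DUMP

refs = parse_references(dump)
text, urls = resolve_markers("[1]Foo Tracker ([2]mirror)", refs)
puts "#{text}\t#{urls.join(' ')}"
```

Using a hash for the legend avoids the nil gaps that a number-indexed array gets when the reference numbers are sparse.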


This archive was generated by hypermail 2b28 : Mon Apr 19 2004 - 20:50:16 EEST