LabWindows/CVI


Downloading HTML source code from a URL using DataSocket capabilities

Hello members,

 

I would like to use the DataSocket capabilities to download the HTML source code from a pre-specified URL.

The problem I am facing is cookie handling. The URL from which I want to download the source code is only available with cookies enabled.

 

When I connect to the URL using Firefox with cookie handling disabled, the URL just returns a "loading in progress" page.

When cookies are enabled, the URL returns the real page I am after.

 

I've searched through the forum and couldn't find an answer.

 

Is it possible for NI DataSocket to handle cookies, or is it a different problem?

 

In a previous thread there was an example program for reading a website. Maybe one of you could help me out and has a clue about how to solve the problem.

 

Cheers and thanks in advance

Michael

Message 1 of 11

what if cookies are enabled but no cookie has been set yet? does reading the page redirect you to the final page, setting the cookie along the way?

may we know which website you are trying to access so that we can test?

Message 2 of 11

Hello again,

 

I don't really know what you meant by enabling / setting a cookie. When I disable/block cookies in my browser (Firefox), the page reloads up to 3 times with a "loading in progress" statement. After 3 reloads, the page gives an error saying that I have to enable cookies in my browser.

 

The thing is that I am working on my thesis "stochastical mathematics in sporting bets". For that I need to get as much data as possible from the page.

The URL is www.betbrain.com. This page gives an overview of a couple of bookmakers and is the fundamental source for playing around with the mathematical background.

 

Thanks for your help.

 

Cheers Michael

Message 3 of 11

ouch, i just ran some tests with this site. that's awful... you will need something more powerful than a datasocket to read a page here.

 

when trying to access the root directory, it answers with a page (the infamous "loading, please wait...") which contains an html meta refresh, which redirects to another identical page, with another meta refresh... this looks a lot like a forced delay when opening the site; i bet it is there to prevent users from refreshing their bets too often. after 3 reloads, it redirects (still using a meta refresh) to another site which sets the offending cookie using javascript (did they bother to read the HTTP RFC even once?). i stopped there; it's already complicated enough just to get the index page.
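
To see this kind of redirect chain for yourself, a small script can report the meta refresh tags in a saved page. This is only a sketch in Python using the standard library; the names `RefreshFinder` and `find_meta_refresh` are made up for illustration, and it assumes the refresh is written as an ordinary `<meta http-equiv="refresh" content="...">` tag.

```python
import re
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Collects (delay_seconds, url) for every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "2; url=/next" or just "2".
        content = attributes.get("content") or ""
        delay_text, _, rest = content.partition(";")
        try:
            delay = float(delay_text.strip() or 0)
        except ValueError:
            delay = 0.0
        match = re.search(r"url\s*=\s*(.+)", rest, re.IGNORECASE)
        url = match.group(1).strip().strip("'\"") if match else None
        self.targets.append((delay, url))


def find_meta_refresh(html):
    """Return the list of meta refresh targets found in an HTML string."""
    finder = RefreshFinder()
    finder.feed(html)
    return finder.targets
```

A page with no such tag gives an empty list; a tag without a `url=` part (a plain reload of the same page) is reported with `None` as the target.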

 

now, have you tried using some tools made for this job? i think "wget" (a unix command line tool to download entire sites) should allow you to download the index page. other tools exist which should do the job too. it is still possible to write a program which connects to this site; if you are interested, i may dig a little deeper (but i will not use a datasocket: there are a lot of functions which suit the job better than a datasocket).

Message 4 of 11

Hello,

 

well, that doesn't sound that good :-(. I was reading a bit on the net, but a good solution doesn't come to mind.

I thought about something like controlling an ordinary browser over DDE or something similar.

But Firefox and Opera don't seem to have good interfaces for doing this.

 

But my experience is not that good. I couldn't even find the tools that you mentioned.

I'm using a Windows-based system, so the unix commands are not useful for me :-(.

 

What would you use instead of datasockets? I thought that the only way of doing this with CVI was to use datasockets.

 

Thanks in advance

 

Cheers

Michael

Message 5 of 11

Hi me again :-),

 

I have found a Windows version of wget. It seems to be really powerful. The problem is that it is really not easy to get the page.

And to be honest I've no experience in using that tool.

Message 6 of 11

i tried and had to cheat a little bit... (i hate websites which use javascript to set cookies!)

 

attached to this post is a file containing a single cookie: i hope it will live long enough to be useful to you. save this file on your disk then use wget with this command line:

wget  --load-cookies cookies.ini --save-cookies cookies.ini --keep-session-cookies www.betbrain.com

 

this should allow you to retrieve the index page and work with it. there are many command line options in wget; i encourage you to play with them a bit.
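
If you would rather launch that wget call from a program than type it, something like the following Python sketch should do. It assumes wget is on your PATH; the function names are made up for illustration, and the `-O` option (write the page to a named file) is my addition, not part of the command line above.

```python
import subprocess


def build_wget_command(url, cookie_file="cookies.ini", output_file="index.html"):
    """Build the argument list for the wget call described in this post."""
    return [
        "wget",
        "--load-cookies", cookie_file,   # read existing cookies before the request
        "--save-cookies", cookie_file,   # write cookies back when wget exits
        "--keep-session-cookies",        # also keep cookies without an expiry date
        "-O", output_file,               # assumption: save the page under a fixed name
        url,
    ]


def fetch_page(url, cookie_file="cookies.ini", output_file="index.html"):
    """Run wget and return the downloaded page as text."""
    command = build_wget_command(url, cookie_file, output_file)
    # check=True raises CalledProcessError if wget exits with a non-zero status.
    subprocess.run(command, check=True)
    with open(output_file, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Passing the arguments as a list avoids any quoting problems with the shell, which matters on Windows as soon as a path contains spaces.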

Message 7 of 11

Hi again,

 

thanks for your help. I tried the cookie with the wget command. It doesn't seem to work. I only retrieve the standard "loading ..." page.

How did you generate that cookie? I was playing with that wget tool nearly the whole night. I also tried to use the cookie file from Firefox, but every time I only get the same useless page.

 

Just to understand what is happening in here. 

 

1. I connect to the server.

2. The server sets a new cookie via JavaScript.

3. I should use that cookie and should retrieve the html code of that page?

 

When I try to get that cookie like this "wget -E --keep-session-cookies --save-cookies=C:\cookies.ini http://www.betbrain.com", I only retrieve a cookie file like the file below.

 

 

Cheers Michael

 

 

 

Message 8 of 11

the first cookie is set by javascript, inside the HTML. it looks like this:

<body onLoad="document.cookie='cDRGN' + '=' + '150544890';">

 

so you need a cookie which looks like cDRGN=150544890. it seems they are making some validity checks on this cookie. open the file you retrieved, look at the source and you will find the value which is valid for your connection. then edit the cookie file attached to my previous post: you will find a single line containing the cDRGN cookie; replace its value. now use the --load-cookies option as i described in my previous post, and everything should be fine. i was able to retrieve the main page multiple times without having to edit the cDRGN cookie again.

 

if you feel the need for a more automated way of doing this, use your preferred programming language to launch wget, then open the output file, search for onLoad="document.cookie=' and launch wget again using the --header option, which lets you insert the cookie directly into the request headers. (the header to send is Cookie; Set-Cookie is what a server sends back. the syntax of these headers is not straightforward: see RFC2965 for an authoritative description. of course, this requires that you understand the basics of the HTTP protocol, which is described in RFC2616: a must-read for anyone messing with this kind of stuff)
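
As a rough illustration of that automated approach, here is a Python sketch. It assumes the page sets the cookie exactly in the `'name' + '=' + 'value'` form shown above (a different page layout would need a different pattern), and the function names are invented for this example.

```python
import re

# Matches: document.cookie='NAME' + '=' + 'VALUE'
COOKIE_PATTERN = re.compile(
    r"""document\.cookie\s*=\s*'([^']+)'\s*\+\s*'='\s*\+\s*'([^']+)'"""
)


def extract_js_cookie(html):
    """Return (name, value) of the cookie set in the onLoad handler, or None."""
    match = COOKIE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)


def cookie_header(name, value):
    """Build the request header line a client sends to present a cookie."""
    return "Cookie: {}={}".format(name, value)


def build_wget_header_command(url, name, value):
    """Argument list for a second wget run that sends the cookie directly."""
    return ["wget", "--header", cookie_header(name, value), url]
```

The idea is to run wget once, feed the saved file to `extract_js_cookie`, then run the command from `build_wget_header_command` (for instance with `subprocess.run`) to get the real page.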

Message 9 of 11

Hello,

 

thanks very much, that helped.

I already wrote a program to generate the dummy cookie in the way you described. It works fine.

 

The only thing I would like to know is what this line means, especially the TRUE / FALSE statements?

 

"www.betbrain.com    TRUE    /    FALSE    1317108021    cDRGN    150544890"

 

Thanks for your support.

 

Cheers Michael

Message 10 of 11
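
For reference, the line quoted in the last post appears to follow the cookies.txt layout (often called the Netscape cookie file format) that wget reads and writes. As far as I know the seven fields are: domain, a TRUE/FALSE flag saying whether the cookie also applies to subdomains, path, a TRUE/FALSE flag saying whether it is only sent over secure (HTTPS) connections, expiry time in Unix seconds, cookie name, and cookie value. The Python sketch below splits such a line into named fields; the names `CookieLine` and `parse_cookie_line` are invented here, and it splits on any whitespace because the tabs of the original file do not survive in the post.

```python
from collections import namedtuple
from datetime import datetime, timezone

CookieLine = namedtuple(
    "CookieLine",
    "domain include_subdomains path secure expires name value",
)


def parse_cookie_line(line):
    """Split one cookie-file line into its seven fields."""
    # Assumption: no field is empty or contains spaces. The real file is
    # tab-separated, so split("\t") would be the stricter choice there.
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got {}".format(len(fields)))
    domain, subdomains, path, secure, expires, name, value = fields
    return CookieLine(
        domain=domain,
        include_subdomains=(subdomains == "TRUE"),
        path=path,
        secure=(secure == "TRUE"),
        # The expiry is stored as seconds since 1970-01-01 UTC.
        expires=datetime.fromtimestamp(int(expires), tz=timezone.utc),
        name=name,
        value=value,
    )
```

Feeding it the quoted line gives `secure=False` (the cookie is also sent over plain HTTP), `path="/"`, and the name/value pair cDRGN / 150544890.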