-
Notifications
You must be signed in to change notification settings - Fork 2
/
README
219 lines (207 loc) · 7.32 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
***************************************
* HTTP domain crawler *
***************************************
This is probably one of the most stupid things I ever made,
but I don't like to drop already typed code.
It is an HTTP crawler which looks for domains in <a> tags
and sotres them into a sqlite database next to their IP address.
It works recursively among the stored domains.
So in theory it can walk arround all the existing Internet.
Warning: It is a proof of concept! Very inneficient and probably buggy.
Dependences: pyton2, beautifulsoup, sqlite3, urlib
The usage is very simple, here an example:
p4u@eni4c:~/python-git/domainCrawler (master)$ python2 domainCrawler.py
Domain Crawler
Usage: domaincrawler <option> [params]
Options:
-n <filename> : Creates new database named filename
-c <database> <URL> <#threads> : Starts crawling process from given URL using database
-a <database> : List all domains stored in database
-h : Shows this help
p4u@eni4c:~/python-git/domainCrawler (master)$ python2 domainCrawler.py -n domains
New database created at file domains
p4u@eni4c:~/python-git/domainCrawler (master)$ python2 domainCrawler.py -a domains
p4u@eni4c:~/python-git/domainCrawler (master)$ python2 domainCrawler.py -c domains http://slashdot.com 12
## STARTING MAIN THREAD ##
|Crawling: http://slashdot.com|
## STARTING THREADS ##
|----------------------------
| Active threads: 2
| Queue size: 0
|----------------------------
|----------------------------
| Active threads: 2
| Queue size: 0
|----------------------------
|----------------------------
| Active threads: 2
| Queue size: 0
|----------------------------
|Crawling: http://ad.doubleclick.net|
|Crawling: http://techreport.com|
|----------------------------
| Active threads: 2
| Queue size: 33
|----------------------------
|Crawling: http://networkworld.com|
|Crawling: http://www.networkworld.com|
|Crawling: http://www.medicaldaily.com|
|Crawling: http://singularityhub.com|
|Crawling: http://thenextweb.com|
|Crawling: http://motherboard.vice.com|
|Crawling: http://theintelhub.com|
|Crawling: http://www.nature.com|
|Crawling: http://slashdot.org|
|Crawling: http://torrentfreak.com|
|Crawling: http://bgr.com|
|----------------------------
| Active threads: 12
| Queue size: 26
|----------------------------
|Crawling: http://www.ornl.gov|
|Crawling: http://paritynews.com|
|----------------------------
| Active threads: 12
| Queue size: 30
|----------------------------
|Crawling: http://tech.slashdot.org|
|Crawling: http://www.ubuntuvibes.com|
Cannot resolve lib1.isd.ornl.gov:8182
|----------------------------
| Active threads: 12
| Queue size: 35
|----------------------------
|----------------------------
| Active threads: 12
| Queue size: 35
|----------------------------
[CTRL+C]
p4u@eni4c:~/python-git/domainCrawler (master)$ python2 domainCrawler.py -a domains
ad.doubleclick.net -> 209.85.148.148
techreport.com -> 74.200.65.212
networkworld.com -> 65.214.57.165
www.networkworld.com -> 88.221.203.47
www.medicaldaily.com -> 209.116.59.144
singularityhub.com -> 199.27.134.217
thenextweb.com -> 184.106.146.102
motherboard.vice.com -> 184.73.209.110
theintelhub.com -> 208.109.162.209
www.nature.com -> 213.248.113.192
slashdot.org -> 216.34.181.45
torrentfreak.com -> 173.193.242.225
bgr.com -> 72.233.2.58
www.ornl.gov -> 128.219.176.20
paritynews.com -> 174.122.46.8
tech.slashdot.org -> 216.34.181.48
www.ubuntuvibes.com -> 209.85.148.121
roblimo.com -> 216.239.38.21
richarddawkinsfoundation.org -> 174.129.212.2
richarddawkins.net -> 75.101.145.87
en.wikipedia.org -> 91.198.174.225
www.securityweek.com -> 173.203.107.14
www.itu.int -> 156.106.202.5
twitter.com -> 199.59.150.39
www.facebook.com -> 66.220.152.32
rss.slashdot.org -> 209.85.148.121
freshmeat.net -> 216.34.181.101
feedproxy.google.com -> 209.85.148.118
freecode.com -> 216.34.181.104
geek.net -> 216.34.181.202
geeknetmedia.com -> 216.34.181.238
slashdot.jp -> 202.221.179.13
www.doubleclick.com -> 209.85.148.102
www.google.com -> 173.194.78.103
qosim.doubleclick.net -> 216.73.92.91
www.twitter.com -> 199.59.150.39
t.co -> 199.16.156.11
virtacore.com -> 69.65.110.234
techreport.pricegrabber.com -> 64.156.13.50
www.mars.com -> 213.248.113.202
arabicedition.nature.com -> 210.168.91.210
www.directive21.com -> 199.27.134.133
blogs.nature.com -> 199.168.13.95
www.purestcolloids.com -> 50.97.178.200
www.natureasia.com -> 210.168.91.210
network.nature.com -> 213.248.113.194
www.therawfoodworld.com -> 67.199.120.50
www.sbkb.org -> 128.6.20.132
doublegain.leanbodrep.hop.clickbank.net -> 74.63.153.62
btguard.com -> 209.44.102.196
bit.ly -> 69.58.188.39
a.collective-media.net -> 88.221.236.74
www.theintelhub.com -> 208.109.162.209
ryandownie.com -> 209.240.81.170
www.shadethemotionpicture.com -> 208.109.162.209
resources.networkworld.com -> 65.221.110.118
www.youtube.com -> 209.85.148.136
itjobs.networkworld.com -> 64.13.138.107
creativecommons.org -> 72.51.46.230
solutioncenters.networkworld.com -> 65.221.110.105
abcnews.go.com -> 68.71.212.40
sustainability-ornl.org -> 128.219.14.56
latino.foxnews.com -> 213.248.113.178
ut-battelle.org -> 128.219.176.20
cloudflare.com -> 173.245.60.249
events.networkworld.com -> 23.23.121.255
neutrons.ornl.gov -> 128.219.176.24
www.nytimes.com -> 170.149.168.130
jobs.ornl.gov -> 128.219.176.20
events.networkworld.com -> 23.21.174.147
www.linkedin.com -> 216.52.242.80
www.squidoo.com -> 64.225.155.21
www.flickr.com -> 87.248.105.216
m.networkworld.com -> 50.57.117.239
m.networkworld.com -> 50.57.117.239
www.americanpearl.com -> 68.142.205.137
www.techwords.com -> 64.124.14.63
www.singularityweblog.com -> 50.57.215.80
www.ers.usda.gov -> 162.79.16.192
www.blogger.com -> 209.85.148.191
careers.idg.com -> 63.211.160.21
www.jpss.noaa.gov -> 140.90.33.21
www.networkworldmediakit.com -> 66.96.147.104
www.jpost.com -> 94.127.75.170
www.ctvnews.ca -> 213.248.113.179
feeds.feedburner.com -> 209.85.148.118
www.monkey.org -> 75.102.5.19
www.AWAKEN.com -> 74.220.215.227
www.passagesmalibu.com -> 72.52.210.179
spectrum.ieee.org -> 213.248.113.177
go.usa.gov -> 216.128.241.32
3.bp.blogspot.com -> 209.85.148.139
www.thesundaytimes.co.uk -> 213.248.113.210
www.oakridge.doe.gov -> 198.207.239.8
truejay.com -> 72.32.231.8
alt.engadget.com -> 195.93.85.49
www.stylesurge.com -> 64.13.239.189
images.medicaldaily.com -> 68.232.35.229
lenashagieva.com -> 173.203.204.123
www.thisrecording.com -> 65.39.205.54
www.singularityhub.com -> 66.240.232.112
www.nih.gov -> 137.187.25.43
www.itworld.com -> 23.21.168.45
tiheum.deviantart.com -> 199.15.160.100
www.techhive.com -> 70.42.185.60
bldgblog.blogspot.com -> 209.85.148.132
matrixsynth.blogspot.com -> 209.85.148.132
www.cdc.gov -> 213.248.113.211
negrophonic.com -> 69.163.184.27
www.counselheal.com -> 209.116.59.144
thoughtcatalog.com -> 76.74.254.120
www.ericmmartin.com -> 204.197.248.152
betanews.com -> 69.65.109.143
science.energy.gov -> 205.254.148.104
thediplomat.com -> 69.164.223.86
www.devour.com -> 50.23.249.68
techcrunch.com -> 74.200.247.61
1.bp.blogspot.com -> 209.85.148.139
www.economist.com -> 64.14.173.20
code.google.com -> 209.85.148.139
thesocietypages.org -> 69.164.207.151
c3046032.r32.cf0.rackcdn.com -> 146.82.2.122
www.properlydecent.com -> 46.19.38.61
c3414097.r97.cf0.rackcdn.com -> 146.82.2.99
www.naturenplanet.com -> 209.116.59.158
p4u@eni4c:~/python-git/domainCrawler (master)$ ls
domainCrawler.py domains
p4u@eni4c:~/python-git/domainCrawler (master)$