Can't I already track organic traffic from non-Google search engines with Google Analytics?

Relevant results: Users expect search engines to provide them with accurate and relevant results for their queries. The search engine should be able to understand the user's intent and provide the most useful results.

Speed: Users want search results to be returned quickly, without any delay or lag. This requires a fast and efficient search engine that can quickly crawl and index the web.

Customization: Users may want to customize their search experience, such as filtering results by date, language, or location. A search engine that offers customization options can provide a better user experience.

Privacy: Users are becoming increasingly concerned about their online privacy, and they may want a search engine that doesn't track their searches or collect their personal information.

User-friendly interface: The search engine should have an easy-to-use interface that allows users to quickly enter their queries and navigate the search results.

Multimedia search: Many users search for images, videos, and other types of media. A search engine that can search and display these types of results can provide a more comprehensive search experience.

Mobile optimization: With more and more people using mobile devices to access the internet, a search engine that is optimized for mobile devices can be very useful.

These are just a few features that people may look for in a search engine, and there are many others depending on the user's needs and preferences.

People,

Give me feedback on these two methods for making a great search engine.

  1. Listed links should earn from the activities of visitors, even if those visitors are:

A. non-buyers;
B. back button hitters;
C. competing links.

  2. Keyword searchers should earn in some way from their search activities.

I am building my features around these two ideas, to dream up new features for a search engine. What about you? Or do you have better ideas?

Programmers,

Why don't we brainstorm unique features for a search engine? Each of us can work on our own thought-up feature and share the code here for others to learn from. Of course, the code should be free, with no strings attached.
Then each of us, on our own end, can compile all the features and code found on this thread, tailor the code here and there, and build our own unique search engines.
You never know, this forum might even adopt our features!

So, to begin with, can you think up features that you reckon will make it more interesting than the big boys out there? Let's give them a bloody nose!

commented: Maybe you know that I want more than search. Look at GPT and how we now can get answers. +0

@jacksonshirley

Thank you for your in-depth post. I only read it now, three weeks later, as I had missed it until now.

@rprofitt

bing.com uses ChatGPT 3 or 4; I saw it in a YouTube vid last night. Do you reckon ChatGPT will kill the search engines?
Leaving aside ChatGPT and AI stuff, what do you miss in a search engine?

commented: GPT is not a search engine. But some think it is. It is however a threat to search engines. Maybe someone else will write at length on that. +0

Google and other search engines suffer from GIGO (garbage in, garbage out). Until a few years ago folk mostly put up mid to great content on the web, but SEO meant that many thought spamming the web with garbage was a way to get rank or hits. There is so much garbage on the web now that the old way of indexing it is fairly broken.

GPT is not search. Have you thought about that?

Text-to-text search, speech-to-text (voice) search, and image search are likely to be seen in a search engine. When articles are published, they are automatically submitted to search engines and show up when someone uses a search engine for research.

@jawass

Thanks for your suggestions.

Pro Gurus,

My PHP search engine project is finished, but my web crawler project is not.
Anyway, I have decided that, on my search engine, I will add a link for you to submit your website.
But since my SE will be unknown, it would be silly to expect the world to start submitting their links and have my index built that way. That is why I am building the PHP crawler.
Anyway, I have decided that my web crawler will not crawl the web the old-fashioned way, where I point it at a link and it follows all links and finds its way to other domains. Untidy.
Instead, I am going to program the crawler not to wander off to domains other than the one I set it to.
So, I will first manually set it to one domain's XML sitemap: one URL only. Then it will stay on that domain and extract and crawl all of the site's pages.
Then it will move on to the next domain's XML sitemap on the list and do the same. A rough sketch of that first step is below.
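To show the idea, here is a minimal sketch, not my finished crawler: fetch one domain's sitemap and pull out the page URLs with SimpleXML. The sitemap URL is just a placeholder.

```php
<?php
// Minimal sketch: fetch one domain's XML sitemap and collect every <loc> URL.
// The sitemap URL below is only a placeholder, not a real target.
$sitemapUrl = 'https://example.com/sitemap.xml';

$xmlString = @file_get_contents($sitemapUrl); // assumes allow_url_fopen is enabled
if ($xmlString === false) {
    exit("Could not fetch sitemap: $sitemapUrl\n");
}

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    exit("Could not parse sitemap XML\n");
}

// Sitemaps use a default namespace, so register it for XPath queries.
$xml->registerXPathNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

$pageUrls = [];
foreach ($xml->xpath('//sm:loc') ?: [] as $loc) {
    // In a urlset these are page URLs; in a sitemap index they are child sitemap URLs.
    $pageUrls[] = (string) $loc;
}

print_r($pageUrls); // the on-domain pages the crawler would fetch next
```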
Now, the question is: how can I find the XML sitemap links of all the websites in existence?
What is your method for finding this out?
I have my own method, and I will relate to you what it is. You tell me whether it is flawed, or whether there is a more efficient way to crawl the web than the one I have chosen.

I had in mind to run my own DNS cache. That way, I would get hold of all the domains in operation across the globe. But dealing with BIND is too technical for me. So, do you know of any Windows freeware/shareware/GPL etc. alternatives instead?
If not, then I will have no choice but to buy the domains list from here:
https://domains-monitor.com/domainzones/
(Note, no affiliate links. Plus, it's not my website).

After I have downloaded all the active domains, I will use the techniques mentioned in the following tutorial links on how to find a website's sitemap. I will program the PHP crawler to use those techniques to generate URLs (possible sitemap URLs for each domain).
Then I will have to program the PHP crawler to visit the URLs it generated and see whether they are valid. I will open a new thread in the PHP section to get help writing PHP code for a crawler to test whether a URL is live or dead (exists or not). A rough sketch of the idea is below.
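Something like this is what I have in mind. It is just a sketch: the findLiveSitemap() helper is a made-up name, and the candidate sitemap paths are common guesses, not a complete list.

```php
<?php
// Sketch: given a domain, guess a few common sitemap locations and test each
// one with a HEAD request. Candidate paths are common guesses, not a full list.
function findLiveSitemap(string $domain): ?string
{
    $candidates = ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz'];

    foreach ($candidates as $path) {
        $url = 'https://' . $domain . $path;

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request, no body
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($status === 200) {
            return $url; // a live (existing) sitemap URL
        }
    }
    return null; // nothing found at the common locations
}

var_dump(findLiveSitemap('example.com'));
```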
It is not that hard for me to build a .exe crawler (desktop software) for Windows. So, while I am still learning how to build the web version (.php), I might as well have the desktop crawler crawl the web and harvest links. Then, when my website is up and running, I can upload the URL list to my website to build the search engine index.
Originally, I did not want to build a .exe crawler, as I did not want to keep my home computer on 24/7. I thought it best to use the PHP crawler, which will be installed on my paid hosted VPS. But I have tonnes of internet data saved on my phone SIM. Guess how much data?
I buy data in GB every week. Actually, for 2-3 years I have been buying 10GB/wk, and sometimes they give another 10GB/wk as a bonus. It only costs me around $3 USD / £2 GBP / €2.50. In your countries, how much would it cost you to buy 10GB, and does your ISP give you another 100% bonus?
Anyway, I actually use about 4GB/wk out of the 10GB/wk. So, if I renew the pack, my saved data (6GB) gets rolled over. That is how I managed to save 100GB. But once I forgot to renew and, wham, I lost all that 100GB!
This happened to me three times in one year, I think in 2021. So I lost 300GB that had been saved by rolling over each week!
Anyway, this time I have been careful for nearly 1.5 years and have lost no rolled-over data. Guess how much I have managed to save of this rolled-over data this time? Let me check my cell phone. One moment ...

1036946.37MB

So, that is approx 1TB.
Let me know how much data you have managed to get rolled over (unused data) like this on your mobile phone SIM.
Oh, and recently I lost approximately another 50GB of bonus data that I did not finish using.
Anyway, on average, how big is a website in MB/GB/TB? Let me calculate how many websites, on average, my .exe desktop crawler will be able to spider before I run out of all the data that has rolled over unused for nearly 1.5 years.
Yes, we do have broadband, which the household mostly uses to browse YouTube. I am planning to stop buying data for my mobile SIM and rely on home broadband. But if I quit buying the data bundle on the SIM, I will immediately lose the 1TB of data that rolled over during the past 1.5 years. That is why I am now planning to harvest links for my SE index using the desktop crawler and get it to use up all the saved data on my mobile SIM. Once that data has run out, and once the PHP crawler is finished, I can run the PHP crawler on my web host's side.

Any advice?

Thanks for reading my search engine story.

This just in: AI-Powered Visual Web Crawler.

Seems like those old school web crawlers may be in for a fight!

@rprofitt

Thanks, mate! You are the man! I am checking it out now!
I get the feeling you are into AI stuff!
But is the crawler 100% freeware, or is it shareware?

@reverend_jim

Do you know of any good BIND alternatives for Windows? I have got to run my own DNS cache just to download all active domains and their email addresses in the zones.

@dani

What is your advice? How can I get hold of all the active domains in the world for free?
Or should I just go and download them from:

https://domains-monitor.com/domainzones/

And do you know of any good AI crawlers (freeware/open source/GPL, etc.), be they .exe or .php?

It sounds to me like you are bypassing the very nature of the world wide web, which is that it is an interconnected web of links pointing between and across websites. A long list of every known domain is going to be a lot of radio static + garbage to weed out.

Keep in mind, XML sitemap files can contain URLs for different domains (owned by the same Google search console account).

You can discover XML sitemap files by checking the robots.txt for a domain name. Sometimes, it's included there. If not, you will still want to be a good crawler and follow robots.txt directives (meaning, don't crawl any URLs disallowed in a domain's robots.txt file).
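To illustrate, here's a rough sketch in PHP of pulling any Sitemap: lines out of a domain's robots.txt (the domain is just a placeholder):

```php
<?php
// Sketch: read a domain's robots.txt and collect any "Sitemap:" lines,
// which is one way to discover its XML sitemaps. The domain is a placeholder.
$domain = 'example.com';
$robots = @file_get_contents("https://$domain/robots.txt");

$sitemaps = [];
if ($robots !== false) {
    foreach (preg_split('/\r\n|\r|\n/', $robots) as $line) {
        // The directive name is case-insensitive; the URL follows the colon.
        if (preg_match('/^\s*sitemap:\s*(\S+)/i', $line, $m)) {
            $sitemaps[] = $m[1];
        }
    }
}

print_r($sitemaps); // zero or more sitemap URLs declared by the site
```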

@dani

The robots.txt directives are too messy. Building a bot that abides by them is over this beginner-level programmer's head. Unless, of course, you yourself find it interesting and chime in.

I thought XML sitemaps are built for crawlers. If so, then they will only list the links the site wants crawled, right? In that case, there is no need for my crawler to deal with the robots.txt file. Cutting to the chase.

For now, don't fret about robots.txt. It's a request, not a mandate. From https://wiki.archiveteam.org/index.php?title=Robots.txt:

The purpose and meaning behind the creation of ROBOTS.TXT file dates back to the early 1990s, when the then-new World Wide Web was quickly defining itself as the killer application that would change forever how users would interact with the growing internet. Where previous information networks utilizing internet connections such as GOPHER and WAIS were text-based and relatively low-bandwidth, the combination of text, graphics and even sounds on webpages meant that resources were stretched to the limit. It was possible, no joke, to crash a machine with a Netscape/Mozilla web browser, as it opened multiple connections to web servers and downloaded all their items - the optimizations and advantages that 20 years of innovation have brought to web serving were simply not there. As crawlers/spiders/search engines came into use, the potential to overwhelm a site was great. Thus, Martijn Koster is credited with the Robot Exclusion Protocol, also known simply as the ROBOTS.TXT file.

commented: What would I do without you, MATE? You are the MAN! +0

I thought XML sitemaps are built for crawlers. If so, then they will only list the links the site wants crawled.

If you're just going to crawl pages in sitemap files, then yeah, I think it's safe to ignore robots.txt.

As for the directives being too messy to navigate: I personally have experience using the RobotsTxtParser and RobotsTxtValidator libraries by Eugene Yurkevich. Here is a link to them on GitHub.
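If those feel like overkill, the core idea of honoring a Disallow rule can be sketched by hand. This is just an illustration with a made-up isDisallowed() helper and a simple prefix match; it ignores Allow rules, wildcards, and per-bot groups, which real parsers (like the libraries above) have to handle.

```php
<?php
// Simplified illustration only: check a URL path against the "Disallow:" rules
// in the group that applies to all user-agents (User-agent: *). Real robots.txt
// parsing also has to handle Allow:, wildcards, and per-bot groups.
function isDisallowed(string $robotsTxt, string $path): bool
{
    $inStarGroup = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '') {
            continue;
        }
        if (preg_match('/^user-agent:\s*(.+)$/i', $line, $m)) {
            $inStarGroup = (trim($m[1]) === '*');
        } elseif ($inStarGroup && preg_match('/^disallow:\s*(\S*)/i', $line, $m)) {
            $rule = $m[1];
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true; // the path starts with a disallowed prefix
            }
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /private/";
var_dump(isDisallowed($robots, '/private/page.html')); // bool(true)
var_dump(isDisallowed($robots, '/index.html'));        // bool(false)
```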

rproffitt is correct in that adhering to them is not legally required, in most cases (meaning there's not necessarily going to be a legal court case against you). However, not adhering to them is a very easy way for your crawler to get blacklisted, banned, or worse.

commented: I couldn't find a good source about a ban but getting unbanned is a whole new world of pain. +0
commented: rprofitt and you have given me my answer and put my fears to rest. I was worried my crawler might list secret pages and I would find myself getting sued. +0

@rprofitt
@dani

Do not worry. When my SE gets popular, who would want to ban my crawler? Who wants to ban Googlebot?

A lot of people pay their hosting companies for bandwidth and CPU usage. If a particular search engine bot spends a lot of time crawling your site, but it doesn't send you a lot of visitors or customers, then it's common to ban it. People see it as a waste of bandwidth, especially in cases where the website only makes money if a visitor purchases their product, and it's not the type of site that typically gets search engine visitors turning into customers.

Another reason people ban search bots is when they don't agree with their principles. For example, many people don't like when Google shows an answer box at the top of its search results that directly answers the searcher's question, instead of simply linking to the webpage containing the information. They see it as Google stealing their content. They spend a lot of money on staff writers and research, and only make money when visitors see their ads, but Google (and other search engines) give searchers access to their valuable content without the benefit of showing the searchers their ads as well. This has been a growing concern in the SEO community as it relates to ChatGPT and AI utilizing publishers' content with no benefit to the publisher. All it does is cost the publisher bandwidth.

commented: Good thing you mentioned what people do not like about what ChatGPT and Google do with their content. I will avoid this and they will like my crawler. +0

And in case you never realized how big a problem it is: for every visitor that arrives at DaniWeb from Google, Googlebot crawls an average of 5 daniweb.com pages.

It’s actually quite common for very small sites, with cheap hosting, but very high quality content, to have their servers overloaded by legitimate search bots, and unable to serve content to actual website visitors.

Google has sophisticated algorithms in place nowadays to ensure that they don't overload servers, but nothing is perfect. That's why the Crawl Rate tools and the crawl-delay directive in robots.txt exist.
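If you do write your own bot, the simplest way to be polite is to space out requests to the same host. Here is a tiny sketch with a made-up politeFetch() helper; the one-second gap is an arbitrary example, not a recommended value.

```php
<?php
// Sketch of a polite per-host delay: remember when we last hit each host and
// sleep before hitting it again. The 1-second gap is an arbitrary example.
function politeFetch(string $url, array &$lastHitAt, float $minDelaySeconds = 1.0): ?string
{
    $host = parse_url($url, PHP_URL_HOST);

    if (isset($lastHitAt[$host])) {
        $wait = $minDelaySeconds - (microtime(true) - $lastHitAt[$host]);
        if ($wait > 0) {
            usleep((int) ($wait * 1000000)); // pause so we don't hammer the host
        }
    }
    $lastHitAt[$host] = microtime(true);

    $body = @file_get_contents($url);
    return ($body === false) ? null : $body;
}

// Usage: consecutive fetches to the same host get spaced out automatically.
$lastHitAt = [];
politeFetch('https://example.com/page1', $lastHitAt);
politeFetch('https://example.com/page2', $lastHitAt);
```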

commented: My bot will crawl your site, whose pages are found in your sitemap. After that, it will not re-visit unless you ping it with crawl updates. Issue solved! +0

@dani,

Check my two comments above on your feedback. I have solved all your worries.

How are you going to convince websites to ping your servers anytime there has been a change on their website?

My sitemap index file links to 67 sitemaps, each of which has ~100,000 URLs. If there have been some big improvements and changes to just a few of my pages, I don't want to ping you with a link to my sitemap file so that you will recrawl 5 million pages.
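For anyone who hasn't seen one, a sitemap index file in the sitemaps.org format looks roughly like this; the URLs are made up for illustration, not my actual files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sitemap index: each <sitemap> entry points to a child sitemap
     file that in turn lists the individual page URLs. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemaps/sitemap-1.xml</loc>
    <lastmod>2023-04-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemaps/sitemap-2.xml</loc>
    <lastmod>2023-04-02</lastmod>
  </sitemap>
</sitemapindex>
```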

@dani

By providing an UPDATE link where you just submit the 5 links that got updated.
Now, that was not too hard to come up with, was it? Very basic logic. Lol!

Why would I, as the website publisher, fill out a form on your search engine's website anytime I update a page?

Just genuinely curious: whatever became of this project?
