Question

What’s the Most Mentioned Scanner on CNET Forum?

Nov 16, 2015 3:17PM PST

As a regular CNET visitor, I use its forum from time to time to solve hardware or software issues on my laptop and mobile. One day, this question just popped into my mind: what’s the most mentioned scanner on CNET forum? As I work at a document scanning software company, it’s not much of a surprise for me to have such a question, is it?

I couldn’t get this idea out of my head and started to think about a solution. To answer the question, it seemed that I would need to gather data from every single discussion thread in the CNET forum. But I soon found that unnecessary. Suppose I have an issue with my scanner and need to ask for a solution on CNET: it is only reasonable for me to go to one of these two sub-forums, “PC Hardware” or “Peripherals”, as shown below.

http://twainscanning.com/wp-content/uploads/2015/11/cnet-two-sub-forums.jpg

So instead of browsing all the sub-forums, it seems safe to check just these two. By doing so, I will probably still get more than 90% of the relevant data for further analysis, and cut about 90% of the time spent on data collection.

Link removed by moderator.

Post was last edited on November 16, 2015 3:23 PM PST

Discussion is locked

Answer
I never found it to be just one.
Nov 16, 2015 3:46PM PST
Answer
I started out by writing a script to scrape these two
Nov 19, 2015 7:16PM PST

I started out by writing a Python script to scrape these two sub-forums, until I ran into another problem: how can I identify a scanner model within a discussion thread? I guess some semantic analysis or semantic search (maybe NLTK?) could help here, but I don’t know much about either. There is an easier workaround though: regular expressions.

To make this work, ideally I need a regular expression that matches every scanner mentioned in the forum posts. There are probably two ways to construct one. The first is to collect all the scanner brand names on the market and combine each with a model name. The second is to look only for the keyword “scanner” + model name, or model name + “scanner”. The model name is mostly a combination of digits and letters, like C9900A. The first way (scanner brand name “Canon, Epson, etc.” + model) is thorough, and the second way (keyword “scanner” + model) is simpler.
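For comparison, here is a rough sketch of what the first way (brand + model) might look like in Python. This is only an illustration, not the script from the post: the BRANDS list holds just the two brands named above, and the MODEL pattern and the find_brand_models helper are assumptions made up for this example.

```python
import re

# Only the two brands named in the post; a real list would be much longer.
BRANDS = ["Canon", "Epson"]

# Assumed shape of a model token: optional letters, at least one digit,
# then any mix of letters, digits and hyphens (e.g. V600, C9900A, 120).
MODEL = r"[a-z]*[0-9][a-z0-9\-]*"

BRAND_MODEL = re.compile(
    r"\b(" + "|".join(re.escape(b) for b in BRANDS) + r")"
    r"\s+((?:[a-z]+\s+){0,2}" + MODEL + r")\b",
    re.IGNORECASE,
)

def find_brand_models(text):
    """Return (brand, model) pairs found in text (hypothetical helper)."""
    return [(m.group(1), m.group(2)) for m in BRAND_MODEL.finditer(text)]

print(find_brand_models("My Canon LiDE 120 died"))
print(find_brand_models("Epson V600 is fine"))
```

Allowing up to two plain words between the brand and the model lets product lines such as “LiDE” through, but it also lets in false positives like “Canon printer for 3 years”, which is part of why this way takes more work to get right.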

You bet, I chose the simple way. :)

The regular expression I ended up with is:

(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))

It is not as complex as it might look at first sight. It matches either of the following:

two words, then a model number (or “all-in-one”), then optionally one more word, then “scanner”
“scanner”, then one or two words, then a model number (or “all-in-one”)

(?:…) is used to create a non-capturing (unreferenced) group. By “scanner”, I mean one of the following: “scanner”, “document scanner”, “photo scanner”, etc.
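To try the expression out, something like the following Python sketch could be used. The regular expression is copied as-is from above (split across two string literals only for readability); the helper names find_scanner_mentions and count_mentions and the sample posts are made up for illustration, and counting the lower-cased full match is just one simple way to tally mentions.

```python
import re
from collections import Counter

# The expression from the post, unchanged; (?i) makes it case-insensitive.
SCANNER_RE = re.compile(
    r"(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))"
    r"|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))"
)

def find_scanner_mentions(text):
    """Return every full match of the expression in text (hypothetical helper)."""
    return [m.group(0) for m in SCANNER_RE.finditer(text)]

def count_mentions(posts):
    """Tally lower-cased matches across a list of post bodies (hypothetical helper)."""
    counts = Counter()
    for post in posts:
        counts.update(s.lower() for s in find_scanner_mentions(post))
    return counts

# Made-up sample posts, not real forum data.
sample_posts = [
    "My Epson Perfection V600 scanner stopped working.",
    "Is the epson perfection v600 scanner any good?",
]
print(count_mentions(sample_posts).most_common(5))
```

Note that the match includes the surrounding words, so the same model phrased differently ends up under different keys; some clean-up of the matches would still be needed before ranking models.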
