Palai

Programmer | Open-source enthusiast | Likes making friends

An intern is like a brick: they get moved wherever they're needed.

Background: The project I was responsible for hadn't had much activity lately, so after a few days of slacking off, my manager told me to pick up a small requirement.

Then a product manager contacted me with a web-crawling requirement: upload the content of certain emails to our company's platform.

Hmm, so that was the "small requirement", and it turned out to be a bit frustrating. Not because of any technical issue, but... (to find out what happened next, keep reading)

Me: Are there any restrictions on this email? Can I use the web version?
A: There are no specific requirements, so that works.
Me: Is there an existing API for uploading to the platform?
A: You can ask ** (the person in charge of the project's web crawling module), he knows.

After understanding the details, I started working on it.

I used my own 163 mailbox for testing. I'm not especially proficient in either Java or Python, so after some research I first settled on using Selenium to automate the browser: I just needed to drive it through a fixed set of buttons.
First step: Login
Using Selenium to simulate the login got me banned: NetEase's anti-crawling mechanism detected it because the actions were too fast.
Solution: add sleep(time) pauses between actions.
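
For reference, the login step looked roughly like the sketch below. It is only a sketch: the element locators (the login iframe, the input names, the button id) are placeholders that would have to be checked against the live page with dev tools, and the sleep() calls are the throttling mentioned above.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://mail.163.com")
time.sleep(3)  # let the page finish loading

# The 163 login form is usually rendered inside an iframe; switch into it first.
driver.switch_to.frame(driver.find_element(By.TAG_NAME, "iframe"))

driver.find_element(By.NAME, "email").send_keys("my_account")      # placeholder locator
time.sleep(1)  # pause between actions so it looks less like a bot
driver.find_element(By.NAME, "password").send_keys("my_password")  # placeholder locator
time.sleep(1)
driver.find_element(By.ID, "dologin").click()                      # placeholder locator
time.sleep(5)  # wait for the inbox to load

driver.switch_to.default_content()
```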
Second step: Identifying commonalities
Since this had to run as a batch job, the code needed to be general enough to handle every email the same way.
Implementation process: log in - click the unread-mail filter (mail that's already been read doesn't need to be crawled again) - open the first email - parse the content - go back - open the second email - parse the content - go back - and so on...
After parsing the content I needed to navigate back to the inbox, and at this point an issue I never managed to explain appeared: I tried many approaches, but none of them worked. (It was ridiculous.) But it was already Friday... hehe
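
For context, the loop I was trying to build looked roughly like this. The CSS selectors are placeholders, and the stale-element note in the comments is only my best guess at what such failures often come down to; I never pinned down the actual cause in my case.

```python
import time
from selenium.webdriver.common.by import By

# `driver` is the logged-in session from the login sketch above.
mail_count = len(driver.find_elements(By.CSS_SELECTOR, ".unread-mail-item"))  # placeholder selector

for i in range(mail_count):
    # Re-query the list on every iteration: element references collected
    # before driver.back() go stale once the inbox is re-rendered.
    items = driver.find_elements(By.CSS_SELECTOR, ".unread-mail-item")
    items[i].click()
    time.sleep(2)

    body = driver.find_element(By.CSS_SELECTOR, ".mail-body").text  # placeholder selector
    print(body[:100])

    driver.back()   # this is the step that kept failing for me
    time.sleep(2)
```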

After the weekend I ran into another problem with parsing: the crawled content came back empty.
I inspected the page with F12 and thought the front-end framework had changed (a false alarm); the content just needed to be re-parsed.
The back-navigation issue mentioned earlier was still unresolved, which was quite frustrating, so I switched to a second approach: directly grab the entire page's content and drive the batch operations with a loop. With that, I finally got the whole content parsed.
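
Roughly, the idea: take driver.page_source for the opened page in one go and parse everything from it offline, instead of hunting for individual elements on the live page. A minimal sketch of that step; BeautifulSoup is my own choice for the parsing, not something the task prescribed.

```python
from bs4 import BeautifulSoup

def parse_current_mail(driver):
    """Dump the whole rendered page and pull out its visible text in one go."""
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return soup.get_text(separator="\n", strip=True)

# Inside the batch loop this replaces the fragile per-element lookup:
#     items[i].click(); time.sleep(2); content = parse_current_mail(driver)
```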
That's how I completed this small requirement.

Me: Hello, does this meet your expectations?
A: No, I don't need the entire content, I only need specific links. (They then walked me through the actual process.)
Me: Oh, now I understand. (It really just came down to picking out specific report emails and handing their special links to the crawler module.) (I'm amazed. Couldn't they have shown me this from the beginning?) It has nothing to do with uploading at all.
Me: So, in what format should I provide the links to the web crawler? Text or something else?
A: You can ask the web crawler guy, he knows.
Me: Okay (^_^)

Finally, I stored the crawled links in a local Excel file in the format of Link: Timestamp.
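
Writing that file is straightforward with openpyxl; the column layout, file name, and example links below are just illustrative choices, not anything the requirement specified.

```python
from datetime import datetime
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["Link", "Timestamp"])  # header row

# `links` stands in for whatever was extracted from the parsed mail bodies.
links = ["https://example.com/report/1", "https://example.com/report/2"]
for link in links:
    ws.append([link, datetime.now().isoformat()])

wb.save("crawled_links.xlsx")
```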

Lesson learned: Always clarify the requirements, and if necessary, ask for a direct demonstration, because you never know if what the product manager says matches what you have in mind. (^_^)
