spongebob.fandom.com/wiki/Encyclopedia_SpongeBobia
To extract SpongeBob dialogue, I crawl the site that lists episode titles by season together with their transcripts.
import re
import pandas as pd
from urllib.request import urlopen
import glob
Extracting the data with pd.read_html
o_site = 'https://spongebob.fandom.com/wiki/List_of_transcripts'
season1 = pd.read_html(o_site,header=0)[0]
season1.columns # '#', 'Title', 'Transcript'
Collect the titles and transcript addresses for every season up to season 13.
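# Parse the page once; tables[0] .. tables[12] are assumed to be seasons 1-13.
tables = pd.read_html(o_site, header=0)
for i in range(13):
    total_site = []
    for title in tables[i].Title:
        # Transcript pages follow the pattern <wiki>/<Title_with_underscores>/transcript.
        temp = {'title': title,
                'addr': 'https://spongebob.fandom.com/wiki/{}/transcript'.format(re.sub(' ', '_', title))}
        total_site.append(temp)
    # Expose each season as a module-level DataFrame: season1 .. season13.
    globals()['season' + str(i + 1)] = pd.DataFrame(total_site)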
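The address built above only swaps spaces for underscores, so a title containing `?` would have the rest of the path read as a query string. A minimal sketch of a safer builder, assuming the wiki resolves percent-encoded titles; `transcript_url` and `BASE_URL` are hypothetical names, not part of the original code:

```python
from urllib.parse import quote

# Assumed base path, taken from the address pattern used in this post.
BASE_URL = "https://spongebob.fandom.com/wiki/"


def transcript_url(title: str) -> str:
    """Build a transcript address from an episode title (hypothetical helper)."""
    slug = title.strip().replace(" ", "_")
    # Percent-encode characters such as '?' that would otherwise end the path.
    # The characters listed in `safe` are left untouched.
    return BASE_URL + quote(slug, safe="!'(),") + "/transcript"


if __name__ == "__main__":
    # Sample titles for illustration only.
    for sample in ["Help Wanted", "Who? What?"]:
        print(transcript_url(sample))
```

It could replace the `'addr'` expression in the loop, for example `'addr': transcript_url(title)`.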
Extract each season's titles and transcripts, then save them as text files
from bs4 import BeautifulSoup
import requests
for n in range(13):
    season = globals()['season' + str(n + 1)]
    for title, addr in zip(season.title, season.addr):
        html = requests.get(addr).text
        soup = BeautifulSoup(html, 'html.parser')
        # Select the <ul> elements directly under the article body.
        text = []
        for ea in soup.select('.mw-parser-output > ul'):
            text.append(ea.text)
        # Drop '?' from the title, since it is not allowed in Windows file names.
        with open('{}.txt'.format(re.sub(r'\?', '', str(title).replace(' ', '_'))), 'w', encoding='utf-8') as f:
            for line in text:
                f.write(line)
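The file name above only removes `?`, but other characters such as `/` or `:` are also rejected by some file systems. A rough sketch of a broader clean-up, assuming the Windows reserved set is the one to avoid; `safe_filename` is a hypothetical helper, not part of the original code:

```python
import re

# Characters that Windows does not allow in file names.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title: str) -> str:
    """Turn an episode title into a .txt file name (hypothetical helper)."""
    name = str(title).strip().replace(" ", "_")
    # Remove every forbidden character instead of only '?'.
    name = _FORBIDDEN.sub("", name)
    return name + ".txt"


if __name__ == "__main__":
    # Sample titles for illustration only.
    for sample in ["Help Wanted", "Who? What?"]:
        print(safe_filename(sample))
```

In the save loop, `open(safe_filename(title), 'w', encoding='utf-8')` would then take the place of the inline `re.sub` call.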