python :: BeautifulSoup Encoding Error

When a file is read with BeautifulSoup as follows, its encoding is detected automatically, here as utf-8.

In [1]: file_name = "d:\\sdd_word\\result\\xml\\class_c_c_i_status_manager.xml"

In [2]: f = open(file_name, 'r')

In [3]: contents = f.read()

In [4]: from BeautifulSoup import BeautifulSoup

In [5]: soup = BeautifulSoup(contents)

In [6]: soup.originalEncoding
Out[6]: u'utf-8'
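
As an alternative to relying on detection, the file can be read as bytes and decoded explicitly. The following is a minimal sketch, assuming Python 3 and only the standard library (the session above is Python 2.6); the helper name read_text and the sample file are hypothetical and stand in for the real XML file.

```python
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read raw bytes and decode them explicitly rather than relying on platform defaults."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)

# Hypothetical stand-in for the XML file; the Korean text means "server address variable".
sample = "<para>서버 주소 변수</para>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.xml")
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))
    contents = read_text(path)

print(ascii(contents))  # ascii() keeps the output printable on any console
```

Decoding up front means any mismatch between the assumed and the actual encoding surfaces when the file is read, not later when the result is printed.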

At this point, an attempt to run a search such as soup('memberdef') with the result printed to stdout produced the error below.

In [31]: soup.find('memberdef')
Out[31]: ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)

C:\Documents and Settings\70735\<ipython console> in <module>()

D:\Developer\Python_26\lib\site-packages\IPython\Prompts.pyc in __call__(self, arg)
    550
    551             # and now call a possibly user-defined print mechanism
--> 552             manipulated_val = self.display(arg)
    553
    554             # user display hooks can change the variable to be stored in



D:\Developer\Python_26\lib\site-packages\IPython\Prompts.pyc in _display(self, arg)
    576             return IPython.generics.result_display(arg)
    577         except TryNext:
--> 578             return self.shell.hooks.result_display(arg)
    579
    580     # Assign the default display method:


D:\Developer\Python_26\lib\site-packages\IPython\hooks.pyc in __call__(self, *args, **kw)
    139             #print "prio",prio,"cmd",cmd #dbg
    140             try:
--> 141                 ret = cmd(*args, **kw)
    142                 return ret
    143             except ipapi.TryNext, exc:

D:\Developer\Python_26\lib\site-packages\IPython\hooks.pyc in result_display(self, arg)
    169
    170     if self.rc.pprint:
--> 171         out = pformat(arg)
    172         if '\n' in out:
    173             # So that multi-line strings line up with the left column of



D:\Developer\Python_26\lib\pprint.pyc in pformat(self, object)
    109     def pformat(self, object):
    110         sio = _StringIO()
--> 111         self._format(object, sio, 0, 0, {}, 0)
    112         return sio.getvalue()
    113

D:\Developer\Python_26\lib\pprint.pyc in _format(self, object, stream, indent, allowance, context, level)
    127             self._readable = False
    128             return
--> 129         rep = self._repr(object, context, level - 1)
    130         typ = _type(object)
    131         sepLines = _len(rep) > (self._width - 1 - indent - allowance)

D:\Developer\Python_26\lib\pprint.pyc in _repr(self, object, context, level)
    221     def _repr(self, object, context, level):
    222         repr, readable, recursive = self.format(object, context.copy(),
--> 223                                                 self._depth, level)
    224         if not readable:
    225             self._readable = False

D:\Developer\Python_26\lib\pprint.pyc in format(self, object, context, maxlevels, level)
    233         and whether the object represents a recursive construct.
    234         """
--> 235         return _safe_repr(object, context, maxlevels, level)
    236
    237

D:\Developer\Python_26\lib\pprint.pyc in _safe_repr(object, context, maxlevels, level)
    318         return format % _commajoin(components), readable, recursive
    319
--> 320     rep = repr(object)
    321     return rep, (rep and not rep.startswith('<')), False
    322

UnicodeEncodeError: 'ascii' codec can't encode characters in position 305-306: ordinal not in range(128)

In [32]:
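
The last frame, rep = repr(object), fails inside the 'ascii' codec, which suggests that text containing non-ASCII characters is being implicitly encoded as ASCII for display. The same exception type can be reproduced without BeautifulSoup. This is a minimal sketch, assuming Python 3 and only the standard library (the session above is Python 2.6); the sample string and variable names are assumptions for illustration.

```python
# Sample non-ASCII text; the Korean means "server address variable".
text = "서버 주소 변수"

error = None
try:
    # The ascii codec covers only code points 0-127, so Hangul cannot be encoded.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The exception object records the codec name and the position of the offending characters, which corresponds to the "position 305-306" part of the message in the traceback.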

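The encoding error occurs while the list returned by soup('memberdef') is being printed to the screen.
If the result is not printed to stdout, as in lists = soup('memberdef'), the problem above does not occur.
Still, a way to inspect the contents of the list on screen is needed, and it can be done as follows.
Use renderContents(encoding name).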

In [39]: lists.renderContents('cp949')
Out[39]: '\n<type>struct sockaddr_in</type>\n<definition>struct sockaddr_in CMsgHandler::m_sServAddr</definition>\n<argsstring></argsstring>\n<name>m_sServAddr</name>\n<briefdescription>\n<para>\xbc\xad\xb9\xf6 \xc1\xd6\xbc\xd2 \xba\xaf\xbc\xf6 </para> </briefdescription>\n<detaileddescription>\n</detaileddescription>\n<inbodydescription>\n</inbodydescription>\n<location file="C:/JTDLS_IDMS2/MsgHandler/CMsgHandler.h" line="45" bodyfile="C:/JTDLS_IDMS2/MsgHandler/CMsgHandler.h" bodystart="45" bodyend="-1"></location>\n'
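
The idea behind passing 'cp949' is to choose the output encoding explicitly instead of letting the ascii default apply. The sketch below illustrates this with plain strings, assuming Python 3 and only the standard library; encode_for_console is a hypothetical helper, not part of BeautifulSoup, and the byte string is copied from the <para> element in Out[39]. Note also that renderContents appears to be a method of individual tags, so with a list result it would presumably be called on each element.

```python
import sys

def encode_for_console(text, encoding=None, errors="replace"):
    """Encode text explicitly, replacing unencodable characters with '?'."""
    # Fall back to UTF-8 when stdout declares no encoding (for example when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors)

# Bytes copied from the <para> element in Out[39]; assumed to be cp949-encoded.
raw = b"\xbc\xad\xb9\xf6 \xc1\xd6\xbc\xd2 \xba\xaf\xbc\xf6"
text = raw.decode("cp949")

as_cp949 = encode_for_console(text, "cp949")  # lossless for Korean text
as_ascii = encode_for_console(text, "ascii")  # lossy, but does not raise

print(as_cp949)
print(as_ascii)
```

Using errors="replace" trades fidelity for robustness: output to a console that cannot represent the characters degrades to question marks instead of raising UnicodeEncodeError.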